4. SYSTEM TESTING
The topic of system testing begins with the definition of a system. One definition is "the final product that the customer expects to operate as a functioning unit; it has a product ID number." Another is "an item for which there is no next level of assembly."
Products or systems run the gamut from the simple to the complex: from digital watches, calculators, electronic games, and personal computers at one end of the spectrum to automobiles, nonstop mainframe computers, patient heart monitors, and deep space probes at the other. The effort required to test these products is directly related to the level of product complexity. Simple products are easier and less costly to test than complex products.
System testing contains elements of both design and manufacturing. System test is used to ensure that the interconnected and integrated assemblies, PWAs, modules, and power supplies work together and that the entire system/product functions in accordance with the requirements established by the global specification and the customer. For example, a Windows NT server system could consist of basic computer hardware, Windows NT server and clustering software, a Veritas volume manager, a RAID box, a SCSI adapter card, and communications controllers and drivers.
Many of the same issues incurred in IC and PWA test cascade to the system (finished or complete product) level as well. Since products begin and end at the system level, a clear understanding of the problems expected must emanate from the customer- and/or marketing-generated product description document. The system specifications must clearly state a test and diagnostic plan that anticipates difficult diagnostic issues (such as random failures that are known to occur); specify the expected defects and the specific tests for these defects; and outline measurement strategies for defect coverage and the effectiveness of these tests.
Some of the issues and concerns that cause system test problems and must be considered in developing an effective system test strategy include the following:
Defects arise from the PWA and lower levels of assembly as well as from system construction, and propagate upward.
Interconnections are a major concern. In microprocessor-based systems, after the data flow leaves the microprocessor it is often affected by glitches on the buses or defects due to other components.
Timing incompatibility between the various PWAs, assemblies, and modules can lead to violations of various timing states, such as bus contention and crosstalk, to name two.
Effectiveness of software diagnostic debug and integration.
Design for test (DFT) is specified at the top of the hierarchy but implemented at the bottom: the IC designer provides cures for board and system test, and the board designer provides cures for system test.
ICs, PWAs, assemblies, and modules have various levels of testability designed in (DFT) that may be sufficient at each of these levels.
However, their interconnectivity and effectiveness when taken together can cause both timing and testing nightmares. Also, the level of DFT implementation may range anywhere from the casual and careless to the comprehensive.
System failures and intermittent issues can result in shortened component (IC) life.
Computer systems are so complex that no one has figured out how to design out or test for all the potential timing and race conditions, unexpected interactions, and non-repeatable transient states that occur in the real world.
Interaction between hardware and software is a fundamental concern.
Table 8 System Test Defects
Connectors: opens, shorts, intermittents
Software integrity
Electrostatic discharge and electrical overstress
Errors in wiring
Timing errors due to subtle defects in timing paths
Leakage paths: solder slivers, foreign material
Marginal components
Improper or poor integrity connections
Incompatibility between subassemblies
Crosstalk
Cables pinched or cut
Jumper wires missing or in wrong configuration
Serial operation of components that individually are within specification but collectively cause failure
Integration of millions of gates (not concerned about individual components but the interaction of all components and their variation)
A list of some typical system test defects is presented in Table 8. This list does not contain those manufacturing and workmanship defects that occur during PWA manufacturing and are detected by PWA visual inspection, ICT, and functional test: solder issues, missing and reversed components, and various PCB issues (vias and plated through-holes, for example). Since system test is so varied and complex, there is no way to address and do justice to the myriad unique issues that arise and must be solved. System testing includes not only electrical testing but also run-in test, also called system burn-in (e.g., 72-hr run-in at 25°C, or 48-hr at 50°C), which checks for product stability and exercises the product with the full diagnostic test set. The same lessons learned from PWA testing apply to system test as well. Once a system/product successfully completes system test, it can be shipped to the customer.
5. FIELD DATA COLLECTION AND ANALYSIS
System problems come from many sources, including environmental, application, individual component, and integration issues. The environmental causes of failure in electronic systems include temperature, vibration, humidity, and dust, with more than half of the failures attributed to thermal effects (Figure 27). Surface mount assemblies are particularly sensitive to thermal conditions because temperature cycling and power on/off cycles induce cyclic strains and stresses in the solder interconnects. In general, vibration is less damaging to solder joints because high-frequency mechanical cycling does not allow enough time for creep strains to develop in the joints.
Component failures are a concern and were briefly discussed in Section 4.
Then there is the issue of application-related problems that must be considered. Some examples include software-hardware integration and interaction with power supplies; disk drives and other equipment; the timing issues that occur when interconnecting multiple ICs with different setup and hold conditions; and the interaction of the different system PWAs with one another.
The best understanding of customer issues and thus product reliability comes from gathering and analyzing product field data. An analysis of field reliability data indicates:
The accuracy of the predictions made and the effectiveness of the reliability tests conducted
Whether established goals are being met
Whether product reliability is improving
In addition to measuring product reliability, field data are used to determine the effectiveness of the design and manufacturing processes, to correct problems in existing products, and to feed back corrective action into the design process for new products. Regular reports are issued disseminating both good and bad news from the field. The results are used to drive corrective actions in design, manufacturing, component procurement, and supplier performance.
A comprehensive data storage system is used to determine accurate operation times for each field replaceable unit (FRU) and allows identification of a field problem that results in a unit being replaced, as well as the diagnosis and repair actions on the unit that caused the problem. Since the installation and removal dates are recorded for each unit, the analysis can be based on actual run times, instead of estimates based on ship and return dates.
At the Tandem Division of Compaq Computer Corp. data are extracted from three interlinked databases (Figure 28). Each individual FRU is tracked from cradle to grave, i.e., from its original ship date through installation in the field and, if removed from a system in the field, through removal date and repair.
From these data, average removal rates, failure rates, and MTBF are computed.
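As an illustration of how cradle-to-grave tracking supports these computations, the following Python sketch derives actual run hours and a simple MTBF from hypothetical install/removal records. The record layout, the dates, and the treatment of every removal as a failure are assumptions for illustration, not the actual Tandem database schema:

```python
from datetime import datetime

# Hypothetical FRU records: (install date, removal date or None if still in service).
fru_records = [
    ("2023-01-01", "2023-06-01"),   # removed from the field (counted as a failure here)
    ("2023-01-15", None),           # still installed
    ("2023-02-01", None),           # still installed
]

ANALYSIS_DATE = datetime(2024, 1, 1)  # date the analysis is run

def run_hours(install, removal):
    """Actual run time from recorded install/removal dates, not ship/return estimates."""
    start = datetime.strptime(install, "%Y-%m-%d")
    end = datetime.strptime(removal, "%Y-%m-%d") if removal else ANALYSIS_DATE
    return (end - start).total_seconds() / 3600.0

# Total run hours accumulate over survivors and removals alike; MTBF divides
# total operating hours by the number of failures observed.
total_hours = sum(run_hours(i, r) for i, r in fru_records)
failures = sum(1 for _, r in fru_records if r is not None)
mtbf = total_hours / failures if failures else float("inf")
print(f"total run hours = {total_hours:.0f}, failures = {failures}, MTBF = {mtbf:.0f} hr")
```

Because installation and removal dates are recorded per unit, survivors contribute their full accumulated hours, which is what distinguishes this from estimates based on ship and return dates.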
Individual time-to-fail data are extracted and plotted as multiply processed data on a Weibull hazard plot to expose symptoms of wearout.
Using the data collection system shown in Figure 28, a typical part replacement rate (PRR) is computed by combining installation data from the Installed Systems database and field removal data from the Field Service Actions database.
The data are plotted to show actual field reliability performance as a function of time versus the design goal. If the product does not meet the goal, a root cause analysis process is initiated and appropriate corrective action is implemented. An example of the PRR for a disk controller is plotted versus time in Figure 29 using a 3-month rolling average.
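The 3-month rolling average used in such plots can be sketched as follows. The monthly installed-base and removal counts are invented, and expressing PRR as removals per 100 installed units per month is an assumption for illustration, not necessarily the source's exact definition:

```python
# Illustrative monthly data for one FRU type.
installed = [400, 450, 500, 520, 540, 560]   # installed base in each month
removals  = [  8,   7,  12,   9,   6,   5]   # field removals in each month

# PRR here: removals per 100 installed units per month (assumed convention).
monthly_prr = [100.0 * r / n for r, n in zip(removals, installed)]

def rolling_mean(xs, window=3):
    # Trailing average over up to `window` samples; shorter at the series start.
    out = []
    for i in range(len(xs)):
        chunk = xs[max(0, i - window + 1): i + 1]
        out.append(sum(chunk) / len(chunk))
    return out

smoothed = rolling_mean(monthly_prr)
for month, (raw, avg) in enumerate(zip(monthly_prr, smoothed), start=1):
    print(f"month {month}: PRR = {raw:.2f}, 3-month rolling = {avg:.2f}")
```

The smoothed series is what gets compared against the design goal; a sustained excursion above the goal would trigger the root cause analysis process.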
Figure 30 shows a 3-month rolling average part replacement rate for a product that exhibited several failure mechanisms that had not been anticipated during the preproduction phase. Corrective action included re-specification of a critical component, PWA layout improvement, and firmware updates. The new revision was tracked separately and the difference was dramatically demonstrated by the resultant field performance data. The old version continued to exhibit an unsatisfactory failure rate, but the new version was immediately seen to be more reliable and quickly settled down to its steady state value, where it remained until the end of its life.
Power supplies are particularly troublesome modules. As such, much data are gathered regarding their field performance. Figure 31 plots the quantity of a given power supply installed per month in a fielded mainframe computer (a), units removed per month from the field (b), and run hours in the field (c). The PRR and MTBF for the same power supply are plotted in Figures 32a and b, respectively.
Certain products such as power supplies, disk drives, and fans exhibit known wearout failure mechanisms. Weibull hazard analysis is performed on a regular basis for these types of products to detect signs of premature wearout.
To perform a Weibull analysis, run times must be known for all survivors as well as for removals. A spreadsheet macro can be used to compute the hazard rates and plot a log-log graph of cumulative hazard rate against run time. Figure 33 is a Weibull hazard plot of a particular disk drive, showing significant premature wearout. This disk drive began to show signs of wearout after 1 year (8760 hr) in the field, with the trend being obvious at about 20,000 hr. Field tracking confirmed the necessity for action and then verified that the corrective action implemented was effective.
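The spreadsheet computation described above can be sketched in Python using the cumulative-hazard (Nelson) method. The run times below are invented, and a least-squares fit of log cumulative hazard against log run time stands in for reading the slope off the log-log plot:

```python
import math

# Illustrative run times (hours): removals (failures) and units still in service.
failure_times  = [9000, 15000, 21000, 26000, 30000]
survivor_times = [12000, 18000, 25000, 28000, 32000]

# Pool all units, sorted by run time, flagging failures.
units = sorted([(t, True) for t in failure_times] +
               [(t, False) for t in survivor_times])

# Nelson hazard: at each failure, hazard = 1 / (units still at risk);
# survivors reduce the at-risk count but add no hazard increment.
n = len(units)
H = 0.0
cum_hazard = []
for i, (t, failed) in enumerate(units):
    if failed:
        H += 1.0 / (n - i)
        cum_hazard.append((t, H))

# Slope of log(H) vs log(t) estimates the Weibull shape parameter beta:
# beta > 1 suggests wearout; beta < 1 a decreasing (infant-mortality) rate.
xs = [math.log(t) for t, _ in cum_hazard]
ys = [math.log(h) for _, h in cum_hazard]
m = len(xs)
xbar, ybar = sum(xs) / m, sum(ys) / m
beta = (sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys)) /
        sum((x - xbar) ** 2 for x in xs))
print(f"estimated Weibull shape beta = {beta:.2f}")
```

With these invented data the fitted slope comes out above 1, which on the hazard plot of Figure 33 would read as premature wearout; the slope slightly below 1 in Figure 34 reads the opposite way.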
Figure 34 is the Weibull hazard plot for the example power supply of Figures 31 and 32. This plot has a slope slightly less than 1, indicating a constant to decreasing failure rate for the power supply. Much effort was expended over an 18-month period in understanding the root cause of myriad problems with this supply and implementing appropriate corrective actions.
(From Ref. 1. Courtesy of the Tandem Division of Compaq Computer Corporation Reliability Engineering Department.)
6. FAILURE ANALYSIS
Failures, which are an inherent part of the electronics industry as a result of rapidly growing IC and PWA complexity and fast time to market, can have a severe financial impact. Consequently, failures must be understood and corrective actions taken quickly. Figure 35 shows some of the constituent elements of the last block of Figure 1 from Section 1 (the customer) as it relates to problems experienced with a product/system. In the field or during system test, the customer support engineer/field service engineer or the manufacturing/hardware engineer, respectively, wastes no time in getting a system back online by replacing a PWA, power supply, or disk drive, for example. So the real investigative work of problem resolution typically starts at the PWA level. In some cases, because of the complexity of the ICs, the circuit designers themselves must perform failure analysis to identify design and manufacturing problems.
Section 2.5 on lessons learned initiated the discussion of the investigative troubleshooting problem resolution process. The point was made that this process is all about risk evaluation, containment, and management. Problems that surface during development, manufacturing, and ESS as well as in the field need to be analyzed to determine the appropriate corrective action with this in mind. A metaphor for this process is a homicide crime scene. The police detective at the crime scene takes ownership and follows the case from the crime scene to interviewing witnesses to the autopsy to the various labs to court. This is much like what occurs in resolving a product anomaly. Another metaphor is that of an onion.
The entire problem resolution process is like taking the layers off of an onion, working from the outside layer, or highest level, to the core, or root cause. This is depicted in Table 9.
If the PWA problem is traced to a component, then it first must be determined whether the component is defective. If the component is defective, a failure analysis is conducted. Failure analysis is the investigation of components that fail during manufacturing, testing, or field use to determine the root cause of the failure. This is an important point because corrective action cannot be effective without information as to the root cause of the failure. Failure analysis locates specific failure sites and determines whether the failure was caused by a design mistake, a process problem, a material defect, or some type of induced damage (such as operating the device outside the maximum specification limits or other misapplication). A shortcoming of component-level failure analysis is that the process does not provide any information regarding adjacent components on the PWA.
A mandatory part of the failure analysis process is the prompt (timely) feedback of the analysis results to determine an appropriate course of action.
Depending on the outcome of the failure analysis, the supplier, the OEM, the CM, or all three working together must implement a course of containment and corrective action to prevent recurrence. Proper attention to this process results in improved manufacturing yields, fewer field problems, and increased reliability.
Failure analysis involves more than simply opening the package and looking inside. Failure mechanisms are complex and varied, so it is necessary to perform a logical sequence of operations and examinations to discover the origin of the problem. The failure analysis of an IC is accomplished by combining a series of electrical and physical steps aimed at localizing and identifying the ultimate cause of failure (see Fig. 36). The process of Figure 36 is shown in serial form for simplicity. However, due to the widely varying nature of components, failures, and defect mechanisms, a typical analysis involves many iterative loops between the steps shown. Identifying the failure mechanism requires an understanding of IC manufacturing and analysis techniques and a sound knowledge of the technology, physics, and chemistry of the devices plus a knowledge of the working conditions during use. Let's look at each of the steps of Figure 36 in greater detail.
Table 9 Layers of the IC Problem Resolution Process
Failure mode: What happened? How does the problem manifest itself? Diagnostic or test result (customer site and IC supplier incoming test). How is the problem isolated to a specific IC (customer site)? What is the measurement or characterization of the problem? Example: data line error; leakage on pin 31.
Failure mechanism: Physical defect or nonconformity of the IC. What is the actual anomaly that correlates to the failure mode? Example: oxide rupture.
Failure cause: Explanation of the direct origin or source of the defect. How was the defect created? What event promoted or enabled this defect? Example: ESD.
Root cause: Description of the initial circumstances that can be attached to this problem. Why did this problem happen? Example: improper wrist strap use.
FIGURE 36 Simplified diagram of IC failure analysis process.
The first and most critical step in the failure analysis process is fault localization.
Without knowing where to look on a complex VLSI component, the odds against locating and identifying a defect mechanism are astronomical. The problem is like the familiar needle in the haystack.
Because of the size and complexity of modern VLSI and ULSI components, along with the nano-metric size of defects, it is imperative to accurately localize faults prior to any destructive analysis. Defects can be localized to the nearest logic block or circuit net or directly to the physical location of the responsible defect. There are two primary methods of fault localization: hardware-based diagnostics using physical parameters like light, heat, or electron-beam radiation, and software-based diagnostics using simulation and electrical tester (ATE) data.
Hardware diagnostic techniques are classified in two broad categories. The first is the direct observation of a physical phenomenon associated with the defect and its effects on the chip's operation. The second is the measurement of the chip's response to an outside physical stimulus, which correlates to the instantaneous location of that stimulus at the time of response. While a fault can sometimes be isolated directly to the defect site, there are two primary limitations of hardware diagnostics.
The first is that the techniques are defect dependent. Not all defects emit light or cause localized heating. Some are not light sensitive nor will they cause a signal change that can be imaged with an electron beam. As such it is often necessary to apply a series of techniques, not knowing ahead of time what the defect mechanism is. Because of this, it can often take considerable time to localize a defect.
The second and most serious limitation is the necessity for access to the chip's transistors and internal wiring. In every case, the appropriate detection equipment or stimulation beam must be able to view or irradiate the site of interest, respectively. With the increasing number of metal interconnect layers and the corresponding dielectric layers and the use of flip chip packaging, the only way to get to the individual transistor is through the backside of the die.
Software diagnostics are techniques that rely on the combination of fault simulation results and chip design data to determine probable fault locations.
While it is possible to do this by manually analyzing failure patterns, it is impractical for ICs of even moderate complexity. Software diagnostics are generally categorized in two groups that both involve simulation of faults and test results: precalculated fault dictionaries and posttest fault simulation.
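A precalculated fault dictionary can be sketched as a lookup from failing-pattern signatures, gathered once by fault simulation, to candidate fault locations. The signatures, net names, and overlap-based fallback below are invented for illustration:

```python
# Each modeled fault is simulated once ahead of time, and the failing-pattern
# signature it would produce on the tester is stored. A signature here is a
# frozenset of (test vector index, failing output pin) pairs -- an assumed
# encoding, not a standard one.
fault_dictionary = {
    frozenset({(3, "Q1"), (7, "Q1")}): "net_A stuck-at-0",
    frozenset({(2, "Q2")}):            "net_B stuck-at-1",
    frozenset({(3, "Q1"), (5, "Q3")}): "bridge net_A/net_C",
}

def diagnose(observed):
    """Return candidate faults whose precalculated signature matches the ATE log.

    Falls back to ranking by signature overlap when no exact match exists;
    posttest fault simulation would be used to refine such cases.
    """
    if observed in fault_dictionary:
        return [fault_dictionary[observed]]
    scored = sorted(fault_dictionary.items(),
                    key=lambda kv: len(kv[0] & observed), reverse=True)
    return [name for sig, name in scored if sig & observed]

print(diagnose(frozenset({(3, "Q1"), (7, "Q1")})))  # exact dictionary hit
```

The dictionary approach trades a large up-front simulation cost for fast diagnosis at analysis time; posttest fault simulation instead simulates candidate faults on demand against the observed tester results.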
Once the fault has been localized as accurately as possible, the sample must be prepared for further characterization and inspection. At this stage the chip usually must first be removed from its package. Depending on the accuracy of fault localization and the nature of the failure, multiple levels of the interlevel insulating films and metal wiring may need to be sequentially inspected and removed. The process continues until the defect is electrically and physically isolated to the point where it is best identified and characterized.
To a great extent de-processing is a reversal of the manufacturing process; films are removed in reverse order of application. Many of the same chemicals and processes used in manufacturing to define shapes and structures are also used in the failure analysis laboratory, such as mechanical polishing, plasma or dry etching, and wet chemical etching.
Again, depending on the accuracy of fault localization and the nature of the failure, a second localization step or characterization of the defect may be necessary.
At this point the defect may be localized to a circuit block like a NAND gate, latch, or memory cell. By characterizing the effects of the defect on the circuit's performance it may be possible to further pinpoint its location. Because the subsequent steps are irreversible it is important to gather as much information as possible about the defect and its location before proceeding with the failure analysis.
A number of tools and techniques exist to facilitate defect localization and characterization. Both optical source and micrometer-driven positioners with ultrafine probes (with tips having diameters of approximately 0.2 µm) are used to inject and measure signals on conductors of interest. High-resolution optical microscopes with long working-distance objectives are required to observe and position the probes. Signals can be DC or AC. Measurement resolution of tens of millivolts or pico-amperes is often required. Because of shrinking linewidths it has become necessary to use a focused ion beam (FIB) tool to create localized probe pads on the nodes of interest. A scanning probe microscope (SPM) may be used to measure the effects of the defect on electrostatic force, atomic force, or capacitance. A number of other techniques are used for fault localization based on the specific situation and need. These are based on the use of light, heat, or electron-beam radiation.
Inspection and Defect Characterization
After exhausting all appropriate means to localize and characterize a fault, the sample is inspected for a physical defect. Once identified the physical defect must often be characterized for its material properties to provide the IC manufacturing line with enough information to determine its source.
Depending on the accuracy of localization, the fail site is inspected using one of three widely used techniques: optical, scanning-electron, or scanning probe microscopy. Optical microscopy scans for anomalies on relatively long wires or large circuit blocks (latches, SRAM cells, etc.). While relatively inadequate for high-magnification imaging, optical microscopy is superior for its ability to simultaneously image numerous vertical levels. Nanometer-scale resolution is attained with scanning electron microscopy (SEM). In addition to its high magnification capabilities, SEM can evaluate material properties such as atomic weight and chemical content. However, it is limited to surface imaging, requiring de-layering of films between inspection steps. For defects localized to extremely small areas (individual transistors, dynamic memory cell capacitors, etc.) a scanning probe microscope (SPM) can be used. This technique offers atomic-scale resolution and can characterize electrostatic potential, capacitance, or atomic force across small areas.
When these techniques cannot determine the material composition of the defect or are unable to locate a defect altogether, a suite of chemical and material analysis tools is utilized, e.g., transmission electron microscopy (TEM), Auger electron spectroscopy (AES), and electron spectroscopy for chemical analysis (ESCA), to name several.
Reiterating, once the defect and its cause are identified, the information is fed back up the entire electronics design/manufacture chain to allow all members to identify risk and determine containment and a workaround solution (often implementing some sort of screening) until permanent corrective action is instituted. The need for this feedback and open communication cannot be stressed enough.
Integrated circuit failure analysis is a time-consuming and costly process that is not applicable or required for all products or markets. For lower cost consumer and/or disposable products or for personal computers, failure analysis is not appropriate. For mainframe computers, military/aerospace, and automotive applications, failure analysis of problem or failing components is mandatory.
Portions of Section 1.2 excerpted from Ref. 1.
Much of the material for Section 2.4 comes from Refs. 2, 3 and 6.
Portions of Section 3 excerpted from Ref. 7, courtesy of the Tandem Division of the Compaq Computer Corporation Reliability Engineering Department.
Portions of Section 3.1 excerpted from Ref. 8.
1. Hnatek ER, Russeau JB. PWA contract manufacturer selection and qualification, or the care and feeding of contract manufacturers. Proceedings of the Military/Aerospace COTs Conference, Albuquerque, NM, 1998, pp 61-77.
2. Hnatek ER, Kyser EL. Straight facts about accelerated stress testing (HALT and ESS)-lessons learned. Proceedings of The Institute of Environmental Sciences and Technology, 1998, pp 275-282.
3. Hnatek ER, Kyser EL. Practical Lessons Learned from Overstress Testing-a Historical Perspective, EEP Vol. 26-2: Advances in Electronic Packaging-1999. ASME 1999, pp 1173-1180.
4. Magaziner I, Patinkin M. The Silent War, Random House Publishers, 1989.
5. Lalli V. Space-system reliability: a historical perspective. IEEE Trans Reliability 47(3), 1998.
6. Roettgering M, Kyser E. A Decision Process for Accelerated Stress Testing, EEP Vol. 26-2: Advances in Electronic Packaging-1999, Vol. 2. ASME, 1999, pp 1213- 1219.
7. Elerath JG et al. Reliability management and engineering in a commercial computer environment. Proceedings of the International Symposium on Product Quality and Integrity, pp 323-329.
8. Vallet D. An overview of CMOS VLSI/failure analysis and the importance of test and diagnostics. International Test Conference, ITC Lecture Series II, October 22, 1996.