Guide to Reliability of Electrical/Electronic Equipment and Products--Manufacturing/Production Practices (part 2)


3. ENVIRONMENTAL STRESS SCREENING

Once a product has been released to production manufacturing, most of the component, design, and manufacturing issues should be resolved. Depending on market requirements and product and process maturity, environmental stress screening (ESS) can be used to quickly identify latent component, manufacturing, and workmanship issues that could later cause failure at the customer's site. Optimal ESS assumes that design defects and margin issues have been identified and corrected through implementation of accelerated stress testing at design.

Also called accelerated stress testing, ESS has been extensively used for product improvement, for product qualification, and for improving manufacturing yields for several decades. In the early 1980s, computer manufacturers performed various stress screens on their printed wiring assemblies. For example, in the 1981-1983 timeframe Apple Computer noticed that "a number of their boards would suddenly fail in the midst of manufacturing, slowing things down. It happened to other computer makers, too. In the industry it had been seen as an accepted bottleneck. It was the nature of the technology--certain boards have weak links buried in places almost impossible to find. Apple, working with outside equipment suppliers, designed a mass-production PWA burn-in system that would give the boards a quick, simulated three month test--a fast burn-in--the bad ones would surface" (3). The result was that this "burn-in system brought Apple a leap in product quality" (4).

Now a shift has taken place. As the electronics industry has matured, components have become more reliable. Thus, since the late 1980s and early 1990s the quality focus for electronic equipment has moved from individual components to the attachment of these components to the PCB. The focus of screening has changed as well, migrating to the PWA and product levels. The cause of failure today is now much more likely to be due to system failures, hardware-software interactions, workmanship and handling issues (mechanical defects and ESD, for example), and problems with other system components/modules such as connectors and power supplies. Stress screening is an efficient method of finding faults at the final assembly or product stage, using the ubiquitous stresses of temperature, vibration, and humidity, among others.


FIGURE 8 Environmental stress screening perspective. C/A, Corrective action; OEM, original equipment manufacturer; final ET QA, electrical test; EOS, electrical overstress; ESD, electrostatic discharge.


FIGURE 9 Accelerated stress testing uses substantially higher than normal specification limits.

The lowest cost of failure point has moved from the component level to board or PWA test, as shown in Figure 7 of Section 4. This is currently where screening can have the greatest benefit in driving product improvement. As was the case for components, increasing reliability mitigates the necessity for screening.

The decision process used to initiate or terminate screening should include technical and economic variables. Mature products will be less likely candidates for screening than new products using new technology. Given the decision to apply ESS to a new product and/or technology, as the product matures, ESS should be withdrawn, assuming a robust field data collection-failure analysis-corrective action process is in place. Figure 8 puts the entire issue of ESS into perspective by spanning the range of test from ICs to systems and showing the current use of each. Figure 1 of Ref. 5 shows a similar trend in reliability emphasis.

Much has been written in the technical literature over the past 10 years regarding accelerated stress testing of PWAs, modules, and power supplies. The philosophy of accelerated stress tests is best summarized by the following: The system reliability of a complex electronic product can be improved by screening out those units most vulnerable to environmental stress.

What Is Environmental Stress Screening?

Environmental stress screening applies stresses that are substantially higher than those experienced in normal product use and shipping (Fig. 9); product weaknesses and variations are better revealed at heightened stress levels. The technique may also include stresses that do not occur in normal use. The requirements on the stress stimuli are:

1. They are severe enough to precipitate and detect relevant latent defects, i.e., the family of defects that show up in normal use.

2. They don't damage good products, causing early wearout and reduced life in the field.

The Institute of Environmental Sciences and Technology (IEST) definition of environmental stress screening is as follows: "Environmental stress screening is a process which involves the application of a specific type of environmental stress, on an accelerated basis, but within design capability, in an attempt to surface latent or incipient hardware flaws which, if undetected, would in all likelihood manifest themselves in the operational or field environment." Simply stated, ESS is the application of environmental stimuli to precipitate latent defects into detectable failures.

The stress screening process thus prevents defective products from being shipped to the field and offers the opportunity to discover and correct product weaknesses early in a product's life.

Why Perform Environmental Stress Screening?

Manufacturing processes and yields improve with the application of ESS because it precipitates out intermittent or latent defects that may not be caught by other forms of testing. It thus prevents these defects from causing long-term reliability problems following shipment to the customer. It identifies workmanship, manufacturing, and handling problems such as solder joint defects, poor interconnections (sockets and connectors), marginal adhesive bonding, and material defects.

Thus, effective ESS:

1. Precipitates relevant defects at minimal cost and in minimal time

2. Initiates a closed-loop failure analysis and corrective action process for all defects found during screening

3. Increases field reliability by reducing infant mortalities or early life failures

4. Decreases the total cost of production, screening, maintenance, and warranty

==============

Table 5 Screening Environments Versus Typical PWA Failure Mechanisms Detected

Thermal cycling

Component parameter drift
PCB opens/shorts
Component incorrectly installed
Wrong component
Hermetic seal fracture
Chemical contamination
Defective harness termination
Poor bonding
Hairline cracks in parts
Out-of-tolerance parts

Vibration

Particle contamination
Chafed/pinched wires
Defective crystals
Adjacent boards rubbing
Components shorting
Loose wire
Poorly bonded part
Inadequately secured high-mass parts
Mechanical flaws
Part mounting problems
Improperly seated connectors

Thermal cycling and vibration

Defective solder joints
Loose hardware
Defective components
Fasteners
Broken part
Defective PCB etch

==============

Table 6 Combined Environment Profile Benefits and Accelerating Factors in Environmental Stress Screening

Combined environments are truer to actual use conditions.

Hot materials are looser/softer so vibration at high temperatures creates larger motions and displacements, accelerating material fatigue and interfacial separation.

Many materials (like rubber and plastics) are stiffer or brittle when cold; vibration stress at cold temperature can produce plasticity and brittle fracture cracks.

Six-axis degree of freedom vibration excites multiple structural resonant modes.

Overstress drives materials further along the S-N fatigue curve.

Thermal expansion/contraction relative movements during temperature cycling combined with relative movements from random vibration also accelerate material fatigue.

==============

Environmental Stress Screening Profiles

Environmental stress screening applies one or more types of stress including random vibration, temperature, temperature cycling, humidity, voltage stressing, and power cycling on an accelerated basis to expose latent defects. Table 5 provides a listing of the typical failure mechanisms accelerated and detected by application of thermal cycling, random vibration, and combined thermal cycling and random vibration environments. The most effective screening process uses a combination of environmental stresses. Rapid thermal cycling and triaxial six-degree of freedom (omniaxial) random vibration have been found to be effective screens.

Rapid thermal cycling subjects a product to fast and large temperature variations, applying an equal amount of stress to all areas of the product. Failure mechanisms such as component parameter drift, PCB opens and shorts, defective solder joints, defective components, hermetic seal failure, and improperly made crimps are precipitated by thermal cycling. It has been found that the hot-to-cold temperature excursion during temperature cycling is most effective in precipitating early life failures.

Random vibration addresses a different set of problems than temperature or voltage stressing and is focused more on manufacturing and workmanship defects. The shift to surface mount technology and the increasing use of contract manufacturers (CMs) make it important to monitor the manufacturing process. Industry experience shows that 20% more failures are detected when random vibration is added to thermal cycling. Random vibration should be performed before thermal cycling since this sequence has been found to be most effective in precipitating defects. Random vibration is also a good screen to see how the PWA withstands the normally encountered shipping and handling vibration stresses. Table 6 summarizes some benefits of a combined temperature cycling and six-axis (degree of freedom) random vibration ESS profile and lists ways in which the combined profile precipitates latent defects.


FIGURE 10 Environmental stress screening profile for high-end computer server. GPR (general product requirements) = operating environment limits.

Once detected, the cause of defects must be eliminated through a failure analysis to root cause-corrective action implementation-verification of improvement process. More about this later.

The results of highly accelerated life testing (HALT) (Section 3), which was conducted during the product design phase, are used to determine the ESS profile for a given product, which is applied as part of the normal manufacturing process. The ESS profile selected for a given product must be based on a practical, common-sense approach to the failures encountered, usage environment expected, and costs incurred. The proper application of ESS will ensure that the product can be purged of latent defects that testing to product specifications will miss.

Figure 10 shows an example of a HALT profile leading to an ESS profile compared with the product design specification general product requirements (GPR) for a high-end computer server. From this figure you can see that the screen provides accelerated stress compared to the specified product environment.
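The relationship among the three sets of limits can be sketched in a few lines of Python. This is only an illustration: the class name `TempLimits`, the function `screen_is_bracketed`, and all temperature values are hypothetical and are not taken from Figure 10. The sketch assumes that a usable screen should stress the product beyond the GPR operating limits while staying inside the limits explored during HALT.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TempLimits:
    """Lower and upper temperature limits in degrees Celsius."""
    low: float
    high: float


def screen_is_bracketed(gpr: TempLimits, ess: TempLimits, halt: TempLimits) -> bool:
    """Return True when the ESS limits exceed the GPR operating limits
    but stay inside the limits explored during HALT."""
    stresses_beyond_spec = ess.low < gpr.low and ess.high > gpr.high
    stays_inside_halt = ess.low > halt.low and ess.high < halt.high
    return stresses_beyond_spec and stays_inside_halt


# Hypothetical limits, chosen only to illustrate the check.
gpr = TempLimits(low=5.0, high=40.0)
halt = TempLimits(low=-40.0, high=100.0)
ess = TempLimits(low=-10.0, high=70.0)

print(screen_is_bracketed(gpr, ess, halt))
```

A screen that sits entirely inside the GPR window provides no acceleration, and one that reaches past the HALT limits risks damaging good product, so both cases return False.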

Recently compiled data show that at PWA functional test, 75% of defects detected are directly attributed to manufacturing workmanship issues. Only 14% of failures are caused by defective components or ESD handling issues, according to EIQC Benchmark.

Figure 11 shows how a selected ESS profile was changed for a specific CPU after evaluating the screening results. The dark curve represents the original ESS profile, and the portions or sections labeled A, B, and C identify various circuit sensitivities that were discovered during the iterative ESS process. For section A, the temperature sensitivity of the CPU chip was causing ESS failures that were judged unlikely to occur in the field. The screen temperature was lowered while an engineering solution could be implemented to correct the problem.

In section B the vibration level was reduced to avoid failing tall hand-inserted capacitors whose soldering was inconsistent. Section C represented a memory chip that was marginal at low temperature. Each of the problems uncovered in these revisions was worked through a root cause/physics of failure-corrective action process, and a less stressful ESS screen was used until the identified improvements were implemented.


FIGURE 11 Adjusting the ESS profile.

Environmental Stress Screening Results

The results achieved by careful and thoughtful implementation of ESS are best illustrated by means of several case studies. The first three case studies presented here are somewhat sparse in the description of the ESS process (because the implementation of ESS at these companies was in its infancy), yet they serve to illustrate how the implementation of ESS led to positive product improvements.

The two other case studies contain more detail and are written in a prose style because they come from a source where ESS has been a way of life for a long period of time.

Case Studies

Case Study 1: Computer Terminal. Relevant details were as follows:

1000 components per module.

Two years production maturity.

75 modules screened.

Temperature cycling and random vibration screens were used.

Results:

Initial screens were ineffective.

Final screens exposed defects in 10-20% of the product.

Failure spectrum:

Parts: 60% (mainly capacitors) Workmanship: 40% (solder defects)

Case Study 2: Power Supply. Relevant details were as follows:

29 manufacturing prototypes screened.

200 components per module.

Test, analyze, fix, and test (TAFT) approach was used to drive improvement.

Temperature, vibration, and humidity screens were used.

Results: Defects exposed in 55% of the product:

Workmanship: cold solder joints, bent heat sinks, and loose screws

Design issues: components selected and poor documentation

Case Study 3: Large Power Supply. Relevant details were as follows:

4500 components per power supply.

One unit consists of 75 modules.

Two years production maturity.

75 units were screened.

Temperature cycling was the screen of choice.

Results:

Initially, defects exposed in 25% of the product. This was reduced through the TAFT process to approximately 3%.

Concurrently, mean time between failures (MTBF) was improved from 40 hr at the beginning of screening to approximately 20,000 hr.

Improvements were achieved through minimal design changes, improved manufacturing processes, and upgrading components used and controls exercised.

Case Study 4: Complex Communications PWA. A CPU board just entering production was subjected to the ESS profile shown in the lower portion of Figure 12. The upper portion shows the type and number of failures and when and where they occurred in the ESS cycle.


FIGURE 12 Environmental stress screening failures by type and screen time.


FIGURE 13 Failure probability as a function of temperature.


FIGURE 14 Failure rates for two software versions.

The failure categories in Figure 12 are derived from console error messages generated by the computer's operating system. Each error (fault) is traced to the component level; the defective component is replaced; and the board is retested.

From Figure 12 we can see which applied stress tests result in what failures.

Note that the Comm Logic error occurs exclusively during the 60°C temperature dwell. Other error groupings suggest that the initial temperature ramp-down and the first vibration segment are also effective precipitators of ESS failures.

The Comm Logic failure was corrected by replacing the Comm Logic ASIC, and the board was then retested. Using the data from the initial 10 boards tested at 80°C and the other results, a probability distribution function (PDF) for this fault as a function of temperature was constructed, as shown in Figure 13.


FIGURE 15 Printed wiring assembly ESS yields.


FIGURE 16 Manufacturing yield by PWA type for 3Q97.


FIGURE 17 Yield data for CPU A (mature CPU PWA).

The results shown in Figure 13 were particularly disturbing since a small but significant probability of failure was predicted for temperatures in the normal operating region. The problem could be approached either by reworking the Comm Logic ASIC or by attempting a software workaround.

The final solution, which was successful, was a software revision. Figure 14 also shows the results obtained by upgrading system software from the old version to the corrected version. In Figure 14, the failures have been converted to rates to show direct comparison between the two software versions. Note that the Comm Logic error has been completely eliminated and that the remaining error rates have been significantly diminished. This result, completely unexpected (and a significant lesson learned), shows the interdependence of software and hardware in causing and correcting CPU errors.

Case Study 5: Family of CPU PWAs. Figure 15 is a bar chart showing ESS manufacturing yields tracked on a quarterly basis. Each bar is a composite yield for five CPU products. Note that the ESS yield is fairly constant. As process and component problems were solved, new problems emerged and were addressed. In this case, given the complexity of the products, 100% ESS was required for the entire life of each product.

Figure 16 shows a detailed breakout of the 3Q97 ESS results shown in the last bar of Figure 15, and adds pre-ESS yields for the five products in production that make up the 3Q97 bar. This chart shows the value of conducting ESS in production and the potential impact of loss in system test or the field if ESS were not conducted. Notice the high ESS yield of mature PWAs (numbers 1-3) but the low ESS yield of new boards (4 and 5), showing the benefit of ESS for new products. Also, note particularly that the post-ESS yields for both mature and immature products are equivalent, indicating that ESS is finding the latent defects.

Figure 16 also shows that the value of ESS must be constantly evaluated. At some point in time when yield is stable and high, it may make sense to discontinue its use for that PWA/product. Potential candidates for terminating ESS are PWA numbers 1-3.
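The "stable and high" criterion can be made concrete with a small sketch. The function names, the 98% threshold, and the four-quarter window below are assumptions chosen for illustration; the source does not state a numeric rule for ending 100% ESS.

```python
from typing import Sequence


def ess_yield(units_screened: int, units_failed: int) -> float:
    """Fraction of screened PWAs that passed ESS."""
    if units_screened <= 0:
        raise ValueError("units_screened must be positive")
    if not 0 <= units_failed <= units_screened:
        raise ValueError("units_failed must lie between 0 and units_screened")
    return (units_screened - units_failed) / units_screened


def candidate_to_stop_full_ess(
    quarterly_yields: Sequence[float],
    threshold: float = 0.98,  # hypothetical "high" yield level
    quarters: int = 4,        # hypothetical "stable" window
) -> bool:
    """True when the most recent quarters all meet the yield threshold."""
    recent = list(quarterly_yields)[-quarters:]
    return len(recent) == quarters and min(recent) >= threshold


# Hypothetical quarterly ESS yields for a mature and a new PWA.
mature_pwa = [0.985, 0.990, 0.988, 0.992]
new_pwa = [0.82, 0.86, 0.90, 0.91]

print(candidate_to_stop_full_ess(mature_pwa))
print(candidate_to_stop_full_ess(new_pwa))
```

A PWA flagged this way is only a candidate; the decision would still weigh field data and the economics of screening.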

Figures 17-20 show the results of ESS applied to another group of CPU PWAs expressed in terms of manufacturing yield. Figures 17 and 18 show the ESS yields of mature PWAs, while Figures 19 and 20 show the yields for new CPU designs. From these figures, it can be seen that there is room for improvement for all CPUs, but noticeably so for the new CPU designs of Figures 19 and 20. Figures 17 and 18 raise the question of what yield is good enough before we cease ESS on a 100% basis and go to lot testing, skip lot testing, or cease testing altogether. The data also show that ESS has the opportunity to provide real product improvement.


FIGURE 18 Yield data for CPU B (mature CPU PWA).


FIGURE 19 Yield data for CPU C (new CPU PWA design).


FIGURE 20 Yield data for CPU D (new CPU PWA design).

The data presented in Figures 17 through 20 are for complex high-end CPUs. In the past, technology development and implementation were driven primarily by high-end applications. Today, another shift is taking place; technology is being driven by the need for miniaturization, short product development times (6 months) and short product life cycles (approximately 18 months), fast time to market, and consumer applications. Products are becoming more complex and use complex ICs. Hardware complexity, software complexity, and their interactions are all increasing. All products will exhibit design- or process-induced faults. The question that all product manufacturers must answer is how many of these they will allow to reach the field. Given all of this, an effective manufacturing defect test strategy as well as end-of-line functional checks are virtually mandated.


FIGURE 21 The bathtub failure rate curve.


FIGURE 22 Mapping ESS and field failure distributions.


FIGURE 23 Early failure rate improvement achieved by using ESS.


FIGURE 24 Field failure distributions of different CPU products.

Environmental Stress Screening and the Bathtub Curve

The system reliability of a complex electronic product can be improved by screening out those units most vulnerable to various environmental stresses. The impact of environmental stress screening on product reliability is best seen by referring to the bathtub failure rate curve for electronic products reproduced in Figure 21.

The improved quality and reliability of ICs has reduced the infant mortality (early life) failure rate in PWAs over the past 10-15 years. Application of ESS during manufacturing can further reduce early life failures, as shown in the figure.
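The source does not name a distribution for the bathtub curve. One common modeling choice, assumed here purely for illustration, is the Weibull hazard rate h(t) = (shape/scale) * (t/scale)^(shape - 1): a shape below 1 gives the falling hazard of the infant mortality region, and a shape of 1 gives the constant hazard of the useful life region. The function name and parameter values below are hypothetical.

```python
def weibull_hazard(t_hours: float, shape: float, scale_hours: float) -> float:
    """Weibull hazard rate (failures per hour) at time t_hours."""
    if t_hours <= 0 or shape <= 0 or scale_hours <= 0:
        raise ValueError("t_hours, shape, and scale_hours must be positive")
    return (shape / scale_hours) * (t_hours / scale_hours) ** (shape - 1.0)


# Hypothetical parameters: shape < 1 models early life failures.
for t in (10.0, 100.0, 1000.0, 10000.0):
    early_life = weibull_hazard(t, shape=0.5, scale_hours=1000.0)
    useful_life = weibull_hazard(t, shape=1.0, scale_hours=1000.0)
    print(f"t={t:>8.0f} h  early-life h(t)={early_life:.6f}  useful-life h(t)={useful_life:.6f}")
```

In this picture, ESS consumes the steep early part of the falling curve in the factory, so units reach the customer already close to the flat region.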

The question that needs to be answered is how do failures from an ESS stress-to-failure distribution map onto the field failure time-to-failure distribution.

Figure 22 graphically depicts this question. Failures during the useful life (often called the steady state) region can be reduced by proper application of the ESS-failure analysis to root cause-implement corrective action-verify improvement (or test-analyze-fix-test) process. The stress-to-fail graph of Figure 22 indicates that about 10-15% of the total population of PWAs subjected to ESS fail. The impact of this on time to failure (hazard rate) is shown in the graph on the right of that figure.

Implementing an ESS manufacturing strategy reduces (improves) the infant mortality rate. Field failure data corroborating this for a complex computer server CPU are shown in Figure 23. Data were gathered for identical CPUs (same design revision), half of which were shipped to customers without ESS and half with ESS. The reason for this is that an ESS manufacturing process was implemented in the middle of the manufacturing life of the product, so it was relatively easy to obtain comparison data holding all factors except ESS constant. The top curve shows failure data for PWAs not receiving ESS, the bottom curve for PWAs receiving ESS.

Figure 24 shows field failure data for five additional CPU products. These data support the first two regions of the generic failure curve in Figure 21. Figure 24 shows the improvements from one generation to succeeding generations of a product family (going from CPU A to CPU E) in part replacement rate (or failure rate) as a direct result of a twofold test strategy: HALT is utilized during the product design phase and an ESS strategy is used in manufacturing with the resultant lessons learned being applied to improve the product.

One of the striking conclusions obtained from Figure 24 is that there is no apparent wearout in the useful lifetime of the products, based on 4 years of data.

Thus the model for field performance shown in Figure 22 does not include a positive slope section corresponding to the wearout region of the bathtub curve.


FIGURE 25 Probability distribution. Net present savings for ESS on a new CPU product (B in Fig. 24).


FIGURE 26 Probability distribution. Net present savings for ESS on a mature CPU product (E in Fig. 24).

Cost Effectiveness of Environmental Stress Screening

Environmental stress screening is effective for both state-of-the-art and mature PWAs. The "more bang for the buck" comes from the manufacturing yield improvement afforded by applying 100% ESS to state-of-the-art PWAs. However, it has value for mature PWAs as well. In both cases this translates to fewer system test problems and field issues.

The previous section showed the effectiveness of 100% ESS. Pre-ESS manufacturing yields compared with post-ESS yields show improvement of shippable PWA quality achieved by conducting 100% ESS. This is directly translated into lower product cost, positive customer goodwill, and customer product re-buys.

In all of the preceding discussions, it is clear that a key to success in the field of accelerated stress testing is the ability to make decisions in the face of large uncertainties. Recent applications of normative decision analysis in this field show great promise. Figures 25 and 26 show the probability of achieving positive net present savings (NPS) when applying ESS to a new CPU product and a mature CPU product, respectively.

The NPS is net present savings per PWA screened in the ESS operation.

It is the present value of all costs and all benefits of ESS for the useful lifetime of the PWA. A positive NPS is obtained when benefits exceed costs.

This approach to decision making for ESS shows that there is always a possibility that screening will not produce the desired outcome of fielding more reliable products, thus reducing costs and saving money for the company. Many variables must be included in the analysis, both technical and financial. From the two distributions shown in Figures 25 and 26, it is seen that there is a 20% chance we will have a negative NPS for CPU product B and an 80% chance we will have a negative NPS for CPU product E. The decision indicated is to continue ESS for CPU product B, and to stop ESS for CPU product E. The obvious next step is to analyze the decision of whether or not to perform ESS on a sample basis.
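The idea of an NPS probability distribution can be sketched with a simple Monte Carlo simulation. Everything numeric below is hypothetical: the cost ranges, the fraction of defects caught, the 10% discount rate, and the four-year benefit horizon are assumptions for illustration only and are not the inputs behind Figures 25 and 26. The sketch treats the screening cost as paid today and the avoided field-failure cost as spread evenly over the product's useful life.

```python
import random
from typing import List, Sequence


def present_value(cash_flows: Sequence[float], annual_rate: float) -> float:
    """Discount end-of-year cash flows (year 1, 2, ...) back to today."""
    return sum(cf / (1.0 + annual_rate) ** year
               for year, cf in enumerate(cash_flows, start=1))


def simulate_nps(n_trials: int, seed: int = 0) -> List[float]:
    """Draw hypothetical NPS-per-PWA samples: PV(benefits) minus screening cost."""
    rng = random.Random(seed)
    years = 4             # assumed useful life
    discount_rate = 0.10  # assumed annual discount rate
    samples = []
    for _ in range(n_trials):
        screen_cost = rng.uniform(20.0, 40.0)           # ESS cost per PWA, paid now
        defect_fraction = rng.uniform(0.02, 0.10)       # latent defects caught per PWA
        field_failure_cost = rng.uniform(300.0, 900.0)  # cost of one field failure
        yearly_benefit = defect_fraction * field_failure_cost / years
        benefits = present_value([yearly_benefit] * years, discount_rate)
        samples.append(benefits - screen_cost)
    return samples


def probability_negative(nps_samples: Sequence[float]) -> float:
    """Fraction of trials in which screening costs exceed its benefits."""
    return sum(1 for nps in nps_samples if nps < 0) / len(nps_samples)


samples = simulate_nps(10_000)
print(f"Chance of negative NPS: {probability_negative(samples):.1%}")
```

With real inputs, a low chance of negative NPS would support continuing ESS, and a high chance would support stopping it or moving to sample-based screening.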

Here we see the historical trend repeating itself: as products become more reliable, there is less need to perform ESS. Just as in the case for components, the economics of screening become less favorable as the product becomes more reliable. Emphasis then shifts to sampling or audit ESS.

3.1. Lessons Learned Through Testing

A small percentage (typically about 10%) of PWAs that are electrically tested and exposed to ESS exhibit anomalies or problem conditions, resulting in manufacturing yield loss. These situations, depending on severity and frequency of occurrence, are typically investigated through a troubleshooting process to determine the cause of the problem or failure. The troubleshooting process requires that effective and efficient diagnostics are available to isolate the problem, usually to one or more components. These components are then removed from the PWA and tested separately on automated testing equipment (ATE) to verify that the PWA anomaly was due to a specific component (in reality only a small percentage of removed components are tested due to time and cost constraints). The ATE testing of the removed component(s) often results in a finding that there is nothing wrong with them (no trouble found, or NTF); they meet all published specifications.

In the electronics equipment/product industry 40-50% of the anomalies discovered are no trouble found. Much time in manufacturing is spent troubleshooting the NTFs to arrive at a root cause for the problem. Often the components are given to sustaining engineering for evaluation in a development system versus the artificial environment of automated test equipment. Investigating the cause of no trouble found/no defect found (NDF)/no problem found (NPF) components is a painstaking, costly, and time-consuming process. How far this investigation is taken depends on the product complexity, the cost of the product, the market served by the product, and the amount of risk and ramifications of that risk the OEM is prepared to accept. Listed in the following sections are lessons learned from this closed-loop investigative anomaly verification-corrective action process.

Lesson 1: PWAs are complex structures.

The "fallout" from electrical tests and ESS is due to the complex interactions between the components themselves (such as parametric distribution variation leading to margin pileup when all components are interconnected on the PWA, for example), between the components and the PWA materials, and between the hardware and software.

Lesson 2: Determine the root cause of the problem and act on the results.

Performing electrical testing or ESS by itself has no value. It is what is done with the outcome or results of the testing that counts. Problems, anomalies, and failures discovered during testing must be investigated in terms of both short- and long-term risk. For the short term, a containment and/or screening strategy needs to be developed to ensure that the defective products don't get to the field/customer.

For the long term, a closed-loop corrective action process to preclude recurrence of the failures is critical to achieving lasting results. In either case the true root cause of the problem needs to be determined, usually through an intensive investigative process that includes failure analysis. This is all about risk evaluation, containment, and management. It is imperative that the containment strategies and corrective actions developed from problems or failures found during electrical testing and ESS are fed back to the Engineering and Manufacturing Departments and the component suppliers. To be effective in driving continuous improvement, the results of ESS must be

1. Fed back to Design Engineering to select a different supplier or improve a supplier's process or to make a design/layout change

2. Fed back to Design Engineering to select a different part if the problem was misapplication of a given part type or improper interfacing with other components on the PWA

3. Fed back to Design Engineering to modify the circuit design, e.g., by using a mezzanine card

4. Fed back to Manufacturing to make appropriate process changes, typically of a workmanship nature

Lesson 3: Troubleshooting takes time and requires a commitment of resources.

Resources required include skilled professionals along with the proper test and failure analysis tools. There is a great deal of difficulty in doing this because engineers would prefer to spend their time designing the latest and greatest product rather than support an existing production design. But the financial payback to a company can be huge in terms of reduced scrap and rework, increased revenues, and increased goodwill, customer satisfaction, and rebuys.

Lesson 4: Components are not the major cause of problems/anomalies.

Today the causes of product/equipment errors and problems are handling problems (mechanical damage and ESD); PCB attachment (solderability and workmanship) issues; misapplication/misuse of components (i.e., design application not compatible with component); connectors; power supplies; electrical overstress (EOS); system software "rev" versions; and system software-hardware interactions.

Lesson 5: The majority of problems are NTF/NDF/NPF.

In analyzing and separately testing individual components that were removed from many PWAs after the troubleshooting process, it was found that the removed components had no problems. In reality, though, there is no such thing as an NDF/NTF/NPF; the problem has just not been found, due either to insufficient time or resources being expended or to incomplete and inaccurate diagnostics. Subsequent evaluation in the product or system by sustaining engineering has revealed that the causes for NTF/NDF/NPF result from

1. Shortcuts in the design process resulting in lower operating margins and yields and high NTF. This is due to the pressures of fast time to market and time to revenue.

2. Test and test correlation issues, including low test coverage, failure to use the current (more effective) revision of the test program, and noncomprehensive PWA functional test software.

3. Incompatible test coverage of PWA to component.

4. Component lot-to-lot variation. Components may be manufactured at various process corners impacting parametric conditions such as bus hold, edge rate, and the like.

5. Design margining/tolerance stacking of components used due to parametric distributions (variation). For example, timing conditions can be violated when ICs with different parametric distributions are connected together; yet when tested individually, they operate properly within their specification limits.

6. Importance and variation of unspecified component parameters on system operation.

7. System/equipment software doesn't run the same way on each system when booted up. There are differences and variations in timing parameters such as refresh time. The system software causes different combinations of timing to occur and the conditions are not reproducible. Chip die shrinks, which speed up a component, tend to exacerbate timing issues/incompatibilities.

8. System noise and jitter; noisy power supplies; long cable interconnection lengths; microprocessor cycle slip (40% of microprocessor problems); and pattern/speed sensitivity.

9. Statistical or load-dependent failures-those failures that occur after the system (computer) has been run for a long time, such as cycle lurch and correctable memory errors.
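Item 5 in the list above (tolerance stacking) can be illustrated with a simple timing budget. The sketch below is only an illustration: the function name, the 100-MHz bus, and every delay value are hypothetical and are not taken from any data sheet or from the survey. It shows how two parts that each meet their own specification limits can still leave negative setup margin when paired.

```python
def setup_margin_ps(clock_period_ps, clk_to_out_ps, trace_delay_ps, setup_ps):
    """Slack on a synchronous path; a negative value means a setup violation."""
    return clock_period_ps - (clk_to_out_ps + trace_delay_ps + setup_ps)


# All values below are hypothetical, chosen only to illustrate the stack-up.
CLOCK_PERIOD_PS = 10_000      # assumed 100-MHz bus
TRACE_DELAY_PS = 1_000        # assumed board interconnect delay
RECEIVER_SETUP_PS = 3_000     # receiver at its (assumed) spec limit

# Driver from a fast lot: clock-to-output of 5,000 ps, well inside an
# assumed 6,500-ps spec limit.
fast_lot_margin = setup_margin_ps(CLOCK_PERIOD_PS, 5_000, TRACE_DELAY_PS,
                                  RECEIVER_SETUP_PS)

# Driver from a slow lot: clock-to-output right at the assumed 6,500-ps
# limit. The part still passes its own component test.
slow_lot_margin = setup_margin_ps(CLOCK_PERIOD_PS, 6_500, TRACE_DELAY_PS,
                                  RECEIVER_SETUP_PS)

print(fast_lot_margin)   # positive: the pairing works
print(slow_lot_margin)   # negative: the pairing fails in the system
```

With these assumed numbers, either driver tested alone on ATE would be reported as good, which is how such a pairing ends up logged as NTF.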

I decided to follow up the NDF/NTF/NPF issue further and surveyed my colleagues and peers from Celestica, Compaq Computer Corp., Hewlett-Packard, IBM, Lucent Technologies, SGI, and Sun Microsystems to find out what their experience had been with NDFs. The results are summarized in Table 7.

There are several proposed approaches in dealing with NDF/NTF/NPF.

These are presented in ascending order of application. I thank my colleagues at Celestica for providing these approaches.

=========

Table 7 Computer Industry Responses Regarding No Defect Found Issues

The single largest defect detractor (~40%) is the no defect/no trouble found issue.

Of 40 problem PWA instances, two were due to manufacturing process issues and the balance to design issues.

Lucent expended resources to eliminate NDFs as a root cause and has found the real causes of so-called NDFs to be lack of training, poor specifications, inadequate diagnostics, different equipment used that didn't meet the interface standards, and the like.

No defect found is often the case of a shared function. Two ICs together constitute an electrical function (such as a receiver chip and a transmitter chip). If a problem occurs, it is difficult to determine which IC is the one with the defect because the inner functions are not readily observable.

Removing one or more ICs from a PWA obliterates the nearest neighbor, shared function, or poor solder joint effects that really caused the problem.

Replace a component on a PWA and the board becomes operational. Replacing the component shifted the PWA's parameters so that it works. (But the PWA was not defective.)

The most important and effective place to perform troubleshooting is at the customer site since this is where the problem occurred. However, there is a conflict here because the field service technician's job is to get the customer's system up and running as fast as possible. Thus, the technician is always shot-gunning to resolve a field or system problem and removes two or three boards and/or replaces multiple components, resulting in false pulls. The field service technician, however, is the least trained to do any troubleshooting and often ends up with a trunk full of components. Better diagnostics are needed.

Software-hardware interactions may happen once in a blue moon.

In a manufacturing, system, or field environment we are always attempting to isolate a problem or anomaly to a thing: PWA, component, etc. In many instances the real culprit is the design environment and application (how a component is used, wrong pull-up or pull-down resistor values, or a PWA or component doesn't work in a box that is heated up for example). Inadequate design techniques are big issues.

Testing a suspected IC on ATE often finds nothing wrong with the IC because the ATE is an artificial environment that is dictated by the ATE architecture, strobe placement, and timing conditions. Most often the suspected IC then needs to be placed in a development system using comprehensive diagnostics that are run by sustaining development engineers to determine if the IC has a problem or not. This ties in with Lessons 3 and 7.

========

Strategies for No Defect Found Investigation

Test Three Times. In order to verify the accuracy of the test equipment and the interconnection between the test equipment and the unit under test (UUT), it is recommended that the UUT be tested three times. This involves removing the UUT from the test fixture or disconnecting the test equipment from the UUT and reinstalling/reconnecting the UUT to the test equipment three times. If the UUT fails consistently, the accuracy of the test equipment or the interconnection is not likely to be the cause of the UUT failure. If, however, after disconnect/reconnect the UUT does not fail again, there is a possibility that the test equipment may play a part in the failure of the UUT, requiring further investigation.

This method applies to both PWAs and individual components.
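The decision rule of the test-three-times method can be written down compactly. The following is a minimal sketch, assuming each retest is recorded as a boolean (True meaning the UUT failed on that attempt); the function name and the result labels are hypothetical and are not part of the Celestica procedure.

```python
CONSISTENT = "consistent failure: test equipment/interconnection unlikely to be the cause"
INTERMITTENT = "failure did not repeat: investigate test equipment/interconnection"


def classify_retests(failed_on_attempt):
    """Classify three disconnect/reconnect retests of one UUT.

    failed_on_attempt: sequence of three booleans, True = UUT failed.
    """
    if len(failed_on_attempt) != 3:
        raise ValueError("expected exactly three retest results")
    if all(failed_on_attempt):
        # The UUT fails every time, so the fixture and cabling are not
        # the likely cause; continue isolating the fault in the UUT.
        return CONSISTENT
    # At least one retest passed after reconnecting, so the test equipment
    # or its connection to the UUT may be contributing to the failure.
    return INTERMITTENT


print(classify_retests([True, True, True]))
print(classify_retests([True, False, True]))
```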

Remove/Replace Three Times. In order to verify the interconnection between the suspected failing component and the PWA, the suspected failing component should be disconnected from the PWA and reinstalled in the PWA three times (where it is technically feasible). If the suspected failing component does not fail consistently, this is a strong clue that the interconnection between the PWA and the suspected component contributes to the PWA failure and should be investigated further. If the failure does repeat consistently after removal and replacement three times, the suspected failing component should be used in a component swap, as described next.

Component Swap. Under the component swap technique, a suspected failing component is removed from the failing application and swapped into a similar, but passing application. Concurrently, the same component from the passing application is swapped into the failing application. If after the component swap, the failed PWA now passes and the passing PWA fails, that's a pretty strong clue that the component was the cause of the problem and failure analysis of the component can begin.

If after the swap, the failed PWA still fails and the passing PWA still passes, then the swapped component probably is not the cause of the problem and further failure investigation and fault isolation is required. If after the swap, both PWAs now pass, the cause probably has something to do with the interconnection between the component and the PWA (maybe a cold solder joint or fracture of a solder joint) and probably had nothing to do with the component. An NTF has been avoided.
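The three swap outcomes described above form a small decision table, sketched below. The function name and the returned labels are hypothetical. The text does not discuss the fourth possible outcome (both PWAs fail after the swap), so the sketch only flags it for further investigation rather than drawing a conclusion.

```python
def interpret_swap(failed_pwa_fails_after, passing_pwa_fails_after):
    """Interpret a component swap between a failing and a passing PWA.

    Both arguments describe the state AFTER the swap (True = PWA fails).
    """
    if not failed_pwa_fails_after and passing_pwa_fails_after:
        # The failure followed the component to the other board.
        return "component suspect: begin component failure analysis"
    if failed_pwa_fails_after and not passing_pwa_fails_after:
        # The failure stayed with the original board.
        return "component probably not the cause: continue fault isolation"
    if not failed_pwa_fails_after and not passing_pwa_fails_after:
        # Reseating alone cured the failure (e.g., cold or fractured joint).
        return "interconnection suspect: inspect solder joints"
    # Both boards fail: an outcome the procedure above does not cover.
    return "both fail: outcome not covered, investigate further"


print(interpret_swap(False, True))
print(interpret_swap(True, False))
print(interpret_swap(False, False))
```

The same table applies one level up, as in Example 1, when a suspect PWA is swapped between a failing and a passing system.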

The following examples illustrate the methodologies just discussed.

Example 1: A printed wiring assembly fails at system integration test. This failing PWA should be disconnected from the test hardware, reconnected, and tested three times. If the PWA continues to fail consistently, proceed to component swap. In this swap environment, the failed PWA from the failing system is swapped with a passing PWA from a passing system. If the failed system now passes and the passing system now fails, there's a good chance that something is wrong with the PWA. The PWA should be returned to the PWA manufacturer for further analysis. If after the swap, the failed system still fails and the passing system still passes, then the suspect PWA isn't likely to be the root cause of the problem. If after the swap, both systems now pass, there may be some interconnect issue between the PWA and the system that should be assessed.

Example 2: A PWA manufacturer is testing a PWA at functional test and it fails. The PWA should be removed from the test fixture, replaced, and retested three times. If the PWA failure persists, it is up to the failure analysis technician to isolate the failure cause to an electronic component. The suspected component should be removed and replaced (if technically feasible) and retested three times.

If the failure still persists, the failed component should be removed from the failing PWA and swapped with a passing component from a passing PWA. Again, if after the swap, the failed PWA now passes and the passing PWA fails, it's reasonable to conclude that the component was the root cause of the problem and should be returned to the component supplier for root cause analysis and corrective action. If, however, after the swap, the failed PWA still fails and the passing PWA still passes, then the swapped component probably is not the cause of the problem. If after the swap, both PWAs now pass, the root cause may have something to do with the interconnection between the component and the PWA (such as a fractured solder joint) and nothing to do with the component.

Lesson 6: Accurate diagnostics and improved process mapping tests are required for problem verification and resolution.

Let's take an example of a production PWA undergoing electrical test to illustrate the point. Diagnostics point to two or three possible components that are causing a PWA operating anomaly. This gives a 50% and 33% chance of finding the problem component, respectively, and that means that the other one or two components are NTF. Or it may not be a component issue at all, in which case the NTF is 100%. Take another example. Five PWAs are removed from a system to find the one problem PWA, leaving four PWAs, or 80%, as NTF in the best case. This ties in with Lesson 5.

The point being made is that diagnostics that isolate a 50% NTF rate are unacceptable. Diagnostics are typically not well defined and need to be improved because PWAs are complex assemblies that are populated with complex components.
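The arithmetic behind these percentages is simple enough to check directly. The sketch below assumes, as the examples do, that at most one of the pulled items is actually defective; the function name is hypothetical.

```python
from fractions import Fraction


def best_case_ntf_fraction(items_pulled, actually_defective=1):
    """Fraction of pulled items that will come back NTF.

    items_pulled: number of components or PWAs the diagnostics implicate.
    actually_defective: how many of them are really bad (0 if the problem
    is not in any of the pulled items at all).
    """
    if items_pulled <= 0:
        raise ValueError("at least one item must be pulled")
    if not 0 <= actually_defective <= items_pulled:
        raise ValueError("defective count must be between 0 and items_pulled")
    return Fraction(items_pulled - actually_defective, items_pulled)


# Two or three suspect components, one of which is really bad.
print(best_case_ntf_fraction(2))   # 1/2 of the pulls are NTF
print(best_case_ntf_fraction(3))   # 2/3 of the pulls are NTF
# Five PWAs pulled to find one bad PWA.
print(best_case_ntf_fraction(5))   # 4/5, i.e., 80% NTF
# Not a component issue at all: every pull is NTF.
print(best_case_ntf_fraction(3, actually_defective=0))
```

Note that the chance of finding the problem component (1/2 or 1/3) and the NTF fraction (1/2 or 2/3) are complements, so the NTF rate rises as diagnostics implicate more candidates.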

Lesson 7: Sustaining Engineering needs to have an active role in problem resolution.

Many of the anomalies encountered are traced to one or more potential problem components. The problem components need to be placed in the product/system to determine which, if any, component is truly bad. As such, sustaining engineering's active participation is required. This ties in with Lesson 3.
