1. WHAT IS RELIABILITY?
To set the stage, this guide deals with the topic of electronic product hardware reliability. Electronic products consist of individual components (such as integrated circuits, resistors, capacitors, transistors, diodes, crystals, and connectors) assembled on a printed circuit board; third party-provided hardware such as disk drives, power supplies, and various printed circuit card assemblies; and various mechanical fixtures, robotics, shielding, cables, etc., all integrated into an enclosure or case of some sort.
The term reliability is ambiguous in the general sense, yet very exacting in practice once consideration is given to the techniques and methods used to ensure the production of reliable products.
Reliability varies with the intended application, the product category, the product price, customer expectations, and the level of discomfort or repercussion caused by product malfunction. For example, products destined for consumer use have different reliability requirements and associated risk levels than do products destined for use in industrial, automotive, telecommunication, medical, military, or space applications.
Customer expectations and threshold of pain are important as well. What do I mean by this? Customers have an expectation and a threshold of pain for the product they purchase, based on the price paid and the type of product. The designed-in reliability level should be just sufficient to meet that expectation and threshold of pain. Thus, reliability and customer expectations are closely tied to price. For example, if a four- to five-function electronic calculator fails, the customer's level of irritation and dissatisfaction is low. This is so because the purchase price and the customer's expectation at purchase are both low.
The customer merely disposes of it and gets another one. However, if your Lexus engine ceases to function while you are driving on a busy freeway, your level of anxiety, irritation, frustration, and dissatisfaction is extremely high. This is because both the customer expectation upon purchase and the purchase price are high. A Lexus is not a disposable item.
Also, for a given product, reliability is a moving target. It varies with the maturity of the technology and from one product generation to the next. For example, when the electronic calculator and digital watch first appeared in the marketplace, they were state-of-the-art products and were extremely costly as well. The people who bought these products were early adopters of the technology and expected them to work. Each product cost in the neighborhood of several hundred dollars (on the order of $800-$900 for the first electronic calculator and $200-$400 for the first digital watches). As the technology was perfected (going from LED to LCD displays and lower-power CMOS integrated circuits) and matured and competition entered the marketplace, the price fell over the years to such a level that these products have both become disposable commodity items (except for high-end products). When these products were new, unique, and high priced, the customer's reliability expectations were high as well. As the products became mass-produced disposable commodity items, the reliability expectations became less and less important, so that today reliability is almost a "don't care" situation for these two products. The designed-in reliability has likewise decreased in response to market conditions.
Thus companies design in just enough reliability to meet the customer's expectations, i.e., consumer acceptance of the product price and level of discomfort that a malfunction would bring about. You don't want to design in more reliability than the application warrants or that the customer is willing to pay for.
Table 1 lists the variables of price, customer discomfort, designed-in reliability, and customer expectations relative to product/application environment, from the simple to the complex.
TABLE 1 Key Customer Variables Versus Product Categories/Applications Environment
Then, too, a particular product category may have a variety of reliability requirements. Take computers as an example. Personal computers for consumer and general business office use have one set of reliability requirements; computers destined for use in high-end server applications (CAD tool sets and the like) have another set of requirements. Computers serving the telecommunication industry must operate for 20-plus years; applications that require nonstop availability and 100% data integrity (for stock markets and other financial transaction applications, for example) have an even higher set of requirements. Each of these markets has different reliability requirements that must be addressed individually during the product concept and design phase and during the manufacturing and production phase.
Reliability cannot be an afterthought apart from the design phase, i.e., something that is considered only when manufacturing yield is low or when field failure rate and customer returns are experienced. Reliability must be designed and built (manufactured) in from the start, commensurate with market and customer needs. It requires a complete understanding of the customer requirements and an accurate translation of those requirements to the language of the system designer. This results in a design/manufacturing methodology that produces a reliable delivered product that meets customer needs. Electronic hardware reliability includes both circuit and system design reliability, manufacturing process reliability, and product reliability. It is strongly dependent on the reliability of the individual components that comprise the product design. Thus, reliability begins and ends with the customer. Figure 1 shows this end-to-end product reliability methodology diagrammatically.
Stated very simply, reliability is not about technology. It's about customer service and satisfaction and financial return. If a consumer product is reliable, customers will buy it and tell their friends about it, and repeat business will ensue.
The same holds true for industrial products. The net result is less rework and low field return rate and thus increased revenue and gross margin. Everything done to improve a product's reliability is done with these thoughts in mind.
Now that I've danced around it, just what is this nebulous concept we are talking about? Quality and reliability are very similar terms, but they are not interchangeable. Both quality and reliability are related to variability in the electronic product manufacturing process and are interrelated, as will be shown by the bathtub failure rate curve that will be discussed in the next section.
Quality is defined as product performance against requirements at an instant in time. The metrics used to measure quality include:
PPM: parts per million defective
AQL: acceptable quality level
LTPD: lot tolerance percent defective
Reliability is the performance against requirements over a period of time.
Reliability measurements always have a time factor. IPC-SM-785 defines reliability as the ability of a product to function under given conditions and for a specified period of time without exceeding acceptable failure levels.
According to IPC standard J-STD-001B, which deals with solder joint reliability, electronic assemblies are categorized in three classes of products, with increasing reliability requirements.
Class 1, or general, electronic products, including consumer products. Reliability is desirable, but there is little physical threat if solder joints fail.
Class 2, or dedicated service, electronics products, including industrial and commercial products (computers, telecommunications, etc.). Reliability is important, and solder joint failures may impede operations and increase service costs.
Class 3, or high-performance, electronics products, including automotive, avionics, space, medical, military, or any other applications where reliability is critical and solder joint failures can be life/mission threatening.
Class 1 products typically have a short design life, e.g., 3 to 5 years, and may not experience a large number of stress cycles. Class 2 and 3 products have longer design lives and may experience larger temperature swings. For example, commercial aircraft may have to sustain over 20,000 takeoffs and landings over a 20-year life, with cargo bay electronics undergoing thermal cycles from ground level temperatures (perhaps as high as 50°C under desert conditions) to very low temperatures at high altitude (about -55°C at 35,000 feet). The metrics used to measure reliability include:
Percent failure per thousand hours
MTBF: mean time between failures
MTTF: mean time to failure
FIT: failures in time, typically failures per billion hours of operation
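The metrics above are different ways of expressing a failure rate, so they can be converted into one another. The following is a minimal sketch of those conversions, assuming a constant failure rate (so that MTBF is simply the reciprocal of the failure rate). The function names are illustrative only and are not taken from any standard.

```python
HOURS_PER_BILLION = 1e9  # FIT is defined as failures per 1e9 device-hours


def fit_to_mtbf_hours(fit: float) -> float:
    """MTBF in hours for a failure rate in FIT (assumes a constant failure rate)."""
    return HOURS_PER_BILLION / fit


def mtbf_hours_to_fit(mtbf_hours: float) -> float:
    """Failure rate in FIT for a given MTBF in hours (assumes a constant failure rate)."""
    return HOURS_PER_BILLION / mtbf_hours


def fit_to_percent_per_khr(fit: float) -> float:
    """Percent failure per thousand hours for a failure rate in FIT."""
    failures_per_hour = fit / HOURS_PER_BILLION
    return failures_per_hour * 1000 * 100  # per 1,000 hours, expressed in percent


if __name__ == "__main__":
    print(fit_to_mtbf_hours(1000.0))
    print(fit_to_percent_per_khr(1000.0))
```

Under this assumption, a part rated at 1,000 FIT has an MTBF of 1,000,000 hours, or 0.1% failure per thousand hours. The conversion does not hold in the infant mortality or wearout regions, where the failure rate is not constant.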
Reliability is a hierarchical consideration at all levels of electronics, from materials to operating systems because:
Materials are used to make components.
Components compose subassemblies.
Subassemblies compose assemblies.
Assemblies are combined into systems of ever-increasing complexity and sophistication.
2. DISCIPLINE AND TASKS INVOLVED WITH PRODUCT RELIABILITY
Electronic product reliability encompasses many disciplines, including component engineering, electrical engineering, mechanical engineering, materials science, manufacturing and process engineering, test engineering, reliability engineering, and failure analysis. Each of these brings a unique perspective and skill set to the task. All of these need to work together as a single unit (a team) to accomplish the desired product objectives based on customer requirements.
These disciplines are used to accomplish the myriad tasks required to develop a reliable product. A study of 72 nondefense corporations revealed that the product reliability techniques they preferred and felt to be important were the following (listed in ranked order) (1):
Supplier control 76%
Parts control 72%
Failure analysis and corrective action 65%
Environmental stress screening 55%
Test, analyze, fix 50%
Reliability qualification test 32%
Design reviews 24%
Failure modes, effects, and criticality analysis 20%
Each of these companies used several techniques to improve reliability. Most will be discussed in this guide.
3. THE BATHTUB FAILURE RATE CURVE
Historically, the bathtub failure rate curve has been used to discuss electronic equipment (product) reliability. Some practitioners have questioned its accuracy and applicability as a model for reliability. Nonetheless, I use it for "talking purposes" to present and clarify various concepts. The bathtub curve, as shown in Figure 2, represents the instantaneous failure rate of a population of identical items at identical constant stress. The bathtub curve is a composite diagram that provides a framework for identifying and dealing with all phases of the lives of parts and equipment.
Observations and studies have shown that failures for a given part or piece of equipment consist of a composite of the following:
Quality: unrelated to stress and not time-dependent; eliminated by inspection process and process improvements
Reliability: stress-dependent; eliminated by screening
Wearout: time-dependent; eliminated by replacement, part design, or new source
Design: may be stress- and/or time-dependent; eliminated by proper application and derating
The bathtub curve is the sum of infant mortality, random failure, and wear out curves, as shown in Figure 3. Each of the regions is now discussed.
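One common way to illustrate this sum is with three Weibull hazard functions: a shape parameter below 1 gives a decreasing (infant mortality) rate, a shape of exactly 1 gives a constant (random) rate, and a shape above 1 gives an increasing (wearout) rate. The sketch below is an illustration only. The Weibull model and every parameter value in it are assumptions chosen to produce a bathtub shape, not data from this guide.

```python
def weibull_hazard(t: float, shape: float, scale: float) -> float:
    """Weibull instantaneous failure rate h(t) = (shape/scale) * (t/scale)**(shape - 1).

    t must be greater than zero, because the rate is unbounded at t = 0 when shape < 1.
    """
    return (shape / scale) * (t / scale) ** (shape - 1)


def bathtub_hazard(t: float) -> float:
    """Sum of infant mortality, random, and wearout hazard rates at time t (hours)."""
    infant = weibull_hazard(t, shape=0.5, scale=2_000.0)       # decreasing rate (assumed values)
    random_ = weibull_hazard(t, shape=1.0, scale=100_000.0)    # constant rate (assumed values)
    wearout = weibull_hazard(t, shape=5.0, scale=150_000.0)    # increasing rate (assumed values)
    return infant + random_ + wearout


if __name__ == "__main__":
    for hours in (100, 1_000, 20_000, 100_000, 200_000):
        print(hours, bathtub_hazard(hours))
```

With these assumed parameters the summed rate is high early on, falls to a minimum in mid-life, and rises again as the wearout term takes over.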
3.1 Region I-Infant Mortality/Early Life Failures
This region of the curve is depicted by a high failure rate and subsequent flattening (for some product types). Failures in this region are due to quality problems and are typically related to gross variations in processing and assembly. Stress screening has been shown to be very effective in reducing the failure (hazard) rate in this region.
3.2 Region II-Useful Life or Random Failures
Useful life failures are those that occur during the prolonged operating period of the product (equipment). For electronic products this period can be much greater than 10 years, but it depends on the product and the stress level. Failures in this region are related to minor processing or assembly variations. The defects track with the defects found in Region I, but with less severity. Most products have acceptable failure rates in this region. Field problems are due to "freak" or maverick lots.
Stress screening cannot reduce this inherent failure rate, but a reduction in operating stresses and/or increase in design robustness (design margins) can reduce the inherent failure rate.
3.3 Region III-Aging and Wearout Failures
Failures in this region are due to aging (longevity exhausted) or wearout. All products will eventually fail. The failure mechanisms are different than those in regions I and II. It has been stated that electronic components typically wear out after 40 years. With the move to deep submicron ICs, this is dramatically reduced.
Electronic equipment/products enter wearout in 20 years or so, and mechanical parts reach wearout during their operating life. Screening cannot improve reliability in this region, but may cause wearout to occur during the expected operating life. Wearout can perhaps be delayed through the implementation of stress reducing designs.
Figures 4-8 depict the bathtub failure rate curves for human aging, a mechanical component, computers, transistors, and spacecraft, respectively. Note that since mechanical products physically wear out, their life cycle failure rate is very different from the electronic product life curve in the following ways:
significantly shorter total life; steeper infant mortality; very small useful operating life; fast wearout.
Figure 9 shows that the life curve for software is essentially a flat straight line with no early life or wearout regions because all copies of a software program are identical and software reliability is time-independent. Software has errors or defects just like hardware. Major errors show up quickly and frequently, while minor errors occur less frequently and take longer to occur and detect. There is no such thing as stress screening of software.
The goal is to identify and remove failures (infant mortalities, latent defects) at the earliest possible place (lowest cost point) before the product gets in the customer's hands. Historically, this has been at the individual component level but is moving to the printed wiring assembly (PWA) level. These points are covered in greater detail in Sections 4 and 7.
Let me express a note of caution. The bathtub failure rate curve is useful to explain the basic concepts, but for complete electronic products (equipment), the time-to-failure patterns are much more complex than the single graphical representation shown by this curve.
4. RELIABILITY GOALS AND METRICS
Most hardware manufacturers establish reliability goals for their products. Reliability goals constrain the design and prevent the fielding of products that cannot compete on a reliability basis. Reliability goals are based on customer expectations and demand, competitive analysis, comparisons with previous products, and an analysis of the technology capability. A combined top-down and bottom-up approach is used for goal setting and allocation. The top-down approach is based on market demand and competitive analysis. Market demand is measured by customer satisfaction surveys, feedback from specific customers, and the business impact of lost or gained sales in which hardware reliability was a factor. The top-down analysis provides reliability goals at a system level, which is the customer's perspective.
The bottom-up approach is based on comparing the current product to previous products in terms of complexity, technology capability, and design/manufacturing processes. Reliability predictions are created using those factors and discussions with component suppliers. These predictions are performed at the unit or board level, then rolled up to the system level to be compared with the top-down goals. If they do not meet the top-down goals, an improvement allocation is made to each of the bottom-up goals, and the process is iterated.
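The roll-up and allocation step can be sketched as follows. This is a minimal illustration under two assumptions that the text does not state: the system is treated as a series system, so unit failure rates simply add, and the improvement is allocated by scaling every unit goal by the same factor. The unit names and numbers are hypothetical.

```python
def roll_up(unit_rates: dict[str, float]) -> float:
    """System-level failure rate as the sum of unit rates (series-system assumption)."""
    return sum(unit_rates.values())


def allocate_improvement(unit_rates: dict[str, float], system_goal: float) -> dict[str, float]:
    """Return unit-level goals that roll up to no more than the system goal.

    If the predictions already meet the goal they are returned unchanged;
    otherwise every unit is scaled by the same factor (one possible allocation rule).
    """
    total = roll_up(unit_rates)
    if total <= system_goal:
        return dict(unit_rates)
    scale = system_goal / total
    return {name: rate * scale for name, rate in unit_rates.items()}


if __name__ == "__main__":
    # Hypothetical bottom-up predictions, in failures per system per year
    predictions = {"cpu_board": 0.04, "power_supply": 0.03, "disk_drive": 0.05}
    top_down_goal = 0.10
    print(roll_up(predictions))
    print(allocate_improvement(predictions, top_down_goal))
```

In practice the allocation is rarely uniform, since some units offer more room for improvement than others. That is one reason the process is iterated.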
However, there is a wide gap between what is considered a failure by a customer and what is considered a failure by hardware engineering. Again, using computers as an example, the customer perceives any unscheduled corrective maintenance (CM) activity on a system, including component replacement, adjustment, alignment, and reboot, as a failure. Hardware engineering, however, considers only returned components for which the failure can be replicated as a failure. The customer-perceived failure rate is significantly higher than the engineering-perceived failure rate because customers consider no-trouble-found (NTF) component replacements and maintenance activity without component replacement as failures. This dichotomy makes it possible to have low customer satisfaction with regard to product reliability even though the design has met its failure rate goals. To accommodate these different viewpoints, multiple reliability metrics are specified and measured. The reliability goals are also translated based on customer expectations into hardware engineering goals such that meeting the hardware engineering goals allows the customer expectations to be met.
Typical reliability metrics for a high-reliability, high-availability, fault-tolerant computer are shown in Table 2. The CM rate is what customers see.
The part (component) replacement (PR) rate is observed by the factory and logistics organization. The failure rate is the engineers' design objective. The difference between the failure rate and the PR rate is the NTF rate, based on returned components that pass all the manufacturing tests. The difference between the CM rate and PR rate is more complex.
If no components are replaced on a service call, the CM rate will be higher than the PR rate. However, if multiple components are replaced on a single service call, the CM rate will be lower than the PR rate. From the author's experience, the CM rate is higher than the PR rate early in the life of a product when inadequate diagnostics or training may lead to service calls for which no problem can be diagnosed. For mature products these problems have been solved, and the CM and PR rates are very similar.
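The relationships among the CM, PR, failure, and NTF rates can be illustrated with a small calculation from field data. The sketch below is an assumption-laden example: the field names, the per-system-year normalization, and the counts are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class FieldData:
    system_years: float          # installed systems multiplied by years in service
    service_calls: int           # corrective maintenance (CM) events
    parts_replaced: int          # parts replaced during those CM events
    parts_confirmed_failed: int  # returned parts that failed factory testing


def reliability_rates(data: FieldData) -> dict[str, float]:
    """Compute CM, PR, failure, and NTF rates per system-year."""
    cm = data.service_calls / data.system_years
    pr = data.parts_replaced / data.system_years
    failure = data.parts_confirmed_failed / data.system_years
    return {
        "cm": cm,
        "pr": pr,
        "failure": failure,
        "ntf": pr - failure,  # replaced parts that passed all tests
    }


if __name__ == "__main__":
    # Hypothetical early-life product: more service calls than part replacements
    early_life = FieldData(system_years=1000.0, service_calls=120,
                           parts_replaced=100, parts_confirmed_failed=70)
    print(reliability_rates(early_life))
```

In this hypothetical case the CM rate exceeds the PR rate, matching the early-life pattern described above, and the gap between the PR rate and the failure rate is the NTF rate.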
Each of the stated reliability metrics takes one of three forms:
The relationships among the various forms of the metrics are shown in Figure 10.
TABLE 2 Metric Definitions for a High-Reliability, High-Availability, Fault-Tolerant Computer
Corrective maintenance (CM) rate: A corrective maintenance activity such as a part replacement, adjustment, or reboot. CMs are maintenance activities done in a reactive mode and exclude proactive activity such as preventive maintenance.
Part replacement (PR) rate: A part replacement is any (possibly multiple) part replaced during a corrective maintenance activity. For almost all the parts we track, the parts are returned to the factory, so part replacement rate is equivalent to part return rate.
Failure rate: A returned part that fails a manufacturing or engineering test. Any parts that pass all tests are called no trouble found (NTF). NTFs are important because they indicate a problem with our test capabilities, diagnostics, or support process/training.