Guide to Reliability of Electrical/Electronic Equipment and Products--Robust Design Practices (part 4)


14. THERMAL MANAGEMENT

The most reliable and well-designed electronic equipment will malfunction or fail if it overheats. Considering thermal issues early in the design process results in a thermally conscious system layout that minimizes costs through the use of passive cooling and off-the-shelf components. When thermal issues are left until completion of the design, the only remaining solution may be costly and drastic measures, such as the design of a custom heat sink that requires all the space available. Incorporating a heat sink or fan into a product after it has been developed can be expensive, and still may not provide sufficient cooling of the product.

I address thermal issues from two perspectives: from that of the individual ICs and other heat-generating components placed on the PWA and from that of a system or complete equipment/enclosure.

Today's high-speed CMOS integrated circuits operate at or above 1 GHz clock speeds and generate between 60 and 100 W! There is nothing low power about these circuits. Also, the junction temperatures have been steadily declining from 150°C to about 85-90°C for leading edge ICs. What all this means is that these ICs are operating in an accelerated manner (similar to that previously encountered during IC burn-in) in their intended ambient application. Integrated circuit suppliers estimate that for every 10°C rise of the junction temperature, the device failure rate doubles. If the heat generated inside the IC is not removed, its reliability is compromised. So there is a real challenge here in using leading edge ICs.
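The supplier rule of thumb quoted above (failure rate doubles for every 10°C rise in junction temperature) can be turned into a quick estimate. The following is a minimal sketch, assuming the doubling holds uniformly over the temperature range of interest; the function name and the sample temperature rises are illustrative only.

```python
def failure_rate_multiplier(delta_t_c: float, doubling_step_c: float = 10.0) -> float:
    """Relative failure rate after a junction temperature rise of delta_t_c (in °C).

    Assumes the rule of thumb from the text: the failure rate doubles
    for every `doubling_step_c` degrees of junction temperature rise.
    """
    return 2.0 ** (delta_t_c / doubling_step_c)


if __name__ == "__main__":
    # Illustrative temperature rises, not measured data.
    for rise in (0, 10, 20, 30):
        print(f"+{rise} °C -> failure rate x{failure_rate_multiplier(rise):.1f}")
```

Under this rule a 30°C rise corresponds to roughly an eightfold increase in failure rate, which is why even modest reductions in junction temperature are worth pursuing.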

According to Moore's law, the amount of information stored in an IC (expressed as density in terms of the number of on-chip transistors) doubles every 18 months. This has been a valid measure of IC improvement since the 1970s and continues today. Moore's law also applies to thermal management. As chip technology becomes increasingly smaller and more powerful, the amount of heat generated per square inch increases accordingly. Various system level power management techniques like low-power quiescent modes, clock gating techniques, use of low-power circuits and low-power supply voltage, and power-versus-performance tradeoffs are widely used to reduce the generated heat. However, it is not all good news. Activation of an IC that is in its quiescent or quiet mode to its normal operating mode causes a large current spike, resulting in rapid heating. This produces a large thermal gradient across the surface of the IC die (or across several areas of the die), potentially cracking the die or delaminating some of the material layers. New assembly and packaging technology developments make the situation even more complex, requiring new approaches to cooling.

The ability of an electronic system to dissipate heat efficiently depends on the effectiveness of the IC package in conducting heat away from the chip (IC) and other on-board heat-generating components (such as DC/DC converters) to their external surfaces, and the effectiveness of the surrounding system to dissipate this heat to the environment.

The thermal solution consists of two parts. The first part of the solution is accomplished by the IC and other on-board component suppliers constructing their packages with high thermal conductivity materials. Many innovative and cost effective solutions exist, from the tiny small outline integrated circuit and chip scale packages to the complex pin grid array and ball grid array packages housing high-performance microprocessors, FPGAs, and ASICs.

Surface mount technology, CSP and BGA packages and the tight enclosures demanded by shrinking notebook computers, cell phones, and personal digital assistant applications require creative approaches to thermal management. Increased surface mount densities and complexities can create assemblies that are damaged by heat in manufacturing. Broken components, melted components, warped PWAs, or even PWAs catching on fire may result if designers fail to provide for heat buildup and create paths for heat flow and removal. Stress buildup caused by different coefficients of thermal expansion (CTE) between the PWA and components in close contact is another factor affecting equipment/system assembly reliability. Not only can excessive heat affect the reliability of surface mount devices, both active and passive, but it can also affect the operating performance of sensitive components, such as clock oscillators and mechanical components such as disk drives. The amount of heat generated by the IC, the package type used, and the expected lifetime in the product combine with many other factors to determine the optimal heat removal scheme.

In many semiconductor package styles, the only thing between the silicon chip and the outside world is high thermal conductivity copper (heat slug or spreader) or a thermally equivalent ceramic or metal. Having reached this point the package is about as good as it can get without resorting to the use of exotic materials or constructions and their associated higher costs. Further refinements will happen, but with diminishing returns. In many applications today, the package resistance is a small part (less than 10%) of the total thermal resistance.

The second part of the solution is the responsibility of the system designer.

High-conductivity features of an IC package (i.e., low thermal resistance) are wasted unless heat can be effectively removed from the package surfaces to the external environment. The system thermal resistance issue can be dealt with by breaking it down into several parts: the conduction resistance between the IC package and the PWA; the conduction resistance between the PWAs and the external surface of the product/equipment; the convection resistance between the PWA, other PWAs, and the equipment enclosure; and the convection resistance between these surfaces and the ambient. The total system thermal resistance is the sum of each of these components. There are many ways to remove the heat from an IC: placing the device in a cool spot on the PWA and in the enclosure; distributing power-generating components across the PWA; and using a liquid cooled plate connected to a refrigerated water chiller are among them.
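Because the system thermal resistance is the sum of its series components, the temperature at the hot end of the path can be estimated from the dissipated power and the ambient temperature. The sketch below assumes a simple one-dimensional series path in steady state; the component names follow the breakdown above, but every numeric value is hypothetical and chosen only for illustration.

```python
def hot_end_temperature_c(power_w, ambient_c, resistances_c_per_w):
    """Steady-state temperature at the hot end of a series thermal path.

    T_hot = T_ambient + P * sum(R_i), with each R_i in °C/W.
    Assumes all dissipated power flows through every resistance in turn.
    """
    total_resistance = sum(resistances_c_per_w.values())
    return ambient_c + power_w * total_resistance


# Hypothetical resistances (°C/W) for the four parts named in the text.
system_path = {
    "package_to_pwa_conduction": 1.0,
    "pwa_to_external_surface_conduction": 0.5,
    "pwa_to_enclosure_convection": 1.5,
    "surface_to_ambient_convection": 2.0,
}

if __name__ == "__main__":
    temp = hot_end_temperature_c(power_w=10.0, ambient_c=40.0,
                                 resistances_c_per_w=system_path)
    print(f"Estimated package surface temperature: {temp:.1f} °C")
```

Listing the components separately makes it easy to see which resistance dominates the total, and therefore where design effort on cooling is likely to pay off.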

Since convection is largely a function of surface area (larger means cooler), the opportunities for improvement are somewhat limited. Oftentimes it is not practical to increase the size of an electronic product, such as a notebook computer, to make the ICs run cooler. So various means of conduction (using external means of cooling such as heat sinks, fans, or heat pipes) must be used.

The trend toward distributed power (DC/DC converters or power regulators on each PWA) is presenting new challenges to the design team in terms of power distribution, thermal management, PWA mechanical stress (due to weight of heat sinks), and electromagnetic compatibility. Exacerbating these issues still further is the trend toward placing the power regulator as close as possible to the microprocessor (for functionality and performance reasons), even to the point of putting them together in the same package. This extreme case causes severe conflicts in managing all issues. From a thermal perspective, the voltage regulator module and the microprocessor should be separated from each other as far as possible.

Conversely, maximizing electrical performance requires that they be placed as close together as possible. The microprocessor is the largest source of electromagnetic interference, and the voltage regulator module adds significant levels of both conducted and radiated interference. Thus, from an EMI perspective the voltage regulator and microprocessor should be integrated and encapsulated in a Faraday cage. However, this causes some serious thermal management issues relating to the methods of providing efficient heat removal and heat sinking. The high clock frequencies of microprocessors require small apertures in the chassis to meet EMI standards, which conflicts with the thermal requirement for large openings to create air flow and cool the devices within. This challenges the design team and requires that system design tradeoffs and compromises be made.

A detailed discussion of thermal management issues is presented in section 5.


FIGURE 15 The signal integrity issue as displayed on an oscilloscope. (From Ref. 2, used with permission from Evaluation Engineering, November 1999.)


FIGURE 16 Impact of faster ICs on timing margins. (From Ref. 2, used with permission from Evaluation Engineering, November 1999.)

15. SIGNAL INTEGRITY AND DESIGN FOR ELECTROMAGNETIC COMPATIBILITY

Intended signals need to reach their destination at the same time all the time.

This becomes difficult as microprocessor and clock speeds continue to increase, creating a serious signal integrity issue. Signal integrity addresses the impact of ringing, overshoot, undershoot, settling time, ground bounce, crosstalk, and power supply noise on high-speed digital signals during the design of these systems. Some symptoms that indicate that signal integrity (SI) is an issue include skew between clock and data, skew between receivers, fast clocks (less setup time, more hold time) and fast data (more setup time, less hold time), signal delay, and temperature sensitivities. Figure 15 shows a signal integrity example as it might appear on a high-bandwidth oscilloscope. The clock driver has a nice square wave output waveform, but the load IC sees a waveform that is distorted by both overshoot and ringing. Possible reasons for this condition include the following: the PCB trace may not have been designed as a transmission line; the PCB trace transmission line design may be correct, but the termination may be incorrect; or a gap in either the ground or power plane may be disturbing the return current path of the trace.

As stated previously, signal integrity is critical in fast bus interfaces, fast microprocessors, and high throughput applications (computers, networks, telecommunications, etc.). Figure 16 shows that as circuits get faster, timing margins decrease, leading to signal integrity issues. Faster parts end up in a given design, and thus cause SI problems, in several ways:

A faster driver is chosen for a faster circuit.

A slow part is discontinued, being replaced by a new and faster version.

An original part is replaced by a faster "die-shrunk" part to reduce manufacturing costs.

An IC manufacturer develops one part for many applications or has faster parts with the same part number as the older (previous) generation parts.

Without due consideration of the basic signal integrity issues, high-speed products will fail to operate as intended.
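The shrinking timing margin can be illustrated with a simplified synchronous timing budget. This is a minimal sketch, assuming a single-cycle path where setup margin is the clock period minus the driver clock-to-output delay, trace flight time, receiver setup time, and clock skew; all parameter names and delay values are hypothetical.

```python
def setup_margin_ns(clock_mhz, t_clk_to_out_ns, t_flight_ns, t_setup_ns, t_skew_ns):
    """Setup margin left in one clock period (simplified budget, in ns)."""
    period_ns = 1000.0 / clock_mhz
    return period_ns - (t_clk_to_out_ns + t_flight_ns + t_setup_ns + t_skew_ns)


# Hypothetical delays, held constant while only the clock rate changes.
delays = dict(t_clk_to_out_ns=3.0, t_flight_ns=2.0, t_setup_ns=1.5, t_skew_ns=0.5)

if __name__ == "__main__":
    for mhz in (100.0, 200.0):
        margin = setup_margin_ns(mhz, **delays)
        print(f"{mhz:.0f} MHz: setup margin = {margin:+.1f} ns")
```

With these illustrative numbers the margin is positive at 100 MHz and negative at 200 MHz, so delays that were once negligible (trace flight time, skew) become the deciding terms as the clock period shrinks.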

Signal integrity wasn't always important. In the 1970-1990 time frame, digital logic circuitry (gates) switched so slowly that digital signals actually looked like ones and zeroes. Analog modeling of signal propagation was not necessary. Those days are long gone. At today's circuit speeds even the simple passive elements of high-speed design (the wires, PC boards, connectors, and chip packages) can make up a significant part of the overall signal delay. Even worse, these elements can cause glitches, resets, logic errors, and other problems.

Today's PC board traces are transmission lines and need to be properly managed. Signals traveling on a PCB trace experience delay. This delay can be much longer than edge time, is significant in high-speed systems, and is in addition to logic delays. Signal delay is affected by the length of the PCB trace and any physical factors that affect either the inductance (L) or capacitance (C), such as the width, thickness, or spacing of the trace; the layer in the PCB stack-up; material used in the PCB stack-up; and the distance to ground and VCC planes.

Reflections occur at the ends of a transmission line unless the end is terminated in Zo (its characteristic impedance) by a resistor or another line. The characteristic impedance is $Z_o = \sqrt{L/C}$, and it determines the ratio of voltage to current in a PCB trace. Increasing PCB trace capacitance by moving the traces closer to the power plane, making the traces wider, or increasing the dielectric constant decreases the trace impedance.

Capacitance is more effective in influencing Zo because it changes faster than inductance with cross-sectional changes.

Increasing PCB trace inductance increases trace impedance; this happens if the trace is narrow. Trace inductance doesn't change as quickly as capacitance does when changing the cross-sectional area and is thus less effective for influencing Zo. On a practical level, both lower trace impedances and strip lines (having high C and low Zo) are harder to drive; they require more current to achieve a given voltage.
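The dependence of trace impedance and delay on L and C can be checked numerically. The sketch below assumes an ideal lossless line, for which the characteristic impedance is sqrt(L/C) and the propagation delay per unit length is sqrt(L*C); the per-meter inductance and capacitance values are hypothetical, picked to give a round 50 ohm result.

```python
import math


def characteristic_impedance(l_per_m, c_per_m):
    """Zo in ohms for a lossless line: sqrt(L/C), with L in H/m and C in F/m."""
    return math.sqrt(l_per_m / c_per_m)


def delay_per_meter(l_per_m, c_per_m):
    """Propagation delay in s/m for a lossless line: sqrt(L*C)."""
    return math.sqrt(l_per_m * c_per_m)


# Hypothetical trace parameters: 300 nH/m and 120 pF/m.
L_PER_M = 300e-9
C_PER_M = 120e-12

if __name__ == "__main__":
    zo = characteristic_impedance(L_PER_M, C_PER_M)
    td = delay_per_meter(L_PER_M, C_PER_M)
    print(f"Zo = {zo:.1f} ohm, delay = {td * 1e9:.1f} ns/m")
    # A wider trace or one closer to a plane has more capacitance, so Zo drops.
    zo_wide = characteristic_impedance(L_PER_M, 1.44 * C_PER_M)
    print(f"With 44% more capacitance: Zo = {zo_wide:.1f} ohm")
```

Note that raising capacitance lowers the impedance and also lengthens the delay, so a geometry change made for impedance control shifts the timing budget as well.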

How can reflection be eliminated?

Slow down the switching speed of driver ICs. This may be difficult since this could upset overall timing.

Shorten traces to their critical length or shorter.

Match the end of the line to Zo using passive components.
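The effect of matching the line can be quantified with the standard reflection coefficient, (ZL - Zo)/(ZL + Zo), which is zero when the load equals Zo. The sketch below also sizes a series (source) terminator as Zo minus the driver output impedance, a common first-order choice; the driver impedance of 17 ohms is a hypothetical value, and a real design would take it from the IC's buffer model.

```python
import math


def reflection_coefficient(z_load, z0):
    """Fraction of the incident voltage reflected at a load: (ZL - Zo)/(ZL + Zo)."""
    if math.isinf(z_load):
        return 1.0  # open-circuited end reflects the full wave
    return (z_load - z0) / (z_load + z0)


def series_termination_ohms(z0, driver_output_ohms):
    """First-order series resistor so that driver impedance plus resistor equals Zo."""
    return max(z0 - driver_output_ohms, 0.0)


if __name__ == "__main__":
    z0 = 50.0
    for label, z_load in (("matched", 50.0), ("open", math.inf), ("short", 0.0)):
        print(f"{label}: gamma = {reflection_coefficient(z_load, z0):+.2f}")
    # Hypothetical driver with 17 ohm output impedance.
    print(f"Series resistor: {series_termination_ohms(z0, 17.0):.0f} ohm")
```

An unterminated CMOS input looks close to an open circuit at the end of the line, which is why nearly the full wave reflects unless a terminator is added.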

Signal integrity and electromagnetic compatibility (EMC) are related and have an impact on each other. If an unintended signal, such as internally or externally coupled noise, reaches the destination first, changes the signal rise time, or causes it to become nonmonotonic, it's a timing problem. If added EMC suppression components distort the waveform, change the signal rise time, or increase delay, it's still a timing problem. Some of the very techniques that are most effective at promoting EMC at the PWA level are also good means of improving SI. When implemented early in a project, this can produce more robust designs, often eliminating one prototype iteration. At other times techniques to improve EMC are in direct conflict with techniques for improving SI.

How a line is terminated determines circuit performance, SI, and EMC.

Matched impedance reduces SI problems and sometimes helps reduce EMC issues. But some SI and EMC effects conflict with each other. Tables 13 and 14 compare termination methods for their impact from signal integrity and EMC perspectives, respectively. Notice the conflicting points between the various termination methods as applicable to SI and EMC.

If fast-switching ICs were not used in electronic designs and we didn't have signal transitions, then there would be no SI problems, or products manufactured for that matter. The faster the transitions, the bigger the problem. Thus, it is important to obtain accurate models of each IC to perform proper signal integrity and EMC analysis. The models of importance are buffer rather than logic models because fast buffer slew times relative to the line lengths cause most of the trouble.

=======

TABLE 13 Comparison of Line Termination Methods from a Signal Integrity Perspective

Type | Advantages | Disadvantages
Series | Single component; low power; damps entire circuit | Value selection difficult; best for concentrated receiver loads
Pull-up/down | Single component; value choice easy; okay for multiple receivers | Large DC loading; increased power
AC parallel | Low power; easy resistor choice; okay for multiple receivers | Two components; difficult to choose capacitor
Diode | Works for a variety of impedances | Two components; diode choice difficult; some over/undershoot

=========

TABLE 14 Comparison of Line Termination Methods from an EMC Perspective

Type | Summary | EMC effect
Series | Best | Reduced driver currents give good performance. Works best when resistor is very close to driver.
DC pull-up/down | So-so | Less ringing generally reduces EMI. Some frequencies may increase.
AC parallel | So-so | Similar to DC parallel, but better if capacitor is small.
Diode | Worst | Can generate additional high-frequency emissions.

=========

There are two widely used industry models available, SPICE and IBIS.

SPICE is the de facto standard for modeling both digital and mixed-signal ICs (those with both digital and analog content). IBIS, maintained under the auspices of EIA 656, is used for modeling digital systems. It is the responsibility of the IC suppliers (manufacturers) to provide these models to original equipment manufacturers (OEMs) for use in their system SI analysis.

In summary, as operating speeds increase the primary issues that need to be addressed to ensure signal integrity include:

1. A greater percentage of PCB traces in new designs will likely require terminators. Terminators help control ringing and overshoot in transmission lines. As speeds increase, more and more PCB traces will begin to take on aspects of transmission line behavior and thus will require terminators. As with everything there is a tradeoff that needs to be made. Since terminators occupy precious space on every PC board and dissipate quite a bit of power, the use of terminators will need to be optimized, placing them precisely where needed and only where needed.

2. The exact delay of individual PCB traces will become more and more important. Computer-aided design tool manufacturers are beginning to incorporate features useful for matching trace lengths and guaranteeing low clock skew. At very high speeds these features are crucial to system operation.

3. Crosstalk issues will begin to overwhelm many systems. Every time the system clock rate is doubled, crosstalk intensifies by a factor of two, potentially bringing some systems to their knees. Some of the symptoms include data-dependent logic errors, sudden system crashes, software branches to nowhere, impossible state transitions, and unexplained interrupts. The dual manufacturing/engineering goal is to compress PCB layout to the maximum extent possible (for cost reasons), but without compromising crosstalk performance on critical signals.

4. Ground bounce and power supply noise are big issues. High powered drivers, switching at very fast rates, in massive parallel bus structures are a sure formula for power system disaster. Using more power and ground pins and bypass capacitors helps, but this takes up valuable space and adds cost.
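The benefit of adding ground pins in item 4 can be estimated with the first-order relation V = L * di/dt applied to the shared ground path. The sketch below assumes that all drivers switch simultaneously through identical pins and that parallel ground pins divide the inductance evenly (mutual inductance is ignored); the pin inductance, current step, and edge time are hypothetical values.

```python
def ground_bounce_v(n_drivers, di_a, dt_s, pin_inductance_h, n_ground_pins=1):
    """First-order ground bounce estimate: V = L_eff * N * di/dt.

    L_eff is the single-pin inductance divided by the number of parallel
    ground pins (mutual inductance between pins is ignored).
    """
    effective_l = pin_inductance_h / n_ground_pins
    return effective_l * n_drivers * di_a / dt_s


if __name__ == "__main__":
    # Hypothetical 16-bit bus: 20 mA per driver in 1 ns, 5 nH per ground pin.
    for pins in (1, 4):
        v = ground_bounce_v(n_drivers=16, di_a=0.020, dt_s=1e-9,
                            pin_inductance_h=5e-9, n_ground_pins=pins)
        print(f"{pins} ground pin(s): about {v:.2f} V of bounce")
```

Under these assumptions, going from one ground pin to four cuts the estimated bounce by a factor of four, at the cost of the board space and pin count noted in the text.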

Section 6 presents an in-depth discussion of EMC and design for EMC.
