Imagine the fantasy of perfect reliability. Applied to a single instance of a product, it would mean that the product always worked exactly as intended—however it was used, whatever conditions it was exposed to. Perfect reliability is hard enough to achieve in a single product, but very few products are built as single instances. Even without considering normal manufacturing variance, inherent randomness of the grain structure within a product's materials dictates that no two copies of a product could ever be built exactly alike. Just that level of variation at the grain structure level can give rise to differences in failure outcomes. Real world manufacturing processes then add further variance to different instances of a given product. Once the products are put in service, the particular life-time use of any copy will be unique (some copies may be overly stressed, others treated gently), and will take place under differing external conditions (heat, cold, shock loads, dust, etc.), all of which add further variation. Perfect reliability would mean that no copies ever failed—that in spite of manufacturing variances, differing uses, and exposure to differing conditions, nothing ever went wrong.
In normal practice, then, across all copies, all uses, and all exposures, some number of product failures will occur. Coming to a quantitative understanding of these failures (how many are likely to occur, when they are likely to occur, under what conditions they will occur, why they will occur, and how to reduce their occurrence) therefore has significant benefits.
Most products are composites of many components. The reliability of the product is a function of the reliability of all of the parts, including connections between the parts. Disregarding operator error—e.g., settings on the production line were grossly mis-set—products fail because of a material failure in a component. For example, electronic devices can fail when an interconnect fails. Materials fail because repetitive stress applied over time causes internal micro-structures (e.g., grain domains) to move or distort sufficiently to nucleate a discontinuity which leads to the propagation of a small crack, leading to a larger one, and finally to outright material failure.
Electronic devices, such as power supplies, are particularly reliant on the integrity of interconnects or solder bonds. The reliability of interconnects is a concern because it is widely reported in the open literature that fracture failures in solder joints account for up to 70% of failures in electronic components. Interconnect or solder degradation and failure are principally due to thermomechanical fatigue mechanisms.
Electronic devices such as power supplies are complex multilayered devices consisting of different materials with inherent variability. Power supply systems are, compared to other electronic systems, highly susceptible to failure due to the high voltage and current conditions in which they routinely operate. Competitive pressures are demanding that electronics be operated in increasingly harsh environments and under increasingly harsh operating conditions. Also, the trend to provide more processing power from smaller and smaller integrated circuits is accelerating. However, all electronic devices fail eventually, regardless of how well they are engineered. Unlike mechanical systems, these electronic systems do not actively display conventional fault signals prior to failure. As a device is operated, thermal and/or mechanical loads are induced in it. These loads are translated from the device level to the localized interconnect level. Thermal gradient cycling occurs during system operation and eventually results in thermomechanical fatigue-induced failure. Failure can frequently be attributed to structural, material, or manufacturing defects. For example, an electronic circuit can fail from the loss of a solder joint. A failure at the module or component level often leads to failure of the whole system. Such failures can result in immediate electronic system shutdown with no advance fault or warning signals, thus preventing the use of conventional fault-to-failure detection approaches as a means of predicting maintenance need. Such failures also present safety or maintenance concerns and often result in economic setbacks such as loss of market share when the product's failure rate becomes sufficiently notorious.
The consequences of failure of a product to the immediate user range from minor inconvenience, to major annoyance, or to catastrophe. Repercussions from such failures ultimately transform into consequences for the manufacturers. It is such consequences that motivate product manufacturers to develop rational strategies to minimize occurrence of failure. The strategies vary depending on specific motivating circumstances, but all involve economic considerations and trade-offs. Even if a product has a significant potential to produce catastrophic results, economic trade-offs cannot be ignored (for one can always spend more and take more time testing, to achieve still higher levels of safety). Less dramatically, when building reliable products is motivated merely by achieving market success, economics is an inherent and more natural part of the calculation.
To approach reliability at a strategic level, an organization must properly integrate reliability factors into the details of its product design processes, deciding throughout the process how much reliability to purchase—that is, how to make rational decisions at all steps along the way about the economic trade-offs associated with the benefits versus the costs of achieving ever greater reliability. Manufacturers that understand reliability properly, and are able to execute according to that understanding, will in the long run significantly out-perform manufacturers that do not. This represents a paradigm shift from old methods in which a reliability specialist designed an analysis framework, tested a product or component in that framework, and repaired or adjusted the product or component accordingly. In approaches advocated herein, so-called reliability-based design, a designer uses a knowledge of failure to develop an understanding of component life, thereby permitting control of various factors.
However, it is simply not practical to directly sense the degradation of electronic components. Their damage states are usually structural and, due to their size, structural response signatures are not monitored on electronic components. None of the electronics industries traditionally used fatigue models to account for the large scatter in the solder weld properties. For example, it would be both difficult and expensive to directly sense the cracking of a single emitter wire bond on a circuit board comprised of thousands of emitter wires. Yet, the failure of a single emitter wire can cause the failure of the entire device.
If there were an effective way to predict the impending failure of an electronic system, module, or component, operators could repair or retire a system before an actual failure, thus avoiding the negative consequences of failure. Thus, accurate prediction of impending failure could have great economic impact on industries whose products rely on electronics such as aerospace, automotive, communications, medical device, domestic appliance, and related sectors.
Engineers have tried to design electronics for high reliability, but most often the reliability information comes very late in the design process. Normally, a statistically significant quantity of reliability data is not obtained until after product launch, once warranty claims from use by consumers have been fielded. This lack of data inspired engineers in the past to make their designs more robust by using safety factors that ensured the designs met reliability goals.
Similar components frequently present great lifespan variations, however. One electronic element might last many years, but another element produced by the same manufacturer could fail in a few months. Traditional methods of component design attempt to offset the effects of great uncertainty or scatter in lifespan by applying large safety factors to ensure low probabilities of failure. Safety factors, however, are subjective in nature and are usually based on historical use. Since modern manufacturers are incorporating new technology and manufacturing methods faster than ever before, exactly what safety factor is appropriate in today's complex, state-of-the-art electronics is seldom, if ever, known with certainty. This complicates the engineering design process. Designed-in safety factors tend to add material or structural components, or add complexity to the manufacturing process. Safety factors are counterproductive where industry is attempting to cut costs or reduce weight. To ensure that no component (e.g., an aircraft part) fails during operation, components are often retired well before exhausting their useful lifetime. In addition, the true operating life of the component could be much greater than its predicted life. Therefore, given that true operational performance is so difficult to predict accurately, it is common practice within the electronics industry to achieve high reliability through redundancy. Although redundancy allows an electronic system to continue to operate after one or more components have failed, this practice is costly and presents a barrier to electronics miniaturization. Designing cost-effective and highly reliable electronics through maximizing component life therefore requires the ability to reduce the safety factors as much as possible for a given design.
Previously, the reliability of electronic devices has also been assessed using empirically based models. Design of experiments is a commonly used tool in which the experimental conditions are systematically varied and a mathematical relationship is “fit” to the data that represents the influence of the conditions on the time or cycles to failure. However, one problem is that there is so much variation in the time or cycles to failure that device life can only be conveyed in the form of a statistical average, such as mean time to failure (MTTF) or mean time between failures (MTBF). Although these statistical averages provide a general sense of average overall failure, they are a holdover from a time when computer processing power was expensive. They provide only a single point number and offer no insight into real-world probabilistic variation, true failure mechanisms, or the impact those mechanisms have on how a specific design will react to actual field conditions. Accordingly, although such metrics are appropriate in the context of manufactured fleet lot reliability, they lack the fidelity for accurate representation of individual device reliability in the field.
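The limitation of a single-point average can be illustrated with a short Python sketch. It is only an illustration under stated assumptions: device life is assumed to follow a Weibull distribution, and the MTTF, warranty period, and shape values are hypothetical numbers chosen for the example, not data from any real product. Two designs are given the same MTTF but different scatter, and the fraction expected to fail within the warranty period is compared.

```python
import math

def weibull_scale_for_mean(mean_life, shape):
    """Scale parameter that yields the requested mean life for a given Weibull shape."""
    return mean_life / math.gamma(1.0 + 1.0 / shape)

def weibull_failure_probability(t, shape, scale):
    """Probability that a unit has failed by time t (Weibull cumulative distribution)."""
    return 1.0 - math.exp(-((t / scale) ** shape))

# Hypothetical values, chosen only for illustration.
MTTF_HOURS = 10_000.0
WARRANTY_HOURS = 2_000.0
DESIGN_SHAPES = {"low scatter": 4.0, "high scatter": 0.8}  # assumed Weibull shapes

early_failures = {}
for name, shape in DESIGN_SHAPES.items():
    # Both designs are forced to have exactly the same MTTF.
    scale = weibull_scale_for_mean(MTTF_HOURS, shape)
    early_failures[name] = weibull_failure_probability(WARRANTY_HOURS, shape, scale)
    print(f"{name}: MTTF = {MTTF_HOURS:.0f} h, "
          f"fraction failed by {WARRANTY_HOURS:.0f} h = {early_failures[name]:.4f}")
```

Although both hypothetical designs report an identical MTTF, the high-scatter design shows a far larger fraction of units failing within the warranty period, which is the kind of information a single average cannot convey.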
The mathematics behind simulation processes, such as the Monte Carlo method, has been widely used within reliability analysis circles. Previous barriers to widespread use of such simulations include the fact that a typical designer does not have access to the reliability data needed to accomplish a system roll-up process. For example, warranty data with relatively good accuracy is readily available to corporate reliability groups, and the relatively few engineers in those groups are the ones who have been able to perform high quality “advanced-look” reliability assessments for concept designs.
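A system roll-up of the kind mentioned above can be sketched with a basic Monte Carlo simulation in Python. This is a minimal sketch, not the method of this application: it assumes a series system (the first component failure fails the system), statistically independent components, and Weibull-distributed component lives. The component names and parameter values are hypothetical.

```python
import math
import random

# Hypothetical components: name -> (Weibull scale in hours, Weibull shape).
COMPONENTS = {
    "solder_joint": (30_000.0, 2.0),
    "capacitor": (50_000.0, 1.5),
    "wire_bond": (80_000.0, 3.0),
}

def sample_system_life(rng, components):
    """Draw one system life; a series system fails at its first component failure."""
    return min(rng.weibullvariate(scale, shape) for scale, shape in components.values())

def monte_carlo_reliability(components, mission_hours, trials=20_000, seed=0):
    """Estimate the fraction of simulated systems that survive the mission."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    survivors = sum(
        sample_system_life(rng, components) > mission_hours for _ in range(trials)
    )
    return survivors / trials

def exact_series_reliability(components, mission_hours):
    """Closed-form check: product of independent Weibull survival probabilities."""
    return math.prod(
        math.exp(-((mission_hours / scale) ** shape))
        for scale, shape in components.values()
    )

estimate = monte_carlo_reliability(COMPONENTS, mission_hours=10_000.0)
exact = exact_series_reliability(COMPONENTS, mission_hours=10_000.0)
print(f"simulated: {estimate:.3f}  closed form: {exact:.3f}")
```

For this simple case a closed-form answer exists and serves as a check on the simulation. The simulation approach becomes more useful when component lives are correlated or the system logic is not purely series, where closed forms are generally unavailable.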
In attempting to reduce reliance on safety factors, designers have developed models for the more prevalent damage mechanisms that lead to failures. Failures can be attributed to many different kinds of damage mechanisms, including electro-migration, buckling, and corrosion. Models for these mechanisms can be used during the design process, usually through deterministic analysis, to identify feasible design concept alternatives. Nevertheless, poor, or less than desired, reliability is often attributed to variability, and deterministic analysis cannot account for variability.
Variability affects electronic reliability through any number of factors including loading scenarios, environmental condition changes, usage patterns, and maintenance habits. Even the response of a system to a steady input, such as a constant voltage supply, can exhibit variability due to parameters such as a varying ambient temperature.
Over the years, probabilistic techniques have been developed for predicting variability and have been coupled with damage models of failure mechanisms to provide probabilistic damage models that predict the reliability of a population. But, given variability, a prediction of the reliability of a population says little about the future life of an individual member of the population. Safety factors are likewise unsatisfactory methods for predicting the life of an individual since they are based on historical information obtained from a population. Safety factors are also an unsatisfactory method for quickly and efficiently designing against failure since they rely on historical information obtained from test and component data which may not be available in the design phase.
Historically, testing has been the primary means for evaluating the effects of variability. Unfortunately, testing is slow, expensive and burdensome, and evaluation of every possible source of variability is impractical.
The cost of physical tests is rising whereas the cost of computer cycles is plummeting, thereby increasing the practicality of replacing the old “test it” paradigm with a “compute it” paradigm. Testing cannot be completely eliminated, because physical modeling paradigms are not yet sufficiently robust to allow that. However, as part of a new approach to reliability, testing can be focused on providing the critical inputs to the modeling process, allowing computational techniques to then take over and provide a vivid and detailed picture of failure mechanisms, far beyond what testing alone could ever provide. Computational reliability modeling (CRM) will significantly reduce engineering costs while simultaneously providing a more detailed insight into the reliability issues facing a given product design. The goal of CRM is to allow the design engineer to achieve desired levels of product reliability assurance across the widest possible range of operating conditions, including edge states that bedevil the most robust testing programs.
Failure analysis has revealed that actual component loadings are often well below the steady loads that can cause failure. What distinguishes these failures is the fact that the loads have been applied repeatedly. This is classic fatigue. It is estimated that perhaps 90% of all machine failures are caused by fatigue. Fatigue, or more specifically fatigue crack initiation and growth, is therefore a damage mechanism that degrades the reliability and safe life of components subjected to repeated loads. Such loads could be from thermal, vibratory, shock, and electromagnetic loadings. Although less obvious, this same mode of failure applies to static structures as well. Static structural components are subject to vibrations and movements created from thermal expansion and contraction. Though the movements may be slight, large cyclic forces can result. Designing for fatigue has been difficult hitherto because fatigue typically manifests itself with greatly varying effects on similar components.
Fatigue can occur in any device with either static or moving components, even where the movement is imperceptible, as is the case with interconnects or solder joints, where there can be very small displacements but very large strains (displacements per unit length). Component failure is frequently insidious, with no prior indication that damage has occurred. Sometimes fatigue can cause intermittent failure. For example, an initiated fatigue crack in solder can cause the device in which the solder is found to operate sporadically due to metallic contact bridging.
Electronic systems are static structures that are subject to these same types of phenomena. Solder joints are particularly vulnerable to fatigue failure. As systems are powered up and down, these interconnect elements are subject to thermal gradient cycling, which, working in combination with vibration, impact, and shock loadings, creates dynamic conditions conducive to fatigue. The typical electronics printed circuit board (PCB) manufacturing processes, in which solder is melted and then cooled, create joints with complex internal grain structures. These grain structures are under stress from the cooling process, and undergo continuous movement in response to these stresses. This movement, which is ongoing even as the system is sitting under non-working conditions in a warehouse, is in itself sufficient to contribute to fatigue vulnerability.
In the case of fatigue failure, scatter in component life is quantified by a coefficient of variation (COV) which is usually determined based on a large number of fatigue life tests on many material specimens, or by full-scale testing of prototype electronic systems. Even under well-controlled laboratory tests of annealed smooth specimens at room temperature, the COV varies from less than 10% to over 500% for different interconnect alloys. Thus, the considerable scatter in the fatigue reliability of components in operation may be substantially attributed to considerable scatter of component material fatigue behavior.
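For reference, the coefficient of variation is the standard deviation of the observed lives divided by their mean. The Python sketch below computes it for a set of cycles-to-failure values. The values are made up for illustration and are not test data, and the use of the sample standard deviation is an assumption of this sketch.

```python
import statistics

def coefficient_of_variation(lives):
    """Sample standard deviation divided by the mean of the observed lives."""
    return statistics.stdev(lives) / statistics.mean(lives)

# Made-up cycles-to-failure values for nominally identical specimens.
cycles_to_failure = [1200, 1850, 950, 3100, 1500, 2200, 1700, 800]

cov = coefficient_of_variation(cycles_to_failure)
print(f"COV = {cov:.1%}")
```

A COV well below 100% means the lives cluster around the mean, while a COV of several hundred percent means the mean alone says very little about the life of any one specimen.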
Life scatter of components made from a given material, on the other hand, is due to the fact that, generally, materials have inhomogeneous microstructures. To the naked eye, it may appear that a material is composed of continuous homogeneous material. However, microscopic examination reveals that metals, for example, are composed of discontinuous inhomogeneous material having individual crystalline grains, pores, and defects. Cracks nucleate and grow on the order of grain size according to the properties of the individual grains, with growth rates as varied as grain properties. As these cracks grow, the rate and behavior of the crack approach those implied by the bulk or average properties of the material. Therefore, for large cracks, traditional crack growth methods are appropriate. Traditional methods, however, cannot determine the probability of crack initiation or describe crack growth of nearly grain-sized cracks. In many applications, failure can occur before the fatigue damage reaches the long crack stage because, although the damage is very small, the strain energy associated with the damage is very high.
As a result, there exists a need for a method and apparatus for accurately predicting failure that accounts for the microstructural properties of materials and sequential variation in the loading, and relates them to fatigue scatter. In particular, there exists a need for a method and apparatus for accurately predicting electronic component, module, and/or system failure that accounts for variability without the need for extensive test data on the electronic component and/or system. This can be accomplished by accurately assessing a component's life by determining the effects of fatigue on it.
In short, fatigue must be considered a primary mechanism behind electronics failure, and applying the types of modeling techniques advocated in this application can lead to major improvements in the understanding of electronic system reliability.
U.S. Pat. Nos. 7,006,947 and 7,016,825, both of which are incorporated by reference in their entirety, have shown that grain-by-grain simulation of the materials from which individual components are made is successful for fatigue life prediction on large structural components, as well as for providing prognoses of failure when using measured data. However, similar approaches to predicting the reliability of small-scale components such as interconnects have not been thought possible or practical.
The discussion of the background to the invention herein is included to explain the context of the invention. This is not to be taken as an admission that any of the material referred to was published, known, or part of the common general knowledge as at the priority date of any of the claims.
Throughout the description and claims of the specification the word “comprise” and variations thereof, such as “comprising” and “comprises”, is not intended to exclude other additives, components, integers or steps.