A reliable understanding of product failure is important at all phases of a product's life. Key, however, are the design phase—before a product is even built and tested—where the factors that contribute to increased longevity are important, and the operating phase, where a reliable estimation of a particular product's useful lifetime based on its specific use history is invaluable.
Furthermore, most products comprise many separate parts. The reliability of the product is a function of the reliability of all of the parts, including connections between the parts. Disregarding operator error (e.g., settings on the production line were grossly mis-set), products typically fail because of a material failure in a particular part. Materials can fail because repetitive stress applied over time causes internal micro-structures (e.g., grain domains) to move or distort sufficiently to nucleate a discontinuity, which leads to the propagation of a small crack, leading to a larger crack, and finally to outright failure of the part through an inability to sustain the loads applied to it. For example, electronic devices can fail when the material of an interconnect or its solder joint fails. Solder joints are particularly vulnerable to fatigue failure, as further discussed herein. As systems are powered up and down, these interconnect elements are subject to thermal gradient cycling, which, working in combination with vibration, impact, and shock loadings, creates dynamic conditions conducive to fatigue.
Traditionally, reliability of electronic systems has been estimated based on engineering judgment about the applicability of past test results. The traditional measures of electronics systems reliability have been based on mean-time-to-failure (MTTF) and mean-time-between-failure (MTBF). Historically, these measures of reliability were developed based on the assumption that product failure rate is constant, i.e., a product is as likely to fail on its first day of use as it is ten years after being put into service. With today's computational design tools and techniques combined with better manufacturing controls, early failure is rare and long-term wear-out, such as fatigue, is the predominant concern. Given the latter, MTTF and MTBF are no longer good measures of reliability.
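The difference between a constant failure rate and wear-out can be illustrated with a Weibull hazard function. The following is a minimal sketch only: the Weibull form is a standard choice assumed here for illustration, and the shape and scale values are hypothetical rather than taken from any test data. With a shape parameter of 1 the model reduces to the constant-rate (exponential) case that underlies MTTF and MTBF, whereas a shape parameter greater than 1 gives a failure rate that rises with age.

```python
import math

def weibull_hazard(t, shape, scale):
    """Instantaneous failure rate h(t) of a Weibull life distribution."""
    return (shape / scale) * (t / scale) ** (shape - 1)

def weibull_mean(shape, scale):
    """Mean life (the MTTF) of a Weibull life distribution."""
    return scale * math.gamma(1.0 + 1.0 / shape)

# Hypothetical parameters, in years, chosen only for illustration.
SCALE = 10.0
AGES = (1.0, 5.0, 10.0)

# shape = 1: the failure rate is the same at every age.
constant_rate = [weibull_hazard(t, 1.0, SCALE) for t in AGES]

# shape = 3: the failure rate grows with age, as in wear-out.
wear_out = [weibull_hazard(t, 3.0, SCALE) for t in AGES]

mttf_constant = weibull_mean(1.0, SCALE)
mttf_wear_out = weibull_mean(3.0, SCALE)
```

Each case yields a single MTTF value, but that number alone does not reveal whether the failure rate is flat or climbing with age, which is the limitation noted above.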
Imagine the fantasy of perfect reliability: applied to a single instance of a product, it would mean that the product always worked exactly as intended, however it was used and whatever conditions it was exposed to. Perfect reliability is hard enough to achieve in a single product, but very few products are built as single instances. Even without considering normal manufacturing variance, inherent randomness of the grain structure within a product's materials dictates that no two instances of a product could ever be built exactly alike. Just that level of variation at the grain structure level can give rise to differences in failure outcomes between different instances of a given product. Real world manufacturing processes then add further variance to different instances of the product. Once the product is put in service, the particular use history over the lifetime of any instance will be unique (some instances may be overly stressed, others treated gently), and will take place under differing external conditions (heat, cold, shock loads, dust, etc.), all of which add further variation. Perfect reliability would mean that no instances ever failed: that in spite of manufacturing variances, differing uses, and exposure to differing conditions, nothing ever went wrong.
In normal practice, then, across all instances, all uses, and all exposures, some number of product failures will occur. Coming to a quantitative understanding of these failures (how many are likely to occur, when they are likely to occur, under what conditions they will occur, why they will occur, and how to reduce their occurrence) would therefore have significant benefits.
Electronic devices are complex multilayered systems consisting of different materials with inherent variability. Competitive pressures are demanding that electronics be operated under increasingly harsh environments and operating conditions. For example, power supply systems are—compared to other electronic systems—highly susceptible to failure due to the high voltage and current conditions in which they routinely operate. Also, the trend to provide more processing power from smaller and smaller integrated circuits is accelerating. However, even electronic devices fail eventually, regardless of how well they are engineered. Unlike mechanical systems, electronic systems rarely actively display conventional fault signals prior to failure. Failure can frequently be attributed to structural, material, or manufacturing defects.
Electronic devices are particularly reliant on the integrity of interconnects or solder bonds. The reliability of interconnects is a concern because it is widely expressed in the scientific literature that fracture failures in solder joints account for up to 70% of failures in electronic components (see, e.g., Zeng, K., Tu, K. N., (2002) “Six cases of reliability study of Pb-free solder joints in electronic packaging technology”, Materials Science and Engineering R, Vol. 38, pp. 55-105, incorporated herein by reference). Interconnect or solder degradation and failure is principally due to thermomechanical and vibratory fatigue mechanisms. As a device is operated, thermal and/or mechanical loads are induced in it. These loads are translated from the device level to the localized interconnect level. Thermal gradient cycling during normal system operation eventually results in thermo-mechanical fatigue induced failure.
Failure analysis has revealed that actual component loadings are often well below the steady loads that can cause failure. What distinguishes failures attributed to actual loadings is the fact that the loads have been applied repeatedly. This is classic fatigue. It is estimated that perhaps 90% of all machine failures are caused by fatigue. Fatigue, or more specifically fatigue crack initiation and growth, is therefore a damage mechanism that degrades the reliability and safe life of components subjected to repeated loads. Such loads could be from thermal, vibratory, shock, and electromagnetic loadings. Although less obvious, this same mode of failure applies to static structures as well as those that are in motion. Static structural components are subject to vibrations and movements created from thermal expansion and contraction. Electronic systems are static structures that are subject to these same types of phenomena. Though the movements may be slight, large cyclic forces can result. Designing for fatigue has been difficult hitherto because fatigue typically manifests itself with greatly varying effects on similar components.
Fatigue can occur in any device with either static or moving components, even where the movement is imperceptible, such as is the case with interconnects or solder joints, where there can be very small displacements but very large strains (displacements per unit length). Component failure is frequently insidious, with no prior indication that damage has occurred. Sometimes fatigue can cause intermittent failure. For example, an initiated fatigue crack in solder can cause the device in which the solder is found to operate sporadically due to metallic contact bridging.
Solder joints are particularly vulnerable to fatigue failure. As systems are powered up and down, interconnect elements are subject to thermal gradient cycling, which, working in combination with vibration, impact, and shock loadings, creates dynamic conditions conducive to fatigue. The typical electronics printed circuit board (PCB) manufacturing processes, in which solder is melted and then cooled, create joints with complex internal grain structures. These grain structures are stressed by the cooling process, and undergo continuous movement in response to these stresses. This movement, which is ongoing even as the system is sitting under non-working conditions in a warehouse, is in itself sufficient to contribute to fatigue vulnerability.
A failure at the module or component level often leads to failure of the whole system. Such failures can result in immediate electronic system shutdown with no advance fault or warning signals, thus preventing the use of conventional fault-to-failure detection approaches as a means of predicting maintenance need. Such failures can also present safety or maintenance concerns, and often result in economic setbacks such as loss of market share when the product's failure rate becomes sufficiently notorious.
The consequences of failure of a product to the immediate user therefore range from minor inconvenience, through major annoyance, to catastrophe. Repercussions from such failures ultimately transform into consequences for the manufacturers of the product. It is such consequences that motivate product manufacturers to develop rational strategies to minimize occurrence of failure. The strategies vary depending on specific motivating circumstances, but all involve economic considerations and trade-offs. Even if a product has a significant potential to produce catastrophic results, economic trade-offs cannot be ignored (for one can always spend more and take more time testing, to achieve still higher levels of safety).
Engineers have tried to design electronics for high reliability, but most often the reliability information comes very late in the design process. Normally, a statistically significant quantity of reliability data is not obtained until after product launch, once warranty claims arising from consumer use have been fielded. This lack of data inspired engineers in the past to make their designs more robust by using safety factors that ensured the designs would meet predefined reliability goals.
Since similar components frequently present great lifespan variations (for example, one electronic element might last many years, but another element produced by the same manufacturer could fail in a few months), traditional methods of component design attempt to mitigate the effects of great uncertainty or scatter in lifespan by applying large safety factors to ensure low probabilities of failure. Safety factors, however, are subjective in nature and are usually based on historical use. Safety factors are likewise unsatisfactory for predicting the life of an individual component, since they are based on historical information obtained from a population. Safety factors are also an unsatisfactory method for quickly and efficiently designing against failure, since they rely on historical information obtained from test and component data which may not be available in the design phase.
Now that modern manufacturers are incorporating new technology and manufacturing methods faster than ever before, exactly what safety factor is appropriate in today's complex, state-of-the-art electronics is seldom, if ever, known with certainty. This complicates the engineering design process. Designed-in safety factors tend to add material or structural components, or add complexity to the manufacturing process. Safety factors are counterproductive where industry is attempting to cut production costs or reduce weight of products.
Additionally, given that true operational performance is so difficult to predict accurately, it is common practice within the electronics industry to achieve high reliability through redundancy. Although redundancy allows an electronic system to continue to operate after one or more components have failed (because another component is present to perform the same role), this practice is costly and presents a barrier to electronics miniaturization. Designing cost-effective and highly reliable electronics through maximizing component life therefore requires the ability to reduce reliance on safety factors as much as possible for a given design. In other industries (e.g., aircraft parts), components are often retired well before exhausting their useful lifetime to ensure that none fails during operation, even though the true operating life of a component could be much greater than its predicted life.
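The trade-off inherent in redundancy can be seen in a simple calculation. The sketch below assumes that the redundant components fail independently and that the system works as long as at least one of them works; both are simplifying assumptions, and the function name and reliability values are hypothetical.

```python
def parallel_reliability(component_reliability, n_redundant):
    """Reliability of n independent, identical components in parallel.

    The system fails only if every one of the components fails.
    """
    if not 0.0 <= component_reliability <= 1.0:
        raise ValueError("reliability must lie in [0, 1]")
    if n_redundant < 1:
        raise ValueError("at least one component is required")
    return 1.0 - (1.0 - component_reliability) ** n_redundant

# Hypothetical component reliability of 0.90 over some mission time.
single = parallel_reliability(0.90, 1)
duplicated = parallel_reliability(0.90, 2)
triplicated = parallel_reliability(0.90, 3)
```

Under these assumptions each added component yields a smaller gain in system reliability than the one before it, while part count, cost, and board area grow with every copy.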
In attempting to reduce reliance on safety factors, designers have developed models for the more prevalent damage mechanisms that lead to failures. Failures can be attributed to many different kinds of damage mechanisms, including electro-migration, buckling, and corrosion. Models for these mechanisms can be used during the design process, usually through deterministic analysis, to identify feasible design concept alternatives. Nevertheless, poor, or less than desired, reliability is often attributed to variability amongst the population of products, and deterministic analysis, which utilizes single values for all design, manufacturing, and usage variables to calculate a single value of reliability, cannot account for variability.
Variability affects reliability of electronic systems through any number of factors including loading scenarios, environmental condition changes, usage patterns, and maintenance habits. Even the response of a system to a steady input, such as a constant voltage supply, can exhibit variability due to parameters such as a varying ambient temperature.
Previously, the reliability of electronic devices has also been assessed using empirically-based models. Experimental design is a commonly used tool in which the experimental conditions are systematically varied and a mathematical relationship is “fit” to the data that represents the influence of the conditions on the time or cycles to failure. However, one drawback of this approach is the fact that there is so much variation in the time or cycles to failure that device life can only be conveyed in the form of a statistical average, such as mean time to failure (MTTF) or mean time between failures (MTBF). Although these statistical averages provide a general sense about average overall failure, they are a hold-over from a time when computer processing power was expensive. They only provide a single point number and offer no insight about real world probabilistic variation, true failure mechanisms, or the impact those mechanisms have on how a specific design will react to actual field conditions. Accordingly, although such metrics are appropriate in the context of manufactured fleet lot reliability, they lack the fidelity for accurate representation of individual device reliability in the field.
Over the years, probabilistic techniques have been developed for predicting variability and have been coupled with damage models of failure mechanisms to provide probabilistic damage models that predict the reliability of a population. But, given variability, a prediction of the reliability of a population says little about the future life of an individual member of the population.
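The contrast between a deterministic calculation and a probabilistic one can be sketched with a small Monte Carlo simulation. The life model below is a generic power law relating cycles to failure to strain range, used purely as a stand-in for a damage model; the coefficient, exponent, input distributions, and target life are all hypothetical and are not drawn from any data discussed herein.

```python
import random
import statistics

def cycles_to_failure(strain_range, coefficient=0.5, exponent=2.0):
    """Hypothetical power-law life model: N = (coefficient / strain_range) ** exponent."""
    return (coefficient / strain_range) ** exponent

def simulate_population(n_units, seed=0):
    """Sample a strain range and a material coefficient per unit; return the lives."""
    rng = random.Random(seed)
    lives = []
    for _ in range(n_units):
        strain = rng.lognormvariate(-4.6, 0.2)   # assumed scatter, median near 0.01
        coeff = rng.lognormvariate(-0.69, 0.3)   # assumed scatter, median near 0.5
        lives.append(cycles_to_failure(strain, coeff))
    return lives

# Deterministic analysis: one nominal value per input, one life out.
deterministic_life = cycles_to_failure(0.01)

# Probabilistic analysis: a distribution of lives across the population.
lives = simulate_population(10_000)
median_life = statistics.median(lives)
target_cycles = 1_000
fraction_failed = sum(n < target_cycles for n in lives) / len(lives)
```

The deterministic result is a single life that comfortably exceeds the target, whereas the simulated population includes units that fall short of it. Even so, `fraction_failed` describes the population as a whole and does not indicate which individual unit will fail early.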
Historically, testing has been the primary means for evaluating the effects of variability. Unfortunately, testing is slow, expensive and burdensome, and evaluation of every possible source of variability is impractical. Furthermore, it is simply not practical to directly sense the degradation of electronic components. Their damage states are usually structural and, due to their size, structural response signatures of electronic components are not directly monitored. For example, it would be both difficult and expensive to directly sense the cracking of a single emitter wire bond on a circuit board comprising thousands of emitter wires. Yet, the failure of a single emitter wire can cause the failure of the entire device. None of the electronics industries has traditionally used fatigue models to account for the large scatter in the properties of solder welds.
The trend away from physical testing has been forced in part because the cost of physical tests is rising whereas the cost of computer cycles is plummeting, thereby increasing the practicality of replacing the old “test it” paradigm with a “compute it” paradigm.
If there were an effective way to predict the impending failure of an electronic system, module, or component, operators could repair or retire a system before an actual failure, thus avoiding the negative consequences of failure. Thus, accurate prediction of impending failure could have great economic impact on industries whose products rely on electronic components, industries as diverse as aerospace, automotive, communications, medical device, domestic appliance, and related sectors.
In the case of fatigue failure, scatter in component life is quantified by a coefficient of variation (COV) which is usually determined based on a large number of fatigue life tests on many material specimens, or by full-scale testing of prototype electronic systems. Even under well-controlled laboratory tests of annealed smooth specimens at room temperature, the COV varies from less than 10% to over 500% for different interconnect alloys. Thus, the considerable scatter in the fatigue reliability of components in operation may be substantially attributed to considerable scatter of component material fatigue behavior.
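As a concrete illustration of the quantity involved, the coefficient of variation is the standard deviation of the observed lives divided by their mean. The sketch below uses the sample standard deviation, and the cycles-to-failure values are invented for the example; they do not come from any test.

```python
import statistics

def coefficient_of_variation(lives):
    """Sample standard deviation divided by the mean (dimensionless)."""
    mean = statistics.mean(lives)
    if mean == 0:
        raise ValueError("mean life is zero; COV is undefined")
    return statistics.stdev(lives) / mean

# Hypothetical cycles to failure for nominally identical specimens.
tight_scatter = [980, 1010, 1000, 1020, 990]
wide_scatter = [200, 1000, 5000, 400, 12000]

cov_tight = coefficient_of_variation(tight_scatter)   # a few percent
cov_wide = coefficient_of_variation(wide_scatter)     # above 100%
```

A COV above 100% means the standard deviation exceeds the mean, so an average life says very little about any one specimen.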
Life scatter of components made from a given material, on the other hand, is due to the fact that, generally, materials have inhomogeneous microstructures. To the naked eye, it may appear that a component is composed of continuous homogeneous material. However, microscopic examination reveals that metals, for example, are composed of discontinuous inhomogeneous material having individual crystalline grains, pores, and defects. Cracks nucleate and grow on the order of grain size according to the properties of the individual grains, with growth rates as varied as grain properties. As these cracks grow, the rate and behavior of the crack approach the bulk or average properties of the material. Therefore, for large cracks, traditional methods for modeling crack growth, such as elastic fracture, plastic fracture, and combined elastic/plastic fracture mechanics, are appropriate. Traditional methods, however, cannot determine the probability of crack initiation or describe crack growth of nearly grain-sized cracks. In many applications, failure can occur before the fatigue damage reaches the long crack stage because although the damage is very small, the strain energy associated with the damage is very high.
As a result, there exists a need for a method and apparatus for accurately predicting failure that accounts for the microstructural properties of materials and sequential variation in the loading, and relates them to fatigue scatter. In particular, there exists a need for a method and apparatus for accurately predicting electronic component, module, and/or system failure that accounts for variability without the need for extensive test data on the electronic component and/or system. This can be accomplished by accurately assessing a component's life by determining the effects of fatigue on it.
U.S. Pat. Nos. 7,006,947 and 7,016,825, both of which are incorporated by reference in their entirety, have shown that grain-by-grain simulation of the materials from which individual components are made has proven successful for fatigue life prediction on large structural components, as well as for providing prognoses of failure when using measured data. However, similar approaches to predicting the reliability of small-scale components such as interconnects have not previously been thought possible or practical.
The discussion of the background to the technology herein is included to explain the context of the technology. This is not to be taken as an admission that any of the material referred to was published, known, or part of the common general knowledge as at the priority date of any of the claims.
Throughout the description and claims of the specification the word “comprise” and variations thereof, such as “comprising” and “comprises”, is not intended to exclude other additives, components, integers or steps.