1. Field of the Invention
The present invention relates to techniques for testing computer systems. More specifically, embodiments of the present invention relate to a technique for determining an optimal stress test or combination of stress tests to characterize computer-system reliability.
2. Related Art
Many precursors of component failures in computer systems, as well as the associated failure mechanisms, can only be determined by applying a stressful load to the computer systems. For example, the stressful load may be applied for a period of time (typically, between a few hours and 24 hours) in an attempt to trigger a fault. This technique is often used during root-cause analysis (RCA) and to confirm intermittent failures in computer systems that are returned by customers. Typically, a variety of stress tests is used for these purposes, each of which applies a different load and thereby stresses different components in a given computer system.
During many of these stress tests, such as during an RCA for problems on system boards, the underlying effect of interest is temperature dynamics, which can trigger subtle failure mechanisms that cause intermittent failures. For example, these failure mechanisms may include: solder fatigue, interconnect fretting, delamination of bonded components, stresses caused by non-coplanarity of stacked components, and/or deterioration of connectors. Some stress tests are known to cause the temperatures of processors (or processor cores) and ASICs to increase significantly (for example, by 6-12 °C). Moreover, temperature cycling further accelerates the aforementioned failure mechanisms, because many of these failure mechanisms are associated with the cumulative effect of temperature cycling and temperature gradients in the computer systems.
Unfortunately, existing stress tests do not efficiently increase the occurrence of many of the failure mechanisms that affect computer systems. Moreover, for a given stress test and/or failure mechanism, the optimal test conditions for a given computer system are typically not known. Consequently, stress tests are often performed for a long time or on a large population of suspect computer systems in an attempt to trigger enough failures to enable proper RCA.
Hence, what is needed is a technique for characterizing stress tests to determine the optimal stress test and/or combinations of stress tests, as well as the associated test conditions, without the above-described problems.