1. Field of the Invention
The present invention relates to techniques for determining the reliability of a component in a system. More specifically, the present invention relates to a method and apparatus for determining the reliability of a component by identifying the onset of hardware degradation during an accelerated-life study of the component.
2. Related Art
An increasing number of businesses are using computer systems for mission-critical applications. In such applications, a component failure can have a devastating effect on the business. For example, the airline industry is critically dependent on computer systems that manage flight reservations, and would essentially cease to function if these systems failed. Hence, it is critically important to measure component reliabilities to ensure that they meet or exceed the reliability requirements of the computer system.
Unfortunately, determining the reliability of a component can be very time consuming if reliability testing is performed under normal operating conditions. This is because, under normal conditions, a highly reliable component can take an inordinate amount of time to fail.
Consequently, component reliabilities are often determined using “reliability-evaluation studies.” These reliability-evaluation studies can include “accelerated-life studies,” which accelerate the failure mechanisms of a component, or “burn-in studies,” which determine whether a particular component is functioning properly prior to being shipped to customers. These types of studies subject the component to stressful conditions, typically using environmental stress-test chambers to hold and/or cycle one or more stress variables (e.g., temperature, humidity, radiation flux, etc.) at levels that are believed to accelerate subtle failure mechanisms within the component. Note that, even under stress conditions, components typically need to be tested for time periods that may range from hours to months. Furthermore, it is usually not possible to test the components or systems while they are in the stress-test chambers. Consequently, the systems or components are typically removed from the stress-test chambers at intervals and tested externally to count the number that have failed. The systems or components that have not failed are then returned to the stress-test chambers and are stressed further. In this manner, a reliability-evaluation study generates a history of failed and not-failed system/component counts at discrete time intervals.
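By way of illustration only, the history of failed and not-failed counts at discrete time intervals described above can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not a description of any particular test apparatus: the names `InspectionRecord` and `build_history`, the inspection schedule, and the failure times are all hypothetical. In an actual study the exact failure times are not observable; they are supplied here only to simulate what each external inspection would find.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class InspectionRecord:
    """One external (ex-situ) inspection of the population under stress."""
    elapsed_hours: float  # cumulative stress time at this inspection
    failed: int           # cumulative number of units found failed
    not_failed: int       # units still functional, returned to the chamber


def build_history(
    failure_times_hours: Sequence[float],
    inspection_times_hours: Sequence[float],
) -> List[InspectionRecord]:
    """Return the failed/not-failed counts seen at each inspection.

    failure_times_hours holds hypothetical true failure times (assumed for
    the simulation only); float("inf") marks a unit that never fails.
    """
    total = len(failure_times_hours)
    history = []
    for t in sorted(inspection_times_hours):
        # A unit is counted as failed if it failed at or before this inspection.
        failed = sum(1 for ft in failure_times_hours if ft <= t)
        history.append(InspectionRecord(t, failed, total - failed))
    return history


# Hypothetical example: five units, inspected after 24, 48, 96, and 192 hours.
history = build_history(
    failure_times_hours=[30.0, 75.0, 75.0, 210.0, float("inf")],
    inspection_times_hours=[24.0, 48.0, 96.0, 192.0],
)
for record in history:
    print(record)
```

Note that each record only localizes a failure to the interval between two inspections. For instance, the unit assumed to fail at 30 hours is first counted at the 48-hour inspection, so the study can establish only that it failed between 24 and 48 hours.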
Unfortunately, reliability-evaluation studies are typically expensive and time consuming. These studies typically involve a tradeoff between the number of units under test and the time they are subjected to the stress test. If the components are expensive and/or in very short supply (e.g., pre-manufacturing prototype components, or high-end computer systems), long test windows are needed to obtain enough failures to draw statistically meaningful age-dependent reliability conclusions. On the other hand, if the components are cheap and readily obtainable, such that a large population of components can be placed under stress, the ex-situ functional testing becomes resource-intensive because the stress test needs to be halted frequently to evaluate how many units have failed.
Hence, what is needed is a method and an apparatus for determining the reliability of a component using a reliability-evaluation study technique without the above-described problems.