1. Field of the Invention
The present invention relates to techniques for determining the quality and/or the reliability of a component in a system. More specifically, the present invention relates to a method and apparatus for determining the quality and/or the reliability of a component by monitoring dynamic variables associated with the component during an in-situ stress-test of the component.
2. Related Art
Computer system manufacturers routinely evaluate the quality and/or the reliability of individual computer system components to ensure that the computer systems manufactured from the components meet or exceed quality and reliability requirements of their customers. Typically, component reliabilities are determined through “reliability-evaluation studies.” These reliability-evaluation studies can include “accelerated-life studies,” which accelerate the failure mechanisms of a component, or “repair-center reliability evaluations,” wherein the vendor tests components returned from the field. On the other hand, the quality of components is determined through “burn-in screens,” which are designed to eliminate early failures prior to shipping to customers. These types of tests typically involve using environmental stress-test chambers to hold and/or cycle one or more stress variables (e.g. temperature, humidity, radiation, etc.) at levels that are believed to accelerate subtle failure mechanisms within a component. The components under test are then placed inside the stress-test chamber and subjected to those stress conditions.
Note that during a reliability-evaluation, even when subjected to the stress conditions, the components typically need to remain in the stress-test chambers for long time periods, which may range from hours to months, before failures are detected. Furthermore, it is usually not possible to apply pass/fail tests for the components or systems while they are in the stress-test chambers. Consequently, at predetermined time intervals, the components or systems under stress-test are typically removed from the stress-test chambers and are tested externally (referred to as “ex-situ” tests) to count the number of components that have failed. The components or systems that have not failed are then returned to the stress-test chambers and are tested further. In this way, the reliability-evaluation study generates a history of failed and not-failed component/system counts at discrete intervals, e.g. 100 hours, 200 hours, 300 hours and so on.
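The readout history described above can be pictured as a running tally of failed and not-failed units at each ex-situ test interval. The following is a minimal sketch for illustration only. The function name, the number of units, and the failure counts are hypothetical and are not taken from any actual study.

```python
def cumulative_failure_history(readouts, n_units):
    """Build a failed / not-failed history from ex-situ readouts.

    readouts: list of (hours, new_failures) pairs, where new_failures is the
              number of units first found failed at that readout.
    n_units:  number of units originally placed in the stress-test chamber.
    Returns a list of (hours, failed, not_failed, fraction_failed) tuples.
    """
    history = []
    failed = 0
    for hours, new_failures in readouts:
        failed += new_failures  # failures are only observed at readout times
        history.append((hours, failed, n_units - failed, failed / n_units))
    return history


# Hypothetical readouts at 100, 200 and 300 hours for 20 units under test.
example = cumulative_failure_history([(100, 0), (200, 1), (300, 2)], n_units=20)
for row in example:
    print(row)
```

Because failures are only counted at the readout times, the exact failure time of each unit within an interval remains unknown, which is one reason such studies need many units or long test windows.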
Unfortunately, these ex-situ reliability-evaluation studies, which use stress-test chambers, are both expensive and time-consuming. This is because most reliability-evaluation studies must trade off the number of units being stress-tested against the time the units remain under stress-test. More specifically, if the components are expensive and/or in very short supply (e.g. pre-manufacturing prototype components), so that only a few components can be stress-tested, then extremely long test windows are needed to obtain a statistically significant number of failures from which meaningful reliability conclusions can be drawn. On the other hand, if the components are inexpensive and readily obtainable, so that a large number of units can be placed under stress-test, then each time the stress-test is halted to evaluate how many units have failed (which involves measuring every unit under test), the ex-situ evaluation becomes extremely resource-intensive and consequently expensive.
One solution to the above-described problems is to test and evaluate component reliabilities while the components are under stress in the stress-test chamber. This requires the ability to monitor specific physical variables which indicate the health of the components and which can be obtained directly from the components under test. However, a primary challenge in effectively performing this type of reliability-evaluation technique is to come up with a way to precisely detect when a component is degrading under stress conditions. Co-pending patent application Ser. No. 11/219,091 (listed above) applies a Sequential Probability Ratio Test (SPRT) to those physical variables under surveillance. This technique can accurately identify the incipience or onset of gradual component degradation when the monitored variables are relatively stationary (in the statistical sense) with time.
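For illustration, a sequential probability ratio test on a single monitored variable can be sketched as follows. This is a generic textbook form of Wald's SPRT for a shift in the mean of a Gaussian signal with known variance, and is not asserted to be the implementation of the co-pending application. The function name, the parameter names (mu0, sigma, shift, alpha, beta), and the sample values are assumptions made for this sketch.

```python
import math


def sprt_mean_shift(samples, mu0, sigma, shift, alpha=0.01, beta=0.01):
    """Wald SPRT for a mean shift in a stationary Gaussian signal.

    H0: samples have mean mu0 (healthy); H1: mean mu0 + shift (degraded).
    Both hypotheses assume the same known standard deviation sigma.
    alpha / beta are the target false-alarm / missed-alarm probabilities.
    Returns the index of the sample at which H1 is accepted, or None.
    """
    lower = math.log(beta / (1.0 - alpha))   # accept-H0 threshold
    upper = math.log((1.0 - beta) / alpha)   # accept-H1 threshold
    llr = 0.0                                # cumulative log-likelihood ratio
    for i, x in enumerate(samples):
        # Log-likelihood ratio increment for N(mu0 + shift, sigma^2)
        # versus N(mu0, sigma^2).
        llr += (shift / sigma ** 2) * (x - mu0 - shift / 2.0)
        if llr >= upper:
            return i                         # degradation declared
        if llr <= lower:
            llr = 0.0                        # healthy decision: restart test
    return None


# Hypothetical signal: 20 healthy samples followed by a step of +1.0.
signal = [0.0] * 20 + [1.0] * 30
print(sprt_mean_shift(signal, mu0=0.0, sigma=1.0, shift=1.0))
```

The test relies on mu0 and sigma remaining fixed over time. If the healthy signal itself moves, the log-likelihood ratio can accumulate for reasons unrelated to degradation.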
Unfortunately, for many reliability studies, the telemetry signals from the stress-test chamber contain significant dynamic behavior. This can occur, for example, in accelerated-life studies if the system is actively varying one or more stress variables, causing the stress variables to introduce dynamics into the telemetry metrics. Dynamic behavior also occurs in early-failure-rate studies, burn-in screens and repair-center reliability evaluations. In these cases, the telemetry signals cannot be assumed to be stationary. Consequently, the stationarity assumption, on which the SPRT technique depends to operate correctly, may be invalid.
Hence, what is needed is a method and apparatus for detecting degradation in a component under test by monitoring telemetry signals which exhibit dynamic behavior.