1. Field of the Invention
The present invention generally relates to techniques for performing electronic prognostics for components in a system. More specifically, the present invention relates to a method and an apparatus that perform a real-time root-cause analysis for a degradation event associated with a component, based on degrading telemetry signals.
2. Related Art
An increasing number of businesses are using computer systems for mission-critical applications. In such computer systems, a component failure can have a devastating effect on the business. For example, the airline industry is critically dependent on computer systems that manage flight reservations, and would essentially cease to function if these systems failed. Hence, it is critically important to be able to measure component reliabilities in such systems to ensure that they meet or exceed reliability requirements.
Typically, component reliabilities are determined through “reliability-evaluation studies.” These reliability-evaluation studies can include: “accelerated-life studies,” which accelerate the failure mechanisms of a component; or “repair-center reliability evaluations,” wherein the vendor tests components returned from the field. These types of tests typically involve using environmental stress-test chambers to hold and/or cycle one or more stress variables (e.g., temperature, humidity, or radiation) at levels that are believed to accelerate subtle failure mechanisms within a component. The components under test are placed inside the stress-test chamber and subjected to those stress conditions.
While the components are under stress in the stress-test chamber, specific physical variables that indicate the health of the components are monitored. Outputs from this monitoring process can be used to generate time-series data for these variables, which are referred to as “telemetry signals.” These telemetry signals can be analyzed in real time using electronic prognostic techniques to detect anomalies and/or the onset of degradation in the telemetry signals, which can indicate potential component failures.
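For illustration only, the detection of the onset of degradation in a telemetry signal can be sketched as follows. This is a minimal sketch under stated assumptions, not the electronic prognostic technique referred to above: it assumes that the first samples of the signal represent healthy operation, and it flags a sustained deviation from that baseline. The function name detect_degradation_onset and the parameters baseline_len, threshold, and persistence are hypothetical.

```python
from statistics import mean, stdev


def detect_degradation_onset(signal, baseline_len=20, threshold=4.0, persistence=3):
    """Return the index at which a sustained deviation from the baseline begins.

    Assumption (not from the source): the first `baseline_len` samples are
    taken while the component is healthy. A sample is anomalous when it lies
    more than `threshold` baseline standard deviations from the baseline mean.
    Onset is reported only after `persistence` consecutive anomalous samples,
    so that a single noisy reading is not mistaken for degradation.
    Returns None if no sustained deviation is found.
    """
    baseline = signal[:baseline_len]
    mu = mean(baseline)
    sigma = stdev(baseline)
    if sigma == 0:
        # A perfectly constant baseline gives no scale for deviations;
        # this sketch simply declines to report an onset in that case.
        return None

    run = 0  # length of the current run of anomalous samples
    for i in range(baseline_len, len(signal)):
        if abs(signal[i] - mu) / sigma > threshold:
            run += 1
            if run == persistence:
                return i - persistence + 1  # first sample of the run
        else:
            run = 0
    return None


# Hypothetical temperature-like telemetry: 30 healthy samples, then a drift.
healthy = [50.0 + 0.1 * (i % 2) for i in range(30)]
drifting = [50.0 + 0.5 * k for k in range(1, 11)]
onset = detect_degradation_onset(healthy + drifting)
```

The persistence requirement is a design choice of this sketch: it trades a short detection delay for fewer false alarms.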
When component failures are detected or predicted by the electronic prognostics techniques, the faulty telemetry signals collected during the degradation processes are typically recorded for a subsequent root-cause analysis operation, which attempts to determine the “root-cause” of a failure. Knowing the root-cause of a failure allows similar failure events to be corrected or eliminated in the future.
Typically, root-cause analyses are performed “postmortem,” i.e., as a post-processing step after a component is determined to have failed. Postmortem root-cause analysis techniques rely on a priori knowledge of the failures that can occur in the component of interest, and therefore require that a comprehensive library of failure modes be created beforehand. These failure modes are typically extracted from past failure events and stored in a failure mechanism library. The newly recorded faulty telemetry signals are then compared against the failure modes in the failure mechanism library, and a root cause of the failure can be identified if a faulty telemetry signal matches a particular failure mode in the library.
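By way of example, the library comparison described above can be sketched as follows. This is a minimal sketch under stated assumptions rather than any particular postmortem technique: it assumes that each failure mode is stored as a reference signature of the same length as the recorded signal, and it uses Pearson correlation as the similarity measure. The function names, the min_similarity parameter, and the library entries are hypothetical.

```python
from math import sqrt


def pearson(a, b):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0
    return cov / sqrt(var_a * var_b)


def match_failure_mode(faulty_signal, library, min_similarity=0.9):
    """Return the name of the best-matching failure mode, or None.

    Assumption (not from the source): `library` maps a failure-mode name to a
    reference signature, and a match requires a correlation of at least
    `min_similarity`. Signatures of a different length are skipped.
    """
    best_name, best_score = None, min_similarity
    for name, signature in library.items():
        if len(signature) != len(faulty_signal):
            continue
        score = pearson(faulty_signal, signature)
        if score >= best_score:
            best_name, best_score = name, score
    return best_name


# Hypothetical failure mechanism library built from past failure events.
library = {
    "gradual_drift": [0, 1, 2, 3, 4, 5],
    "step_drop": [5, 5, 5, 1, 1, 1],
}

known = match_failure_mode([0.1, 1.0, 2.2, 2.9, 4.1, 5.0], library)
unknown = match_failure_mode([1, -1, 1, -1, 1, -1], library)
```

In this sketch, a signal that resembles no stored signature yields no match, which is the situation that arises when a failure mode is absent from the library.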
Unfortunately, such a priori knowledge of failure mechanisms is not always available for each failure event. Consequently, many root-cause analyses have to be performed with little or no information on the failure behavior of the components as they transition from a healthy state to a defective state. In such cases, a root-cause analysis may require a physical examination of the faulty components, which can be an extremely cumbersome task. For example, in many cases such a physical examination requires that the system containing the faulty component be disassembled so that the faulty component can be accessed. However, doing so can destroy evidence associated with the failure mechanism.
Hence, what is needed is a method and an apparatus that facilitates performing a root-cause analysis based on little or no a priori knowledge of the failure mechanism.