1. Field of the Invention
The present invention relates to techniques for monitoring the health of components within a cluster of components. More specifically, the present invention relates to a method and apparatus for detecting multiple anomalies in a cluster of components.
2. Related Art
For mission-critical systems, it is desirable to minimize downtime. One technique for minimizing downtime is to provide redundancy. For example, a fail-over mechanism can be used to automatically switch from a system that has failed to a healthy system. Unfortunately, fail-over mechanisms take effect only after a system has failed.
It is also desirable to be able to determine the reliability of components used in mission-critical systems so that only components with a mean time between failures (MTBF) that exceeds the system specification are used. One technique for determining the reliability of components is to perform accelerated-life studies in which components are placed in stress-test chambers. However, it is typically not possible to apply pass/fail tests to components (or systems) while they are being stressed in the stress-test chambers. In practice, the components under stress are periodically removed from the stress-test chambers and tested to determine the number of components that have failed. The components that have not failed are then returned to the stress-test chambers and are again subjected to the desired accelerated-stress conditions. At the end of the accelerated-life study, a history of failed versus healthy component counts at discrete time intervals (e.g., at 100 hours, 200 hours, 300 hours, etc.) has been generated. This history can be used to predict the reliability of the components. Unfortunately, stopping an accelerated-life study to externally test the components is costly and time-consuming.
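As an illustration only, the following minimal sketch shows one way a history of failed versus healthy component counts at discrete inspection times could be turned into a rough MTBF estimate. It is not taken from the present disclosure. It assumes exponentially distributed lifetimes, assumes that each failure occurred at the midpoint of the inspection interval in which it was discovered, and uses a hypothetical function name (estimate_mtbf) and hypothetical sample counts.

```python
def estimate_mtbf(inspection_hours, cumulative_failures, total_components):
    """Rough MTBF estimate from periodic inspection counts.

    Assumptions (illustrative, not from the source document):
      * lifetimes are exponentially distributed, so
        MTBF ~= total time on test / number of failures;
      * a failure found at an inspection is treated as having occurred
        at the midpoint of the preceding inspection interval;
      * components still healthy at the last inspection are censored there.
    """
    if len(inspection_hours) != len(cumulative_failures):
        raise ValueError("one cumulative failure count is needed per inspection")

    total_time = 0.0   # accumulated component-hours on test
    prev_t = 0.0       # time of the previous inspection
    prev_failed = 0    # cumulative failures at the previous inspection

    for t, failed in zip(inspection_hours, cumulative_failures):
        new_failures = failed - prev_failed
        if new_failures < 0 or t <= prev_t:
            raise ValueError("inspection times and failure counts must be increasing")
        # Each newly failed component contributes time up to the interval midpoint.
        total_time += new_failures * (prev_t + t) / 2.0
        prev_t, prev_failed = t, failed

    # Survivors contribute the full duration of the study.
    total_time += (total_components - prev_failed) * prev_t

    if prev_failed == 0:
        return float("inf")  # no failures observed: estimate is unbounded
    return total_time / prev_failed


# Hypothetical study: 10 components inspected at 100, 200, and 300 hours,
# with 1, 3, and 4 cumulative failures found at those inspections.
print(estimate_mtbf([100, 200, 300], [1, 3, 4], 10))  # prints 600.0
```

Because the failure times are known only to within an inspection interval, the resolution of such an estimate is limited by how often the study is interrupted for external testing, which is the cost noted above.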
Even if a system is populated with components that are deemed reliable (e.g., having an MTBF greater than required by the system specification), these components can still fail prematurely. For example, operating a system in extreme heat can cause components in the system to fail prematurely. Hence, it is desirable to periodically monitor the components during operation of the system to determine whether any of the components are at the onset of degradation. If so, a remedial action (e.g., replacing a degrading component) can be performed preemptively to prevent an unexpected system failure.
One technique for monitoring the health of components is to use sensors within a system to detect the onset of degradation of components within the system. Unfortunately, as the number of components to be monitored increases, the number of sensors required to monitor these components also increases, which in turn increases both the cost of the system and the computing resources required to process the sensor data.
Hence, what is needed is a method and an apparatus for detecting anomalies in a cluster of components without the problems described above.