1. Field of the Invention
The present invention relates to techniques for enhancing reliability and availability within computer systems. More specifically, the present invention relates to a method and an apparatus for proactively monitoring computer system components for faults by using three-dimensional telemetric impulsional response fingerprint (3D TIRF) surfaces in combination with a two-dimensional Sequential Probability Ratio Test (2D SPRT).
2. Related Art
As information technologies become more prevalent, organizations, such as businesses and governments, are increasingly relying on enterprise computing systems to process ever-larger volumes of electronic transactions. A failure in one of these enterprise computing systems can be disastrous, potentially resulting in millions of dollars of losses in productivity and business. More importantly, a failure can seriously undermine consumer confidence in a business, making customers less likely to purchase goods and services from the business. Hence, it is desirable to detect impending failures and to prevent them before they actually occur.
To protect against catastrophic system failures, modern computer server systems are typically equipped with a significant number of sensors which monitor signals during operation of the computer systems. Results from this monitoring process can be used to generate time series data for these signals which can subsequently be analyzed to determine how well a computer system is operating.
One conventional technique for detecting impending faults is to employ threshold-limit rules while monitoring system variables such as temperature, voltage, current, RPM, etc. This technique generates an alarm condition if a variable starts to move out of a predetermined range. However, such threshold-limit techniques are sensitive to transient signal noise: a single noise spike that crosses a limit triggers a false alarm. To deal with this problem, a statistical Sequential Probability Ratio Test (SPRT) technique has recently been developed to detect impending faults by analyzing monitored time-domain signals. This technique reduces not only the false-alarm probability but also the missed-alarm probability.
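The contrast between the two passive techniques can be illustrated with a minimal sketch. The text above does not specify the form of the SPRT, so the sketch assumes Wald's classical test for an upward mean shift in Gaussian noise; the function names and the parameters `mu0`, `mu1`, `sigma`, `alpha`, and `beta` are hypothetical and chosen only for illustration.

```python
import math
from typing import List, Optional, Sequence


def threshold_alarms(samples: Sequence[float], low: float, high: float) -> List[int]:
    """Threshold-limit rule: return indices of samples outside [low, high]."""
    return [i for i, x in enumerate(samples) if x < low or x > high]


def sprt_first_alarm(
    samples: Sequence[float],
    mu0: float,           # assumed mean of the healthy signal
    mu1: float,           # assumed mean of the degraded signal
    sigma: float,         # assumed noise standard deviation
    alpha: float = 0.01,  # target false-alarm probability
    beta: float = 0.01,   # target missed-alarm probability
) -> Optional[int]:
    """Wald SPRT for a Gaussian mean shift (an assumed form, not the source's).

    Returns the index at which the degraded hypothesis is accepted,
    or None if no alarm is raised.
    """
    upper = math.log((1.0 - beta) / alpha)  # accept "degraded" at or above this
    lower = math.log(beta / (1.0 - alpha))  # accept "healthy" at or below this
    llr = 0.0                               # cumulative log-likelihood ratio
    for i, x in enumerate(samples):
        llr += (mu1 - mu0) / sigma**2 * (x - (mu0 + mu1) / 2.0)
        if llr >= upper:
            return i
        if llr <= lower:
            llr = 0.0  # healthy decision reached; restart the test
    return None


# A single noise spike trips the threshold rule but not the SPRT.
spike = [0.0] * 5 + [5.0] + [0.0] * 14
print(threshold_alarms(spike, low=-3.0, high=3.0))
print(sprt_first_alarm(spike, mu0=0.0, mu1=1.0, sigma=1.0))

# A small sustained shift never crosses the limits but accumulates in the SPRT.
drift = [0.0] * 10 + [1.0] * 20
print(threshold_alarms(drift, low=-3.0, high=3.0))
print(sprt_first_alarm(drift, mu0=0.0, mu1=1.0, sigma=1.0))
```

In this sketch the SPRT accumulates evidence over successive samples, which is why an isolated spike does not raise an alarm while a small, persistent shift eventually does.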
Unfortunately, both the threshold-limit and the SPRT techniques have a serious limitation: they are passive. In other words, they do not actively probe or perturb conditions of the components under surveillance. Although these techniques can catch many types of faults, other latent faults may appear only in response to dynamic stimulation. An analogy for these latent faults is a car that may have a problem during acceleration. This problem may not reveal itself during idling or while the car is cruising at a uniform speed.
As a remedy to the above limitation, an active-probing technique referred to as Telemetric Impulsional Response Fingerprint (TIRF) has been introduced to facilitate the dynamic assessment of the health of electronic components. Specifically, this technique introduces a subtle perturbation to an electronic component during operation through one or more signal inputs, and then generates the TIRF of the component for one or more observed physical and software variables. Next, the TIRF is compared with a reference TIRF produced from certified known-good electronic components of the same type. If the distance between the monitored TIRF and the reference TIRF exceeds a specified threshold, an alarm is generated.
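The comparison step described above can be sketched as follows. The text does not define the distance measure, so this sketch assumes that each TIRF is sampled at the same time points and uses the root-mean-square difference as the distance; the function names and the sample values are hypothetical.

```python
import math
from typing import Sequence


def tirf_distance(monitored: Sequence[float], reference: Sequence[float]) -> float:
    """Root-mean-square difference between two equally sampled TIRFs.

    The RMS metric is an assumption made for illustration only.
    """
    if len(monitored) != len(reference):
        raise ValueError("TIRFs must be sampled at the same points")
    if not reference:
        raise ValueError("TIRFs must not be empty")
    squared = sum((m - r) ** 2 for m, r in zip(monitored, reference))
    return math.sqrt(squared / len(reference))


def tirf_alarm(
    monitored: Sequence[float], reference: Sequence[float], threshold: float
) -> bool:
    """Raise an alarm when the distance to the reference exceeds the threshold."""
    return tirf_distance(monitored, reference) > threshold


# Hypothetical response of a known-good component to a small perturbation.
reference_tirf = [0.0, 1.0, 0.5, 0.0]
# Monitored response that differs from the reference at one sample.
monitored_tirf = [0.0, 2.0, 0.5, 0.0]

print(tirf_distance(monitored_tirf, reference_tirf))
print(tirf_alarm(monitored_tirf, reference_tirf, threshold=0.4))
print(tirf_alarm(monitored_tirf, reference_tirf, threshold=0.6))
```

Because the alarm decision rests on a single fixed threshold, one deviating sample is enough to trigger an alarm at the lower setting and is ignored entirely at the higher one.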
However, one problem with the TIRF technique is that if the specified threshold distance is set too low, false alarms can arise from spurious data values in the monitored TIRF. To avoid such false alarms, the threshold distance between the monitored TIRF and the reference TIRF can be set higher. However, a higher threshold distance allows degradation within the monitored system to develop further before an alarm occurs.
Hence, what is needed is a method for achieving higher sensitivity in detecting the subtle, incipient onset of failure mechanisms using the TIRF technique without the above-described problems.