1. Field of the Invention
The present invention relates to techniques for improving the availability of computer systems. More specifically, the present invention relates to a method and an apparatus for improving availability by using a multi-dimensional sequential probability ratio test (SPRT) to proactively detect a failure condition before a resulting failure occurs.
2. Related Art
As information technologies become more prevalent, organizations, such as businesses and governments, are becoming more dependent on enterprise computing systems to process ever-larger volumes of electronic transactions. A failure in an enterprise computing system can be disastrous, potentially resulting in millions of dollars of losses in productivity and business. In addition, a failure can seriously undermine consumer confidence in a business, making customers less likely to purchase goods and services from the business. Hence, it is desirable to detect and correct failure conditions before they lead to catastrophic system failures.
One major cause of computer system failures is variation in operating conditions. During operation, computer systems typically require that operating parameters, such as temperature or vibration, stay within a predefined range. If the parameters vary beyond this operating range, system components can fail.
Operating parameters can vary beyond the safe operating range for many reasons. For example, a defective power supply, which operates at a high temperature, can cause a part of a system board to overheat beyond the safe operating temperature range. Similarly, a defective fan motor may vibrate excessively, causing the system board to vibrate, which can cause components on the system board to fail. If such anomalies are not detected in a timely fashion, they can result in catastrophic system failures.
To detect such anomalies, computer systems often employ threshold-based monitoring systems. A threshold-based monitoring system monitors various system parameters and determines whether each parameter is operating within a specified range. If the value of a monitored parameter goes out of the range, the threshold-based monitoring system generates a warning.
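The range check described above can be illustrated with a minimal sketch. The parameter names, the limits, and the helper `check_parameters` are hypothetical and chosen only for illustration; they are not taken from any particular monitoring product.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Threshold:
    """Allowed operating range for one monitored parameter (hypothetical values)."""
    low: float
    high: float


def check_parameters(readings, thresholds):
    """Return one warning string for each reading that falls outside its range."""
    warnings = []
    for name, value in readings.items():
        limit = thresholds[name]
        if not (limit.low <= value <= limit.high):
            warnings.append(f"{name}={value} outside [{limit.low}, {limit.high}]")
    return warnings


# Assumed limits for a board temperature sensor and a fan vibration sensor.
thresholds = {
    "board_temp_c": Threshold(low=10.0, high=70.0),
    "fan_vibration_g": Threshold(low=0.0, high=0.5),
}

# One sample per parameter; only the temperature is out of range.
readings = {"board_temp_c": 82.5, "fan_vibration_g": 0.2}
warnings = check_parameters(readings, thresholds)
```

Note that each sample is judged in isolation against fixed limits, so a single noisy reading is enough to trigger a warning.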
Unfortunately, threshold-based monitoring systems have many drawbacks. One of the main drawbacks is that the accuracy of a threshold-based monitoring system depends heavily on the accuracy of the transducers used to measure system parameters. Hence, if the transducers are imperfect and return noisy signals, they can cause the threshold-based monitoring system to malfunction. Moreover, process variations during sensor manufacturing can cause measurement differences between sensors. These measurement differences can also cause the threshold-based monitoring system to malfunction.
One existing approach to overcoming these drawbacks is to set wide thresholds, so that the monitoring system does not generate a large number of false alarms. Unfortunately, when thresholds are set widely, the monitoring system typically detects failure conditions only at an advanced stage, by which time it is too late to perform preventive maintenance. Detecting failures at such an advanced stage usually leads to forcibly shutting down the computer system for maintenance, which may result in loss of productivity and business.
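The trade-off between false alarms and late detection can be made concrete with a small synthetic example. Everything in this sketch is an assumption made for illustration: the signal is a made-up temperature trace with a deterministic stand-in for sensor noise, the drift begins at sample 100, and the two threshold values are arbitrary.

```python
# Deterministic stand-in for sensor noise, in degrees C (assumed pattern).
NOISE_PATTERN = [0.0, 2.0, -2.0, 4.0, -4.0]
DRIFT_START = 100  # sample index at which the hypothetical fault begins


def synthetic_temperature(t):
    """Nominal 50 C, plus a slow upward drift after DRIFT_START, plus noise."""
    drift = 0.2 * (t - DRIFT_START) if t >= DRIFT_START else 0.0
    return 50.0 + drift + NOISE_PATTERN[t % len(NOISE_PATTERN)]


def first_alarm(signal, high):
    """Return the index of the first sample above `high`, or None if none is."""
    for t, value in enumerate(signal):
        if value > high:
            return t
    return None


signal = [synthetic_temperature(t) for t in range(300)]

# A narrow threshold trips on noise alone while the system is still healthy.
narrow_alarm = first_alarm(signal, high=53.0)

# A wide threshold avoids the false alarm but reacts long after the drift starts.
wide_alarm = first_alarm(signal, high=70.0)
wide_delay = wide_alarm - DRIFT_START
```

In this trace the narrow threshold raises an alarm before any fault exists, while the wide threshold stays silent for many samples after the drift begins, which mirrors the late detection described above.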
Hence, what is needed is a monitoring system that accurately detects system anomalies at an early stage.