1. Field of the Invention
The present invention relates to techniques for enhancing availability and reliability within computer systems. More specifically, the present invention relates to a method and an apparatus for proactively detecting and correcting a failure sequence that leads to undesirable computer system behavior.
2. Related Art
As electronic commerce becomes more prevalent, businesses are increasingly relying on enterprise computing systems to process ever-larger volumes of electronic transactions. A failure in one of these enterprise computing systems can be disastrous, potentially resulting in millions of dollars of lost business. More importantly, a failure can seriously undermine consumer confidence in a business, making customers less likely to purchase goods and services from the business. Hence, it is desirable to be able to detect and correct failure sequences in computer systems before catastrophic system failures occur. (Note that the following disclosure and attached claims use the term “failure sequence” to refer to a sequence that leads to undesirable system behavior, such as a system crash or a system overload. This term is not meant to be limited to sequences that lead to system failures.)
One strategy for dealing with complex systems in safety-critical and mission-critical operations is called Condition Based Maintenance (CBM). The concept underpinning CBM is straightforward: proactively detect component failures, then isolate, replace, repair, or reconfigure before the component failures lead to a total system failure. However, in practice, it is difficult to prepare those learning CBM maintenance procedures for the psychological stresses involved in receiving and acting upon multiple sources of incoming information defining the state of the system, then taking correct actions expeditiously before cascading failures can lead to system catastrophe. Aviation pilots are first introduced to this environment in full-fidelity flight simulators. Nuclear reactor operators are similarly trained with full-fidelity plant simulators. In both cases, the large investment in simulation technology and in re-creating realistic human-computer interfaces (HCIs) is warranted because of the consequences of under-training, or training with unrealistic scenarios.
Although business-critical eCommerce datacenters do not have life-critical aspects as in the foregoing examples, the psychological stresses and potential for cognitive-overload scenarios are nevertheless very high. In fact, when multiple system components fail at the same time, human system operators can suffer from cognitive overload, which impedes their ability to take effective remedial actions. For example, there may be situations wherein error messages arrive from multiple locations in the software “stack” faster than the human system operator can interpret them.
Some systems aid the human operator by monitoring system parameters, such as the amount of free memory, and will trigger an alarm if a parameter exceeds or falls below a pre-specified univariate threshold value. This enables the system or the system operator to perform a remedial action before the system crashes.
Unfortunately, univariate thresholds are often poor predictors of an impending system crash. In many cases, a univariate threshold will fail to predict a crash until it is too late to take remedial action. Note that it is possible to set a threshold lower (or higher) to make it more likely to predict a crash. However, doing so can result in “false positive” detections of undesirable system behavior, which can cause remedial actions to be taken when they are not necessary and can consequently lead to inefficient resource utilization.
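By way of illustration only, the univariate threshold check described above, and the trade-off between late detection and false positives, can be sketched as follows. The function name, the free-memory readings, and the two threshold values are hypothetical and are not taken from any particular system.

```python
from typing import Iterable, List


def low_memory_alarms(free_mb: Iterable[float], threshold_mb: float) -> List[int]:
    """Return the sample indices at which free memory falls below the threshold."""
    return [i for i, value in enumerate(free_mb) if value < threshold_mb]


# Hypothetical free-memory readings (MB): a harmless transient dip at index 2,
# followed by a steady decline that is assumed to end in a crash after index 8.
samples = [900, 850, 380, 870, 820, 600, 400, 200, 50]

# A conservative threshold fires only on the final sample, leaving little
# time for remedial action.
tight = low_memory_alarms(samples, threshold_mb=100)

# A more sensitive threshold fires earlier in the decline, but also fires
# on the harmless dip at index 2 (a false positive).
loose = low_memory_alarms(samples, threshold_mb=500)

print(tight)
print(loose)
```

In this sketch, no single threshold value both ignores the transient dip and gives early warning of the decline, which is the limitation noted above.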
Hence, what is needed is a method and an apparatus that more effectively detects and corrects a failure sequence that leads to undesirable computer system behavior.