1. Field of the Invention
The present invention relates to techniques for performing a root-cause analysis on a faulty computer system. More specifically, the present invention relates to a method and an apparatus that automatically identifies a failure mechanism associated with a signal measured from a faulty component in a computer system.
2. Related Art
Modern server systems are typically equipped with a significant number of sensors which monitor signals during the operation of the server systems. For example, these monitored signals can include temperatures, voltages, currents, and a variety of software performance metrics, including CPU usage, I/O traffic, and memory utilization. Outputs from this monitoring process can be used to generate time series data for these signals which can subsequently be analyzed to determine how well a computer system is operating.
One particularly useful application of this analysis technique is to facilitate “proactive fault-monitoring” to identify leading indicators of component or system failures before the failures actually occur. Typically, this is achieved by detecting anomalies in the signals which may potentially lead to system failures.
For example, a system can detect an anomaly in a monitored signal when the monitored signal exceeds a threshold level. More specifically, critical system variables can be measured and recorded at predetermined intervals, and the collected measurement values can be compared against predetermined threshold values. If a particular variable exceeds its corresponding threshold, a fault condition can be flagged.
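By way of illustration only, the threshold comparison described above might be sketched as follows. The signal names, threshold values, and the function name flag_threshold_faults are hypothetical and are not taken from any particular monitoring system.

```python
from typing import Dict, List, Tuple

def flag_threshold_faults(
    samples: List[Dict[str, float]],
    thresholds: Dict[str, float],
) -> List[Tuple[int, str, float]]:
    """Return (sample index, signal name, value) for every measurement
    that exceeds the upper threshold configured for its signal."""
    faults = []
    for index, sample in enumerate(samples):
        for name, value in sample.items():
            limit = thresholds.get(name)
            # Signals without a configured threshold are ignored.
            if limit is not None and value > limit:
                faults.append((index, name, value))
    return faults

# Hypothetical measurements recorded at predetermined intervals.
samples = [
    {"cpu_temp_c": 61.0, "core_voltage_v": 1.20},
    {"cpu_temp_c": 88.5, "core_voltage_v": 1.21},
    {"cpu_temp_c": 70.2, "core_voltage_v": 1.35},
]
# Hypothetical upper limits, one per monitored variable.
thresholds = {"cpu_temp_c": 85.0, "core_voltage_v": 1.30}

faults = flag_threshold_faults(samples, thresholds)
```

In this sketch a fault is flagged only at the moment a limit is crossed, which shows why a threshold check can report that something is wrong without indicating why.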
Another existing technique detects anomalies in monitored signals through pattern recognition. This technique compares measured time series data against learned “normal” signal patterns and detects anomalies in the measured time series data if abnormal correlation patterns are found. This technique is described in U.S. patent application Ser. No. 10/903,160, entitled, “Method for High Sensitivity Detection of Anomalous Signals in Systems with Low Resolution Sensing,” by inventors Kalyan Vaidyanathan, Aleksey Urmanov, and Kenny C. Gross.
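As a simplified illustration of comparing measurements against a learned "normal" pattern, the sketch below fits a linear relationship between two correlated signals using fault-free training data and then flags samples whose deviation from that relationship is unusually large. This is a generic residual check chosen for brevity and is not the technique of the above-referenced application; the signal names, data values, and the multiplier k are assumptions.

```python
from statistics import mean, pstdev
from typing import List, Tuple

Model = Tuple[float, float, float]  # (slope, intercept, residual sigma)

def learn_normal_pattern(x: List[float], y: List[float]) -> Model:
    """Fit y = slope * x + intercept by least squares on fault-free data
    and record the spread of the training residuals."""
    mx, my = mean(x), mean(y)
    sxx = sum((a - mx) ** 2 for a in x)
    if sxx == 0:
        raise ValueError("training signal x must not be constant")
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    slope = sxy / sxx
    intercept = my - slope * mx
    residuals = [b - (slope * a + intercept) for a, b in zip(x, y)]
    return slope, intercept, pstdev(residuals)

def find_anomalies(x: List[float], y: List[float],
                   model: Model, k: float = 3.0) -> List[int]:
    """Return indices where y deviates from the learned relationship
    by more than k training standard deviations."""
    slope, intercept, sigma = model
    return [
        i for i, (a, b) in enumerate(zip(x, y))
        if abs(b - (slope * a + intercept)) > k * sigma
    ]

# Hypothetical training data: CPU load (%) and temperature (deg C).
train_load = [10.0, 20.0, 30.0, 40.0, 50.0, 60.0]
train_temp = [41.2, 45.9, 51.1, 55.8, 61.0, 66.1]
model = learn_normal_pattern(train_load, train_temp)

# New measurements; the last temperature is far too high for its load.
new_load = [15.0, 35.0, 55.0]
new_temp = [43.4, 53.6, 75.0]
anomalies = find_anomalies(new_load, new_temp, model)
```

Here the third sample is flagged even though no fixed temperature limit is involved, because the temperature is inconsistent with the load observed at the same time.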
However, the above-described techniques have been developed to provide early fault detection rather than to identify the root cause of a fault condition. In other words, once an anomaly is detected using any of the above approaches, it is still left to a human repair engineer to diagnose the root cause of the anomaly. Unfortunately, a monitoring system that cannot identify the root cause of a fault also cannot indicate the appropriate corrective action for that fault.
In practice, some failure mechanisms are characterized by distinctive and reproducible dynamic signatures in the corresponding recorded time series data. Two specific examples which occur in certain types of computer servers are: (1) a “restart” of one of two redundant power supplies which generates a transient dynamic voltage pulse that can cause a machine to crash; and (2) a defective MPI-type socket undergoing a “reset” event that can cause the system board core voltage to spike downward and then slowly recover. In both of these examples, a field engineer monitoring the dynamic telemetry signature can immediately recognize the “fingerprint” of the underlying degradation mechanism. However, it is not practical to have humans watching these telemetry signatures on a 24×7 basis.
Hence, what is needed is a method and an apparatus for automatically performing a root-cause analysis to identify possible failure mechanisms for anomalous telemetry signals without the above-described problems.