1. Field of the Invention
The present invention relates to techniques for proactively detecting impending problems in computer systems. More specifically, the present invention relates to a method and apparatus for mitigating quantization effects in telemetry signals while using the telemetry signals to detect impending problems in a computer system.
2. Related Art
Modern server computer systems are typically equipped with a significant number of hardware and software sensors which continuously monitor signals during the operation of the computer systems. Results from this monitoring process can be used to generate time series data for these signals, which can subsequently be analyzed (for example, using pattern-recognition techniques) to determine how a computer system is operating. One particularly useful application of this time series data is to perform real-time ‘proactive fault monitoring’ to identify leading indicators of component or system failures before the failures actually occur.
In many proactive-fault-monitoring applications, pattern-recognition techniques, such as those based on nonlinear kernel regression, are used to model the complex interactions among multivariate signals. Using these techniques, a pattern-recognition model is first constructed during a training phase, where correlations among the multiple input signals are learned. In a subsequent monitoring phase, the pattern-recognition model is used to estimate the value of each input signal as a function of the other input signals. Significant deviations between the estimated value and the measured value of the same signal are used to detect potential anomalies in the computer system under surveillance.
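The training and monitoring phases described above can be illustrated with a minimal sketch. It assumes a Nadaraya-Watson kernel regressor as a stand-in for the nonlinear kernel regression mentioned above; the function name `estimate_signal`, the Gaussian kernel, the bandwidth, and the synthetic two-signal data are all hypothetical choices made for illustration and are not taken from any particular monitoring system.

```python
import numpy as np

def estimate_signal(train, observation, target, bandwidth=1.0):
    """Estimate signal `target` of `observation` from its other signals.

    train:       (n_samples, n_signals) array of fault-free training data.
    observation: (n_signals,) array measured during the monitoring phase.
    """
    others = np.arange(train.shape[1]) != target
    # Squared distance between the observation and every training row,
    # computed over all signals except the one being estimated.
    d2 = np.sum((train[:, others] - observation[others]) ** 2, axis=1)
    weights = np.exp(-d2 / (2.0 * bandwidth ** 2))
    # Kernel-weighted average of the target signal's training values.
    return np.sum(weights * train[:, target]) / np.sum(weights)

# Training phase (synthetic, hypothetical data): signal 1 depends
# nonlinearly on signal 0.
rng = np.random.default_rng(0)
x = rng.uniform(0.0, 3.0, size=500)
train = np.column_stack([x, np.sin(x)])

# Monitoring phase: one healthy observation and one in which signal 1
# has drifted by 0.5 (an assumed fault).
healthy = np.array([1.0, np.sin(1.0)])
faulty = np.array([1.0, np.sin(1.0) + 0.5])

residual_healthy = healthy[1] - estimate_signal(train, healthy, target=1, bandwidth=0.1)
residual_faulty = faulty[1] - estimate_signal(train, faulty, target=1, bandwidth=0.1)
print(residual_healthy, residual_faulty)
```

Because the estimate of signal 1 is computed only from the other signals, it does not follow the drift in the faulty measurement, so the residual for the faulty observation is large while the residual for the healthy observation stays near zero.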
There are a number of criteria by which the performance of a model can be evaluated, including: (1) accuracy: ability of the model to correctly estimate the value of a signal in the absence of faults in the computer system; (2) robustness: ability of the model to maintain accuracy in the presence of signal disturbance (i.e., estimates should not track errors in a faulty signal); and (3) spillover: ability of the model to isolate a faulty signal (i.e., estimates of signal A should not be affected by a fault in signal B). Moreover, it is useful from a computational standpoint to minimize the number of input signals included in the model without compromising the performance of the model. This is because the computational cost for the pattern-recognition computations generally scales with the square of the number of input signals in the model.
Note that it is often useful to select an appropriate subset of signals from all the available input signals to be included in the pattern-recognition model. Moreover, pattern-recognition techniques typically attain high sensitivity by aggregating estimates from multiple models, each of which uses the same signals, but different time observations or samples. Unfortunately, conventional approaches for choosing an appropriate subset of signals and/or appropriate time samples for a pattern-recognition model have been primarily based on trial-and-error techniques in combination with rudimentary linear-correlation analysis, which are not sufficient to predict the nonlinear-correlation behaviors among the input signals. More significantly, there are often a large number of available signals in a computer system (e.g., greater than 1000 signals in a high-end server system). In these cases, computational cost can make it intractable to examine all possible combinations of these signals to determine the optimal subset to be included in a model using the conventional approaches.
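The scale of this search can be made concrete with a short calculation. The sketch below simply counts candidate subsets for 1000 available signals; the subset sizes shown are hypothetical and chosen only for illustration.

```python
import math

n_signals = 1000  # order of magnitude quoted above for a high-end server

# Number of distinct subsets of a given (hypothetical) size k.
for k in (2, 5, 10):
    print(k, math.comb(n_signals, k))

# Total number of non-empty subsets of any size.
total_subsets = 2 ** n_signals - 1
print(len(str(total_subsets)), "decimal digits")
```

Even if each candidate subset could be evaluated very quickly, a count of this size rules out exhaustive enumeration.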
Moreover, in some computer systems, low-resolution analog-to-digital chips (e.g., 8-bit A/D chips) are used to convert analog telemetry signals (such as temperature, voltage, and current signals) into digital telemetry signals. These low-resolution A/D chips often generate digitized telemetry signals that are severely quantized, which means that values for the quantized telemetry signals are reported using only a few ‘quantization levels.’ Unfortunately, pattern-recognition techniques typically cannot be applied to such low-resolution quantized telemetry signals.
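The quantization effect described above can be reproduced with a minimal sketch, assuming an idealized uniform converter. The function `quantize`, the 0 V to 5 V input range, and the synthetic 1.5 V signal are hypothetical and do not describe any specific A/D chip.

```python
import numpy as np

def quantize(signal, v_min, v_max, bits):
    """Map analog values to the nearest of 2**bits evenly spaced levels."""
    levels = 2 ** bits
    step = (v_max - v_min) / (levels - 1)
    # Integer output code for each sample, clipped to the converter's range.
    codes = np.clip(np.round((signal - v_min) / step), 0, levels - 1)
    return v_min + codes * step

# Hypothetical voltage signal: 1.5 V with a slow +/- 20 mV variation.
t = np.linspace(0.0, 10.0, 1000)
analog = 1.50 + 0.02 * np.sin(t)

# Same signal digitized over an assumed 0 V to 5 V range.
q8 = quantize(analog, 0.0, 5.0, bits=8)
q12 = quantize(analog, 0.0, 5.0, bits=12)

n8 = len(np.unique(q8))
n12 = len(np.unique(q12))
print("distinct levels at 8 bits:", n8)
print("distinct levels at 12 bits:", n12)
```

With these assumed values, the 8-bit signal takes on only a handful of distinct levels, because the variation in the signal is comparable to a single quantization step, whereas the 12-bit signal resolves the variation across many more levels.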
One solution to this problem is to use higher-resolution A/D chips. For example, a 12-bit A/D chip provides 16 times as many quantization levels as an 8-bit A/D chip (4096 versus 256). Unfortunately, such higher-resolution A/D chips are expensive, and retrofitting legacy systems that contain low-resolution A/D chips with such higher-resolution A/D chips is impractical.
Hence, what is needed is a computationally efficient technique for generating models from quantized telemetry signals for use in proactive fault monitoring without the problems described above.