1. Field of the Invention
The present invention relates to techniques for monitoring computer systems. More specifically, the present invention relates to a method and an apparatus for high-sensitivity detection of anomalous signals in systems with low-resolution sensors.
2. Related Art
As electronic commerce becomes more prevalent, businesses increasingly rely on enterprise computing systems to process ever-larger volumes of electronic transactions. A failure in one of these enterprise computing systems can be disastrous, potentially resulting in millions of dollars of lost business. More importantly, a failure can seriously undermine consumer confidence in a business, making customers less likely to purchase goods and services from the business. Hence, it is critically important to ensure high availability in such enterprise computing systems.
To achieve high availability in enterprise computing systems, it is necessary to capture unambiguous diagnostic information that can quickly pinpoint the source of defects in hardware or software. Some high-end servers, which cost over a million dollars each, contain hundreds (or even thousands) of physical sensors that measure temperatures, voltages and currents throughout the system.
During the design process for high-end enterprise computer servers, designers must decide whether to use inexpensive 8-bit sample-and-hold analog-to-digital converters for the physical sensors in the system instead of, for example, more expensive 16-bit analog-to-digital converters. During system operation, a diagnostic software module polls these sensors on a regular basis, say once a minute, and compares the values against specified warning and critical thresholds. When a value exceeds a critical threshold, the diagnostic software powers off the component or shuts down the entire system to protect expensive assets.
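The polling and threshold comparison described above can be sketched as follows. This is only an illustrative sketch: the sensor names, the limit values, and the helper functions `classify` and `poll_once` are hypothetical and are not taken from any actual diagnostic module.

```python
def classify(value, warning, critical):
    """Compare one reading against upper warning and critical thresholds."""
    if value > critical:
        return "critical"
    if value > warning:
        return "warning"
    return "ok"


def poll_once(read_sensor, limits):
    """Run one polling pass.

    read_sensor(name) returns the latest reading for a sensor;
    limits maps each sensor name to a (warning, critical) pair.
    """
    return {name: classify(read_sensor(name), warn, crit)
            for name, (warn, crit) in limits.items()}


# Hypothetical sensors, limits, and readings, for illustration only.
limits = {"cpu0_temp_C": (75.0, 90.0), "vcore_V": (1.55, 1.60)}
fake_readings = {"cpu0_temp_C": 92.0, "vcore_V": 1.50}

status = poll_once(fake_readings.get, limits)
# A critical reading on any sensor would trigger the protective shutdown.
shutdown_needed = any(s == "critical" for s in status.values())
print(status, shutdown_needed)
```

In a real system the pass would be repeated on a timer (for example, once a minute) and the shutdown action would be specific to the component.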
The analog-to-digital conversion process can cause a loss of precision because only a limited number of discrete values are available to represent the original continuous signal. For example, with an 8-bit digital representation, a continuous signal is represented by only 2^8 = 256 digital values. In many cases, if one is interested only in protection of assets, such a representation is adequate. However, when sophisticated statistical methods are to be employed for advanced monitoring of the high-end servers, this coarse quantization of the digitized signals can severely limit the applicability of many surveillance techniques that rely on precise measurement to detect anomalous signals before they reach a critical threshold.
FIG. 1 illustrates quantized and measured values versus time for an exemplary voltage signal within the system. (Note that the signal being monitored could also be a temperature, a current, or some other physical parameter in the system.) The voltage signal is quantized into several 0.01-volt bins, as shown in the left-hand portion of FIG. 1. Each of these bins corresponds to one of the possible digital values in a quantized digital representation of the signal. However, these quantized values do not readily indicate the rising trend of the monitored signal, which is clearly evident in the measured values shown in the right-hand portion of FIG. 1.
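A small numerical sketch can make this masking effect concrete. It assumes an ideal quantizer with the 0.01-volt bin width mentioned above; the drift values are hypothetical and are not the data plotted in FIG. 1.

```python
def quantize(voltage, step=0.01):
    """Ideal quantizer: snap a voltage to the nearest multiple of `step`."""
    return round(voltage / step) * step


# Hypothetical slow upward drift: 1.5000 V rising by 0.5 mV per sample.
measured = [1.5000 + 0.0005 * i for i in range(9)]
quantized = [quantize(v) for v in measured]

for m, q in zip(measured, quantized):
    print(f"measured={m:.4f} V  quantized={q:.2f} V")
```

Every sample of the drift falls into the same bin, so the quantized sequence is flat even though the measured values rise steadily.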
Some systems attempt to mitigate the problems caused by this quantization by averaging or integrating the samples received from the analog-to-digital converters. The most common use of averaging/integration is to filter power line noise by integrating over one or several power line cycles. Statistically, averaging over a period of time reduces the standard deviation of the noise component by a factor of √N (equivalently, its variance by a factor of N), where N is the number of individual values over which the average is computed, provided the noise in those values is uncorrelated. Yet another useful effect of averaging is that the resulting signal has more distinct values than the original coarsely quantized signal.
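Both effects of averaging can be checked with a short simulation. The sketch below rests on stated assumptions: a constant 1.503 V signal, independent Gaussian noise with a standard deviation of 0.01 V, and an ideal quantizer with a 0.01 V step. The functions `adc_code` and `block_means` are hypothetical helpers introduced only for this illustration.

```python
import math
import random
import statistics


def adc_code(voltage, step=0.01):
    """Hypothetical ideal quantizer: integer code of the nearest bin."""
    return round(voltage / step)


def block_means(values, n):
    """Average consecutive, non-overlapping blocks of n values."""
    return [sum(values[i:i + n]) / n
            for i in range(0, len(values) - n + 1, n)]


rng = random.Random(0)  # fixed seed so the run is repeatable
N = 16

# Assumed signal: constant level plus independent Gaussian noise.
codes = [adc_code(1.503 + rng.gauss(0.0, 0.01)) for _ in range(20000)]
means = block_means(codes, N)

# Ratio of noise standard deviations before and after averaging.
ratio = statistics.pstdev(codes) / statistics.pstdev(means)
print(f"sd ratio: {ratio:.2f}  (sqrt(N) = {math.sqrt(N):.2f})")

# Averaged codes can take fractional values, so more levels appear.
print(f"distinct values: raw={len(set(codes))}, averaged={len(set(means))}")
```

The raw codes are integers, whereas each average of N codes can land on any multiple of 1/N of a code, which is why the averaged signal shows more distinct values.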
It is possible to average non-overlapping blocks of measurements from the original quantized signal, but that drastically reduces the number of available measurements (by a factor equal to the block length) and lengthens the time to decision by subsequent statistical procedures. On the other hand, averaging with a sliding window would preserve the number of measurements. However, because consecutive windows share most of their samples, a sliding window can introduce unwanted serial correlation in the resulting signals, making the subsequent analysis more complicated.
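The trade-off between the two averaging schemes can be illustrated with a short sketch. It assumes independent Gaussian noise as the input and uses the lag-1 autocorrelation as a simple measure of serial correlation; the function names and the window length of 8 are hypothetical choices made for this illustration.

```python
import random


def block_average(x, n):
    """Average non-overlapping blocks of n samples."""
    return [sum(x[i:i + n]) / n for i in range(0, len(x) - n + 1, n)]


def sliding_average(x, n):
    """Average over a window of n samples that advances one sample at a time."""
    window_sum = sum(x[:n])
    out = [window_sum / n]
    for i in range(n, len(x)):
        window_sum += x[i] - x[i - n]  # add the newest, drop the oldest
        out.append(window_sum / n)
    return out


def lag1_autocorr(x):
    """Sample autocorrelation between each value and its successor."""
    m = sum(x) / len(x)
    num = sum((a - m) * (b - m) for a, b in zip(x, x[1:]))
    den = sum((a - m) ** 2 for a in x)
    return num / den


rng = random.Random(1)  # fixed seed so the run is repeatable
n = 8
x = [rng.gauss(0.0, 1.0) for _ in range(8000)]  # assumed uncorrelated noise

block = block_average(x, n)
slide = sliding_average(x, n)
r_block = lag1_autocorr(block)
r_slide = lag1_autocorr(slide)

print(f"block:   {len(block)} values, lag-1 autocorrelation {r_block:.2f}")
print(f"sliding: {len(slide)} values, lag-1 autocorrelation {r_slide:.2f}")
```

For uncorrelated input, adjacent sliding windows of length n share n - 1 samples, so their lag-1 correlation is expected to be close to (n - 1)/n, while block averages remain nearly uncorrelated.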
Imposing threshold limits on stationary current and voltage signals is the present practice throughout the computing industry. However, if there is noise in the process, and if the thresholds are set too tightly, one can obtain false alarms from spurious noise values that have no performance significance. False alarms can result in extremely costly shutdowns of system boards or entire servers. As a result, threshold limits are frequently set at fairly wide levels (±5% of the nominal mean is typical). Research has shown that many failures appear as signal anomalies that are well within the typical threshold limits.
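The pressure toward wide limits can be illustrated with a rough calculation. The sketch below assumes zero-mean Gaussian noise that is independent from one poll to the next, and a once-per-minute polling rate; neither the noise model nor the specific limit widths are claimed to describe any particular system.

```python
import math


def false_alarm_probability(k_sigma):
    """P(|noise| > k*sigma) for zero-mean Gaussian noise (two-sided)."""
    return math.erfc(k_sigma / math.sqrt(2.0))


POLLS_PER_DAY = 24 * 60  # one poll per minute

for k in (2.0, 3.0, 5.0):
    p = false_alarm_probability(k)
    print(f"limits at +/-{k:.0f} sigma: p={p:.2e}, "
          f"expected false alarms per sensor per day={p * POLLS_PER_DAY:.3g}")
```

Under these assumptions, limits only a few noise standard deviations wide would produce frequent spurious alarms on every sensor, which is consistent with the practice of setting limits wide and, in turn, with anomalies going unnoticed inside them.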
Hence, in order to detect these anomalies, what is needed is a method and an apparatus for overcoming the “quantization effects” of the inexpensive 8-bit analog-to-digital converters described above.