1. Field of the Invention
The present invention relates to techniques for enhancing the reliability of computer systems. More specifically, the present invention relates to a method and an apparatus for efficiently clustering telemetry signals within a computer system to facilitate computer system monitoring.
2. Related Art
As electronic commerce becomes more prevalent, businesses increasingly rely on enterprise computing systems to process ever-larger volumes of electronic transactions. A failure in one of these enterprise computing systems can be disastrous, potentially resulting in millions of dollars of lost business. More importantly, a failure can seriously undermine consumer confidence in a business, making customers less likely to purchase goods and services from the business. Hence, it is important to ensure high availability in such enterprise computing systems.
To achieve high availability in enterprise computing systems, it is necessary to be able to capture unambiguous diagnostic information that can quickly pinpoint the source of defects in hardware or software. If systems have too little event monitoring, service engineers may be unable to quickly identify the source of a problem when it arises at a customer site. This can lead to increased downtime, which can adversely impact customer satisfaction and loyalty.
Fortunately, high-end computer servers, such as those manufactured by SUN Microsystems, Inc. of Santa Clara, Calif., are now equipped with over 1000 sensors that measure variables such as temperature, voltage, current, vibration, and acoustics. Software-based monitoring mechanisms also monitor system performance parameters, such as processor load, memory and cache usage, system throughput, queue lengths, I/O traffic, and quality of service. For example, SUN's telemetry harness collects over 25,000 soft variables in real time.
However, it is neither feasible nor desirable to build a pattern recognition engine to monitor as many as 25,000 variables concurrently. This is because, in general, not all the signals are correlated with each other, and the computational cost of analyzing so many signals concurrently is prohibitively high.
Among all the collected variables, many signals are entirely uncorrelated with each other. On the other hand, there are clusters of signals among which there is a high degree of correlation. Since signals from disparate clusters are not closely correlated, pattern recognition mechanisms, which derive information from correlations between signals, perform poorly if fed streams of data from uncorrelated signals.
Furthermore, the computational complexity of pattern recognition increases quadratically with the number of monitored signals. A system can, therefore, substantially reduce computational costs by dividing the collected signals into clusters and by monitoring each cluster separately.
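The scale of this saving can be illustrated with a short calculation. The following sketch assumes, purely for illustration, that the quadratic cost is proportional to the number of distinct signal pairs, and that the 25,000 signals divide into 100 clusters of 250 signals each. Neither the cost model nor the cluster sizes come from the description above; both are hypothetical.

```python
def pairwise_count(n: int) -> int:
    """Number of distinct signal pairs among n signals: n * (n - 1) / 2."""
    return n * (n - 1) // 2


def clustered_pairwise_count(cluster_sizes) -> int:
    """Total pairs when each cluster is analyzed separately."""
    return sum(pairwise_count(size) for size in cluster_sizes)


# Hypothetical partition: 25,000 signals split into 100 equal clusters.
total_signals = 25_000
cluster_sizes = [250] * 100

all_pairs = pairwise_count(total_signals)
clustered_pairs = clustered_pairwise_count(cluster_sizes)

print(f"pairs, all signals together: {all_pairs:,}")
print(f"pairs, clusters separately:  {clustered_pairs:,}")
print(f"reduction factor:            {all_pairs / clustered_pairs:.1f}")
```

Under these assumptions, monitoring the clusters separately involves roughly one hundredth of the pairwise work. Real clusters are unlikely to be equal in size, so the actual reduction would depend on how the signals group.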
Moreover, monitored signals often have time-varying phase shifts with respect to each other. These time-varying phase shifts are associated with the “speeding up” and “slowing down” of individual processes. Such dynamic phase changes may interfere with the processes of clustering and monitoring telemetry signals from a computer system.
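To make the notion of a phase shift between two signals concrete, the following sketch estimates a fixed lag by searching for the shift that maximizes the correlation between two sampled signals. It is a minimal illustration under stated assumptions, not a technique taken from the description above: the function names `pearson` and `estimate_lag`, the synthetic sine signals, and the search range are all hypothetical, and the lag is assumed constant over the window examined.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    var_x = sum((xi - mean_x) ** 2 for xi in x)
    var_y = sum((yi - mean_y) ** 2 for yi in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant sequence carries no correlation information
    return cov / math.sqrt(var_x * var_y)


def estimate_lag(a, b, max_lag):
    """Return (k, r): the shift k for which a[i] best matches b[i + k]."""
    best_k, best_r = 0, float("-inf")
    for k in range(-max_lag, max_lag + 1):
        lo = max(0, -k)               # first index of a in the overlap
        hi = min(len(a), len(b) - k)  # one past the last index of a
        if hi - lo < 2:
            continue                  # too little overlap to correlate
        r = pearson(a[lo:hi], b[lo + k:hi + k])
        if r > best_r:
            best_k, best_r = k, r
    return best_k, best_r


# Synthetic example: b is a copy of a delayed by 7 samples.
a = [math.sin(2 * math.pi * i / 50) for i in range(200)]
b = [math.sin(2 * math.pi * (i - 7) / 50) for i in range(200)]

lag, r = estimate_lag(a, b, max_lag=20)
print(lag, round(r, 3))
```

Because the phase shifts described above vary over time, a single estimate of this kind would hold only for the window it was computed on. One option is to repeat the estimate over successive windows and track how the lag drifts.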
Hence, what is needed is a method and an apparatus for efficiently clustering a large number of telemetry signals to facilitate accurate and efficient computer system monitoring.