1. Field of the Invention
The present invention relates to signal processing. More specifically, the present invention relates to a method and an apparatus that correlate and align signals representing computer system performance parameters.
2. Related Art
The increasing complexity of server systems pressures support services on two fronts. First, it increases support cycle times. Second, it drives up the cost of labor, both in time per support incident and in the cost of acquiring expertise. While many costly system flaws exhibit subtle signs before the customer experiences a problem or an outage, these indicators are difficult to discern and even more difficult to match to impending problems.
Fault detection in complex systems typically requires costly on-line monitoring and expertise. Conventional approaches to identifying faults, which combine event correlation and threshold-based rules, have proven inadequate in a variety of safety-critical industries with complex, heterogeneous subsystem inputs not dissimilar to those from enterprise computing. Fundamentally, while many high-end servers are already rich in instrumentation, the data produced by the instrumentation are complex, non-uniform, and difficult to correlate. Improved real-time monitoring of system performance metrics, coupled with an improved Fault Management Architecture (FMA), provides key enabling technologies that can help proactively identify incipient faults and decrease support costs.
Some systems apply pattern recognition techniques to continuously monitored computer system performance parameters to identify faults. However, the effectiveness of pattern recognition in discerning incipient faults in noisy process data is highly dependent on the quality of information available from the instrumentation.
One challenge that has arisen in connection with the above objectives is deciding which signals are most valuable to monitor. Current high-end servers can have more than 1,000 variables that can potentially be monitored by real-time surveillance systems. It would be neither practical nor prudent to simply “monitor everything.”
One method for monitoring as many signals as possible is to correlate the signals and combine them into one signal that can be monitored by a pattern recognition system. However, in many high-end servers, the monitored signals are non-synchronous. Processes can speed up and slow down depending on many factors. Over time, signals generated by different processes can drift even further out of phase, which can greatly complicate the process of correlating the signals.
Furthermore, in large server computer systems, the monitored signals typically fall into a number of correlated groups. Signals within a given group are correlated with each other. However, there is little correlation between signals belonging to different groups. In order to efficiently correlate the signals, it is desirable to first “cluster” the signals into their respective correlated groups.
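The grouping step described above can also be sketched briefly. The following is one simple, illustrative approach (not taken from this document): compute the pairwise Pearson correlation matrix, then merge any two signals whose absolute correlation exceeds a threshold using a union-find structure, so that each resulting group contains mutually correlated signals. The threshold value and function names are assumptions for the sake of the example.

```python
import numpy as np

def cluster_by_correlation(signals, threshold=0.9):
    """Group signals whose pairwise absolute Pearson correlation
    exceeds `threshold`. Returns a list of index groups."""
    corr = np.abs(np.corrcoef(signals))  # pairwise |correlation| matrix
    n = len(signals)
    parent = list(range(n))

    def find(i):
        # Find the root of i's group, with path compression.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            if corr[i, j] >= threshold:
                parent[find(j)] = find(i)  # merge the two groups

    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return list(groups.values())

# Example: two correlated families of signals.
rng = np.random.default_rng(0)
t = np.linspace(0, 10, 500)
a = np.sin(t)
b = np.sin(t) + 0.05 * rng.standard_normal(500)  # noisy copy of a
c = t ** 2
d = t ** 2 + 1.0                                 # offset copy of c
groups = cluster_by_correlation([a, b, c, d], threshold=0.9)
```

Here signals 0 and 1 fall into one group and signals 2 and 3 into another, since cross-group correlations are weak. In practice, the clustering must contend with the non-synchronous sampling and phase drift noted above, which is why the present invention treats alignment and clustering together.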
Hence, what is needed is a method and an apparatus for correlating and clustering signals from numerous sources within a computer system, sources that are not only characterized by non-synchronous sampling intervals, but that may also be independently speeding up and slowing down while under surveillance.