1. Field of the Invention
The present invention pertains to monitoring and measuring behavior of a managed system. More particularly, this invention relates to predicting system behavior of a managed system (e.g., a distributed application system) using image processing and pattern recognition techniques.
2. Description of the Related Art
One prior art monitoring solution for measuring a software application running on a computer system employs predetermined static threshold values to measure the performance of the application. FIG. 1 shows this prior art solution. The threshold value is typically determined based on experience and/or intuition. This prior art solution is acceptable for applications running on a single computer and involving very few measurements.
However, this prior art solution is not suitable for measuring large dynamic distributed applications with hundreds of metrics. As is known, a distributed application system operates in a distributed or federated computing environment. Such an environment is typically characterized by independent computer systems in individual administrative domains that are loosely coupled by networks and cooperate to provide computing resources for a global application (i.e., the distributed application). One example of such a distributed application system is the Internet.
One reason that the above-mentioned prior art solution is not suitable for large dynamic distributed applications with hundreds of metrics is that it is typically not sufficient to capture complex correlations between various metrics. In particular, it is not sufficient to capture complex correlations between metrics on different computer systems in different administrative or control domains. Another reason is that this prior art approach is not flexible enough to accommodate the dynamic behavior of the distributed application, which may change radically over time.
Another problem associated with the above-mentioned prior art solution is that the use of predetermined static threshold values is sensitive to spikes in the measured data. For example, and as can be seen from FIG. 1, if the value of one measurement exceeds the threshold for a short period of time due to a transient malfunction of the application, alarms will go off, signaling that a problem exists. This increases the number of false positives, which can be disruptive and, in some cases, costly.
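By way of illustration only, the spike sensitivity described above can be sketched as follows. The function name, the sample values, and the threshold of 300 are hypothetical and are not taken from FIG. 1; the sketch merely shows that a single transient sample is enough to raise an alarm under a static threshold.

```python
def static_threshold_alarms(samples, threshold):
    """Return the indices of all samples that exceed a fixed threshold."""
    return [i for i, value in enumerate(samples) if value > threshold]


# Hypothetical response-time readings in milliseconds.
# Index 3 is a one-sample transient spike; the application is otherwise healthy.
response_times = [110, 120, 115, 480, 118, 112]

alarms = static_threshold_alarms(response_times, threshold=300)
print(alarms)  # [3]
```

The alarm at index 3 is a false positive in the sense used above: the measurement returns to its usual level in the very next sample.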
Another prior art approach to monitoring a system with thresholds is referred to as baselining. The main idea of baselining is to automatically determine what the "normal" or "expected" value of a metric or measurement is. In general, a baseline is a representation of how a system behaves under normal conditions at various times. This is particularly useful for selecting threshold values that define desirable or acceptable ranges for each of the metrics as a function of the baseline for that metric. FIG. 2 shows a threshold range that is defined as a function of the baseline 11. As can be seen from FIG. 2, the curve 12 shows the upper threshold of the baseline 11 and the curve 13 shows the lower threshold of the baseline 11.
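A minimal sketch of baselining, under stated assumptions, is given below. It assumes that the baseline is the per-time-slot average of past periods and that the upper and lower thresholds are a fixed fraction above and below the baseline. Both choices, as well as the function names and the sample data, are illustrative assumptions and are not taken from FIG. 2.

```python
from statistics import mean


def build_baseline(history):
    """Average each time slot across past periods (e.g. the same hour on earlier days)."""
    return [mean(slot_values) for slot_values in zip(*history)]


def threshold_band(baseline, tolerance=0.25):
    """Derive lower and upper threshold curves as a function of the baseline."""
    lower = [b * (1 - tolerance) for b in baseline]
    upper = [b * (1 + tolerance) for b in baseline]
    return lower, upper


def out_of_band(samples, lower, upper):
    """Return the indices of samples that fall outside the threshold band."""
    return [
        i
        for i, (s, lo, hi) in enumerate(zip(samples, lower, upper))
        if s < lo or s > hi
    ]


# Hypothetical load measurements: three past days, four time slots per day.
history = [
    [100, 200, 400, 200],
    [120, 220, 380, 180],
    [110, 210, 420, 190],
]
baseline = build_baseline(history)
lower, upper = threshold_band(baseline)

today = [105, 215, 390, 300]
violations = out_of_band(today, lower, upper)
print(violations)  # [3]
```

In this sketch the reading of 390 in the third slot is accepted because the baseline for that slot is high, whereas the reading of 300 in the fourth slot is flagged because it is far above what is normal at that time.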
The advantage of this approach is the ability to automatically select threshold values that take into account the dynamic behavior of the system being monitored. However, problems are still associated with this prior art baselining approach. One problem is that the approach does not capture relationships between metrics, which reduces its predictive power and limits its use to a single metric or to predefined functions that represent known (and mostly simple) relationships. This means that the approach still looks at each individual measurement in isolation. Another problem associated with the approach is its sensitivity to several required parameters such as sampling rate and age factor. As a result, system behavior of the monitored system cannot be accurately predicted, and the prior art approach can only indicate problems when the problems actually occur. A further problem is its inability to classify problems; it can only tell when a problem occurs.
The above-described approaches detect problems only when they occur, which may not leave time to take action to correct the problem or to prevent it from happening. Thus, there exists a need for predicting system behavior of a distributed application system ahead of time and with a high degree of accuracy.
One feature of the present invention is to predict system behavior of a managed system.
Another feature of the present invention is to predict system behavior of a managed system with a high degree of accuracy.
Another feature of the present invention is to predict, with a high degree of accuracy, system behavior of a managed system using image processing and pattern recognition techniques.
A further feature of the present invention is to identify a set of patterns from system measurements of a managed system that can predict, with a high degree of accuracy, potential problems of the managed system.
A further feature of the present invention is to allow simultaneous comparison of multiple measurement metrics by combining multiple measurement metrics into one image.
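One possible way of combining multiple measurement metrics into one image is sketched below. The layout (one row per metric, one column per sample) and the rescaling of each metric to the range 0 to 255 are assumptions made for illustration; this section does not specify the image format. The metric names and values are hypothetical.

```python
def metrics_to_image(metrics):
    """Stack metric series into a 2-D grid of pixel intensities.

    Each row is one metric and each column is one sampling instant. Every row
    is rescaled to 0..255 so that metrics with different units are comparable.
    """
    image = []
    for series in metrics.values():
        lo, hi = min(series), max(series)
        span = (hi - lo) or 1  # avoid division by zero for a flat series
        image.append([round(255 * (v - lo) / span) for v in series])
    return image


# Hypothetical measurements taken at the same three instants.
metrics = {
    "cpu_percent": [20, 40, 100],
    "latency_ms": [100, 100, 300],
}

image = metrics_to_image(metrics)
print(image)  # [[0, 64, 255], [0, 0, 255]]
```

Because all metrics share the same column axis, a single column of the image shows the state of every metric at one instant, which is what allows them to be compared simultaneously.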
A system for predicting system behavior of a managed system includes a measurement module coupled to the managed system to generate measurement data of the managed system. The measurement data include current measurement data and past measurement data. The past measurement data contain an indication of a problem of the managed system. A pattern classification module is coupled to the measurement module to process the past measurement data into a plurality of representative pattern images, and to select one of the pattern images that best identifies the problem as a predictor pattern image. A pattern matching module is coupled to the pattern classification module and the measurement module to process the current measurement data into a plurality of pattern images using the same image processing technique that generates the predictor pattern image. The pattern matching module identifies any pattern image that matches the predictor pattern image to predict the problem in the managed system.
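The role of the pattern matching module can be sketched as follows, under stated assumptions. The sketch assumes that the current measurement data have already been converted into an image (rows are metrics, columns are sampling instants), that pattern images are fixed-width column windows of that image, and that two pattern images match when their mean absolute pixel difference is small. The windowing scheme, the similarity measure, the tolerance, and all names are hypothetical; they are not the image processing technique of the invention.

```python
def sliding_windows(image, width):
    """Yield (start_column, window) for every contiguous block of `width` columns."""
    n_columns = len(image[0])
    for start in range(n_columns - width + 1):
        yield start, [row[start:start + width] for row in image]


def mean_abs_difference(a, b):
    """Average absolute pixel difference between two equally sized images."""
    total = sum(
        abs(x - y)
        for row_a, row_b in zip(a, b)
        for x, y in zip(row_a, row_b)
    )
    return total / (len(a) * len(a[0]))


def find_matches(current_image, predictor, max_difference=10):
    """Return the start columns of all windows that match the predictor pattern image."""
    width = len(predictor[0])
    return [
        start
        for start, window in sliding_windows(current_image, width)
        if mean_abs_difference(window, predictor) <= max_difference
    ]


# Hypothetical predictor pattern image: metric 1 rising while metric 2 falls.
predictor = [
    [0, 128, 255],
    [255, 128, 0],
]

# Hypothetical image built from current measurement data.
current = [
    [10, 10, 0, 130, 250, 40],
    [200, 200, 250, 125, 5, 90],
]

matches = find_matches(current, predictor)
print(matches)  # [2]
```

A match at column 2 would be interpreted as a prediction that the problem associated with the predictor pattern image is about to occur.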
A system for generating a predictor pattern image for predicting system behavior of a managed system includes a storage that stores past measurement data that contains an indication of a problem of the managed system. A pattern classification module is coupled to the storage to process the past measurement data into a plurality of representative pattern images, and to select the predictor pattern image that best identifies the problem from the representative pattern images. The predictor pattern image predicts the occurrence of the problem by identifying any pattern of current measurement data of the managed system that matches the predictor pattern image.
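The selection of a predictor pattern image can be illustrated with the following sketch. This section does not state the criterion by which a pattern image "best identifies" the problem, so the sketch assumes one: each candidate is scored by the number of past windows it matches that preceded a problem, minus the number it matches that did not. The scoring rule, the similarity measure, the tolerance, and the data are all assumptions made for illustration.

```python
def mean_abs_difference(a, b):
    """Average absolute pixel difference between two equally sized images."""
    total = sum(
        abs(x - y)
        for row_a, row_b in zip(a, b)
        for x, y in zip(row_a, row_b)
    )
    return total / (len(a) * len(a[0]))


def select_predictor(windows, precedes_problem, max_difference=10):
    """Return the index of the past pattern image with the best assumed score.

    windows:          pattern images derived from past measurement data
    precedes_problem: parallel list of booleans, True where the window was
                      followed by an indicated problem
    """
    best_index, best_score = None, float("-inf")
    for i, candidate in enumerate(windows):
        hits = false_alarms = 0
        for window, flagged in zip(windows, precedes_problem):
            if mean_abs_difference(candidate, window) <= max_difference:
                if flagged:
                    hits += 1
                else:
                    false_alarms += 1
        score = hits - false_alarms
        if score > best_score:  # strict comparison keeps the earliest best candidate
            best_index, best_score = i, score
    return best_index


# Hypothetical single-metric pattern images from past measurement data.
windows = [
    [[0, 0, 0]],        # quiet period
    [[0, 128, 255]],    # steady ramp, followed by a problem
    [[5, 120, 250]],    # similar ramp, followed by a problem
    [[0, 5, 0]],        # quiet period
    [[250, 250, 250]],  # saturated, followed by a problem only once
]
precedes_problem = [False, True, True, False, True]

best = select_predictor(windows, precedes_problem)
print(best)  # 1
```

In this example the ramp pattern is selected because it recurs before problems and never appears during quiet periods.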