The invention relates to a system and methods for monitoring a set of metrics. More particularly, the invention provides a system and methods for grouping and correlating metrics based on alarm events.
Transactions are at the heart of web-based enterprises. Without fast, efficient transactions, orders dwindle and profits diminish. Today's web-based enterprise technology, for example, is providing businesses of all types with the ability to redefine transactions. There is a need, though, to optimize transaction performance and this requires the monitoring, careful analysis and management of transactions and other system performance metrics that may affect web-based enterprises.
Due to the complexity of modern web-based enterprise systems, it may be necessary to monitor thousands of performance metrics, ranging from relatively high-level metrics, such as transaction response time, throughput and availability, to low-level metrics, such as the amount of physical memory in use on each computer on a network, the amount of disk space available, or the number of threads executing on each processor on each computer. Metrics relating to the operation of database systems and application servers, operating systems, physical hardware, network performance, etc. all must be monitored, across networks that may include many computers, each executing numerous processes, so that problems can be detected and corrected when (or preferably before) they arise.
Due to the complexity of the problem and the number of metrics involved, it is useful to be able to call attention to only those metrics that indicate that there may be abnormalities in system operation, and to correlate and group such metrics, so that an operator of the system does not become overwhelmed with the amount of information that is presented. Correlating and grouping metrics may also assist operators to determine the cause of problems that arise, so that the proper corrections can be applied.
Unfortunately, due to the number of metrics, their complexity and their interrelations, the correlation and cluster analysis techniques that are commonly used in monitoring systems frequently are not sufficient. For instance, the linear correlation techniques that are commonly used in monitoring systems are sensitive to outliers (e.g., very bad data samples or spikes), and do not detect highly non-linear relationships between metrics.
Another difficulty is that large systems with thousands of variables make metric pair correlation, which is commonly used in monitoring systems, impractical. Many web-based enterprise systems are large disparate systems containing thousands of monitored variables. If typical metric pair correlation techniques are used, all metrics in the system must be paired up when the metric values are correlated. In a large system containing thousands of variables, there are millions of possible pairs of metrics to compare. For example, in a system having 1500 metrics, there would be (1500*1499)/2=1,124,250 metric pairs to be tested for correlation. This makes the task of testing every pair for correlation in such complex systems impractical.
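By way of illustration only, the pair count in the example above can be checked with a few lines of Python. The function name metric_pair_count is hypothetical and is not part of the described system.

```python
from math import comb


def metric_pair_count(n_metrics: int) -> int:
    """Number of unordered metric pairs among n metrics, i.e. n*(n-1)/2."""
    return comb(n_metrics, 2)


# The 1500-metric example from the text.
print(metric_pair_count(1500))  # 1124250
```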
Another problem with the techniques employed by typical monitoring systems is that they are difficult to apply to dynamic or adaptive systems. The conditions in a web-based enterprise system may change from day to day, week to week, or month to month. Some conditions change more frequently, e.g., minute to minute or hour to hour. In any case, these changes will be reflected in the values of the various metrics, and possibly in the interrelations between metrics. To monitor a complex web-based enterprise system successfully, it is necessary for the system to adapt dynamically to these changes when performing tasks such as generating alarms or determining which groups of metrics are associated with each other to assist in identifying the cause of a failure or problem.
To assist in determining the cause of a problem, monitoring systems typically use cluster analysis algorithms to find metrics that are associated with a key offending metric. The goal of such cluster analysis is to create groups of metrics containing closely related members.
Generally, cluster analysis is the process of grouping data into classes or clusters. A cluster is a collection of data objects (metrics, in the case of monitoring systems) that are similar to one another within the same cluster and dissimilar to the metrics in other clusters. Between clusters there is little association. A metric cannot belong to two clusters, and it is possible that a metric does not belong to any cluster.
When an operator queries the system to identify any metrics related to an offending alarm metric, the cluster of the offending metric is identified. Each metric in the cluster is known and this knowledge may provide clues as to the cause of an alarm event.
Unfortunately, the algorithms that are typically used for cluster analysis are complex and are difficult to apply to a dynamic or adaptive system. Since cluster analysis divides metrics into static disjoint groups, it is difficult to use these algorithms in situations where the relations between metrics may change, thereby changing the groups. Because of this inability to track the changing interrelations between metrics, a web-based enterprise management system that uses typical cluster analysis can overlook correlations of metrics between groups and thus miss possible problem causes.
In view of the foregoing, there is a need for a system and methods for monitoring complex systems having large sets of performance metrics, wherein the metrics may have non-linear relationships with each other, and wherein the metrics are correlated and grouped in a dynamic or adaptive manner. Some embodiments of the present invention, for example, use Spearman rank-order correlation to correlate metrics with non-linear relationships and outliers in the data. In another aspect of the present invention, bursts of threshold alarms are used to trigger intelligent data collection to discover correlated pairs of metrics without requiring that all possible metric pairs be tested for correlation. In a further aspect of the present invention, a dynamic correlation pair graph is used to determine which metrics are associated with a key metric. The correlation pair graph dynamically maintains all correlated relationships between the metrics.
In general, in one aspect, the system receives data associated with numerous metrics, and receives notification of groups of threshold violations associated with out-of-tolerance metrics. Data associated with each of the out-of-tolerance metrics in a group is synchronized, and the synchronized data is used to calculate correlation coefficients between the metrics in the group of out-of-tolerance metrics. By calculating the correlation coefficients only between metrics in a group of out-of-tolerance metrics, the system greatly reduces the number of metrics that it must attempt to correlate.
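As a minimal sketch of this restriction, the following Python fragment enumerates candidate pairs only among the metrics named in one group of threshold violations. The function name candidate_pairs and the metric names are assumptions made for illustration; how groups of violations are formed and delivered is not shown.

```python
from itertools import combinations
from typing import Iterable, List, Tuple


def candidate_pairs(alarm_group: Iterable[str]) -> List[Tuple[str, str]]:
    """Return each unordered pair of distinct out-of-tolerance metrics once."""
    names = sorted(set(alarm_group))  # de-duplicate repeated violations
    return list(combinations(names, 2))


# Hypothetical group: one metric violated its threshold twice.
group = ["db.query_time", "app.response_time", "host7.cpu", "app.response_time"]
pairs = candidate_pairs(group)
print(len(pairs))  # 3
```

Only these pairs would be passed on to synchronization and correlation, rather than every pair of monitored metrics.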
In some embodiments, the correlation coefficients are used to highlight performance information in an e-commerce system. Some embodiments use the correlation coefficients to determine the root cause of the threshold violations.
In some embodiments, the data associated with each out-of-tolerance metric includes historical data. Additionally, in some embodiments, receiving notification of a group of threshold violations includes receiving notification of a frequency of the threshold violations.
In some embodiments, synchronizing the data associated with the metrics in a group of out-of-tolerance metrics is performed by arranging the data associated with each out-of-tolerance metric into a time-ordered sequence, aligning each time-ordered sequence along a common time scale, and determining whether there is missing data in any of the time slots. In some embodiments, all the data associated with a time slot in which there is missing data is deleted. In other embodiments, such missing data is handled by pairwise deletion of data within a time slot.
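The synchronization steps and the two ways of handling missing data may be sketched as follows. This is an illustrative sketch only: it assumes each metric's samples are already keyed by an integer time-slot index, and the names Series, align, listwise_delete and pairwise_delete are hypothetical.

```python
from typing import Dict, List, Optional, Tuple

# Assumed layout: time-slot index -> sample value for one metric.
Series = Dict[int, float]
Aligned = Dict[str, List[Optional[float]]]


def align(series_by_metric: Dict[str, Series]) -> Tuple[List[int], Aligned]:
    """Place every metric on one common, time-ordered slot axis.

    A slot in which a metric has no sample is marked with None.
    """
    slots = sorted({t for s in series_by_metric.values() for t in s})
    aligned = {name: [s.get(t) for t in slots]
               for name, s in series_by_metric.items()}
    return slots, aligned


def listwise_delete(aligned: Aligned) -> Dict[str, List[float]]:
    """Delete every slot in which any metric in the group is missing."""
    names = list(aligned)
    n_slots = len(aligned[names[0]]) if names else 0
    keep = [i for i in range(n_slots)
            if all(aligned[m][i] is not None for m in names)]
    return {m: [aligned[m][i] for i in keep] for m in names}


def pairwise_delete(aligned: Aligned, a: str, b: str) -> Tuple[List[float], List[float]]:
    """Delete only the slots in which metric a or metric b is missing."""
    kept = [(x, y) for x, y in zip(aligned[a], aligned[b])
            if x is not None and y is not None]
    return [x for x, _ in kept], [y for _, y in kept]
```

Deleting whole time slots keeps every pair in the group on identical samples, whereas pairwise deletion retains more data for each individual pair.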
Some embodiments use rank correlation techniques to correlate the data, and more particularly, Spearman rank-order correlation may be used. By using rank correlation, the system is able to correlate metrics having non-linear relationships and outliers.
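For illustration, Spearman rank-order correlation can be computed as the ordinary (Pearson) correlation of the ranks of the samples, with tied values sharing their average rank. The sketch below is one textbook formulation in plain Python, not necessarily the computation used in any embodiment; the function names are hypothetical, and returning 0.0 for a constant series is an assumption.

```python
from math import sqrt
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """Rank values from 1..n; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2.0 + 1.0  # mean of ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Linear correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # assumption: a constant series is treated as uncorrelated
    return sxy / sqrt(sxx * syy)


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman rank-order correlation: Pearson correlation of the ranks."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length sequences of at least 2 samples")
    return pearson(average_ranks(x), average_ranks(y))
```

Because only the ordering of the samples matters, a monotonic but non-linear relationship (for example, one metric growing as the cube of another) or a single extreme spike leaves the rank coefficient unchanged, while the linear coefficient is reduced.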
In some embodiments, the data associated with each out-of-tolerance metric may be offset by an offset amount, and the correlation coefficients calculated based on the offset data. The system may then select either the correlation coefficient computed with the offset data, or the correlation coefficient computed based on the synchronized data, to determine whether metrics are correlated. In some embodiments, the data is offset by arranging the data into a time-ordered sequence having time slots, then shifting the time-ordered sequence associated with one of the out-of-tolerance metrics by a number of time slots relative to the data of the other metrics, and determining whether there is missing data in any of the time slots. By applying such an offset to the data, the system can detect correlations between metrics where one metric reacts earlier or later to a condition than other metrics within a group of out-of-tolerance metrics.
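The offset computation may be sketched as follows, purely as an illustration. The sketch shifts one time-ordered sequence against another by a number of time slots, drops slots with missing data pairwise, and computes a rank coefficient at each offset. Choosing the offset whose coefficient has the largest magnitude is an assumption made here, as are the names shifted_pairs and best_offset and the parameters max_offset and min_samples.

```python
from math import sqrt
from typing import List, Optional, Sequence, Tuple


def _ranks(values: Sequence[float]) -> List[float]:
    """Ranks 1..n, tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2.0 + 1.0
        i = j + 1
    return ranks


def rank_corr(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman coefficient computed as the Pearson correlation of ranks."""
    rx, ry = _ranks(x), _ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sxx = sum((a - mx) ** 2 for a in rx)
    syy = sum((b - my) ** 2 for b in ry)
    return 0.0 if sxx == 0 or syy == 0 else sxy / sqrt(sxx * syy)


def shifted_pairs(a: Sequence[Optional[float]], b: Sequence[Optional[float]],
                  offset: int) -> List[Tuple[float, float]]:
    """Pair a[t] with b[t + offset]; skip slots where either sample is missing."""
    pairs = []
    for t, x in enumerate(a):
        u = t + offset
        if 0 <= u < len(b) and x is not None and b[u] is not None:
            pairs.append((x, b[u]))
    return pairs


def best_offset(a: Sequence[Optional[float]], b: Sequence[Optional[float]],
                max_offset: int, min_samples: int = 3) -> Optional[Tuple[int, float]]:
    """Return (offset, coefficient) with the largest |coefficient|, or None.

    Offset 0 corresponds to the synchronized (unshifted) data.
    """
    best: Optional[Tuple[int, float]] = None
    for offset in range(-max_offset, max_offset + 1):
        pairs = shifted_pairs(a, b, offset)
        if len(pairs) < min_samples:
            continue  # too few overlapping samples to be meaningful
        r = rank_corr([p[0] for p in pairs], [p[1] for p in pairs])
        if best is None or abs(r) > abs(best[1]):
            best = (offset, r)
    return best
```

A positive offset in this sketch means that the second metric reacts later than the first by that many time slots.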
Computation of correlation coefficients with the offset data may generally be done in a manner that is the same or similar to the manner in which correlation coefficients for the synchronized data are computed. In some embodiments, this may include deletion of all data in time slots, or pairwise deletion of data, when there is missing data within a time slot. In some embodiments, this may include use of rank correlation techniques, such as Spearman rank-order correlation.
In some embodiments of the invention, the correlation coefficient associated with a pair of correlated metrics may be stored, and updated if arrival of a new group of out-of-tolerance metrics causes a change in the correlation coefficient. In some embodiments, stored correlations between a pair of metrics may be decreased in response to a lack of additional data supporting correlation. In some embodiments, the correlation coefficient may be decreased in response to a lack of additional data associated with the out-of-tolerance metrics over a predetermined time period. If the correlation coefficient for a pair falls below a predetermined threshold, the correlation coefficient associated with that pair may be deleted. In some embodiments, the correlation coefficient may be increased in response to additional data supporting correlation.
These correlations may be stored, in accordance with one aspect of the invention, as a correlation pair graph, in which each node in the graph represents a metric, and each edge or link between nodes represents a correlation between a pair of nodes. The updating and deleting of the correlation coefficients of pairs of metrics is reflected in the correlation pair graph, so that the system is dynamic and adaptive, with the correlations and the correlation pair graph changing as conditions change.
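One possible in-memory form of such a correlation pair graph is sketched below. The sketch is illustrative only: the class name CorrelationPairGraph, the multiplicative decay factor, and the pruning threshold are assumptions, and the manner in which a coefficient is decreased or increased is not limited to what is shown.

```python
from typing import AbstractSet, Dict, FrozenSet, Set

Pair = FrozenSet[str]


class CorrelationPairGraph:
    """Nodes are metric names; each edge stores the coefficient of a correlated pair."""

    def __init__(self, decay: float = 0.9, prune_below: float = 0.5) -> None:
        self.decay = decay              # assumed weakening factor per period
        self.prune_below = prune_below  # assumed deletion threshold on |coefficient|
        self.edges: Dict[Pair, float] = {}

    def update(self, a: str, b: str, coefficient: float) -> None:
        """Store or overwrite the coefficient for the pair (a, b)."""
        if a == b:
            raise ValueError("a pair needs two distinct metrics")
        key = frozenset((a, b))
        if abs(coefficient) < self.prune_below:
            self.edges.pop(key, None)   # too weak to keep as an edge
        else:
            self.edges[key] = coefficient

    def age(self, supported: AbstractSet[Pair] = frozenset()) -> None:
        """Weaken edges that received no supporting data; delete weak edges."""
        for key in list(self.edges):
            if key in supported:
                continue
            self.edges[key] *= self.decay
            if abs(self.edges[key]) < self.prune_below:
                del self.edges[key]

    def neighbors(self, metric: str) -> Set[str]:
        """Metrics currently linked to the given metric."""
        return {m for key in self.edges if metric in key
                for m in key if m != metric}
```

In this sketch, update would be called when a new group of out-of-tolerance metrics yields a coefficient for a pair, and age would be called once per predetermined time period.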
In another aspect of the invention, the system identifies metrics associated with a key metric by selecting those metrics that are correlated with the key metric, and that are also correlated with a predetermined percentage of other metrics that are correlated with the key metric. In some embodiments, the correlation pair graph structure may be used to make these determinations. Since the correlation pair graph structure (and the stored correlations) are dynamic and adaptive, the identification of metrics associated with the key metric will also be dynamic and adaptive.
In a further aspect of the invention, the system identifies metrics associated with a key metric by selecting metrics that, while not themselves correlated with the key metric, are correlated with at least a predetermined percentage of the metrics that are correlated with the key metric. As before, in some embodiments, the correlation pair graph structure may be used to make this determination, and the identification of the metrics associated with the key metric is dynamic and adaptive.
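The two selection rules described above may be illustrated with the following sketch, which represents the correlation pair graph as a symmetric mapping from each metric to the set of metrics with which it is correlated. The function names and the parameter min_share (the predetermined percentage, expressed as a fraction) are hypothetical. Keeping a metric when the key metric has no other correlated metrics to compare against is an assumption.

```python
from typing import Dict, Set

# Assumed layout: metric -> metrics it is correlated with (symmetric).
Graph = Dict[str, Set[str]]


def _share(candidate: str, others: Set[str], graph: Graph) -> float:
    """Fraction of `others` with which the candidate is correlated."""
    return len(graph.get(candidate, set()) & others) / len(others)


def associated_direct(graph: Graph, key: str, min_share: float) -> Set[str]:
    """Metrics correlated with the key metric and with at least min_share
    of the other metrics that are correlated with the key metric."""
    peers = graph.get(key, set())
    selected = set()
    for metric in peers:
        others = peers - {metric}
        # Assumption: with no other peers to check, the metric is kept.
        if not others or _share(metric, others, graph) >= min_share:
            selected.add(metric)
    return selected


def associated_indirect(graph: Graph, key: str, min_share: float) -> Set[str]:
    """Metrics not correlated with the key metric, but correlated with at
    least min_share of the metrics that are correlated with the key metric."""
    peers = graph.get(key, set())
    if not peers:
        return set()
    return {metric for metric in graph
            if metric != key and metric not in peers
            and _share(metric, peers, graph) >= min_share}
```

Because both functions read the graph at the time of the query, the metrics they return change as edges are added to, updated in, or deleted from the graph.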
In some embodiments, the methods of dynamically correlating metrics and identifying key metrics can be implemented in software. This software may be made available to developers and end users online and through download vehicles. It may also be embodied in an article of manufacture that includes a program storage medium such as a computer disk or diskette, a CD, DVD, or computer memory device. The methods may also be carried out by an apparatus that may include a general-purpose computing device, or other electronic components.