The efficient, optimum operation of large, complex systems, such as web-based enterprise systems, requires the monitoring, careful analysis and identification of system metrics that reflect the performance of the system, and the use of information regarding those metrics to identify probable root causes of performance problems in complex, distributed, multi-tier applications. This need arises during both pre-deployment load/stress testing and production operations, and typically requires quickly sifting through thousands of metrics across multiple application tiers and servers to determine a probable handful of problem areas that require remediation. It must also be noted that these needs become even more critical, and are even more difficult to satisfy, when personnel and expertise are limited.
A complex modern system, however, such as a web-based enterprise system, may involve thousands of performance metrics, ranging from relatively high-level metrics, such as transaction response time, throughput and availability, to low-level metrics, such as the amount of physical memory in use on each computer on a network, the amount of disk space available, or the number of threads executing on each processor on each computer, any of which or any combination of which may be significant to the operations of the system and abnormalities therein. Such metrics, which in general relate to the operation of database systems and application servers, operating systems, physical hardware, network performance, and so on, must all be monitored across networks that may include many computers, each executing numerous processes, so that problems can be detected and corrected when or, preferably, before they arise.
Some system monitoring methods of the prior art have attempted to identify and monitor only the metrics and combinations of metrics that are significant in representing the operation of a system and in detecting any abnormalities therein. Due to the complexity of modern systems, however, and because of the large number of possibly significant metrics and combinations of metrics, it is unfortunately common for a monitoring system or operator to miss monitoring at least some of the significant metrics or combinations of metrics.
Other systems of the prior art seek to avoid this problem by attempting to monitor as many metrics or combinations of metrics as possible. It is a relatively common occurrence, however, again due to the complexity of modern systems and the large number of possibly significant metrics and combinations of metrics, that a monitoring system or an operator will become overwhelmed by the volume of information that is presented and will as a result miss or misinterpret indications of, for example, system abnormalities. It is therefore advantageous to be able to clearly and unambiguously identify and provide information pertaining to only those metrics or combinations of metrics that are of significance to, or that usefully represent and reflect, the performance of the system, such as abnormalities in system operation.
The clear and unambiguous identification and presentation of metrics and combinations of metrics accurately reflecting system performance or problems, in turn, involves a number of data collection and processing operations, each of which involves unique and significant problems that have seriously limited the capabilities of performance monitoring systems in the prior art and, as a consequence, the performance of the systems being monitored.
The problems of the prior art begin with the original collection of meaningful and timely information regarding system performance metrics. A typical large system may involve perhaps thousands of performance metrics, each of which may or may not be of significance to the performance of the system or indicative of problems in the system. The first recurring problem is therefore to select and collect sufficiently accurate and timely information pertaining to a sufficiently large number of performance metrics so as to have a useful representation of the system operation and performance. Again, however, this is often an overwhelming task due to the potentially very large number of metrics of interest.
Another recurring problem area in performance monitoring systems and methods of the prior art is in identifying those metrics, from all of the metrics that can be or that are monitored, that represent a current or past aspect of interest of the system performance. This task is conventionally performed by detecting those performance metrics that are outside the bounds of normal operation, which is typically carried out by checking the values of the monitored metrics and combinations of metrics against threshold values. If the metric is within the range defined by the threshold values, then the metric is behaving normally. If, however, the metric is outside the range of values defined by the thresholds, an alarm is typically raised, and the metric may be brought to the attention of an operator. The determination and setting of thresholds, however, can adversely affect the operation of the system, since significant events may fail to trigger an alarm if a threshold is set too high, but an excessive number of false alarms can be generated if a threshold is set too low.
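The threshold comparison described above can be sketched as follows. This is a minimal illustration only: the `Threshold` class, the `check_metric` function, the sample values and the threshold settings are all hypothetical and are not taken from any particular monitoring system.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Threshold:
    """Hypothetical fixed range within which a metric is considered normal."""
    lower: float
    upper: float


def check_metric(value: float, threshold: Threshold) -> bool:
    """Return True (alarm) when the value lies outside [lower, upper]."""
    return value < threshold.lower or value > threshold.upper


# Hypothetical response-time samples in milliseconds; 480 and 910 are spikes.
samples = [120.0, 135.0, 480.0, 128.0, 910.0]

too_high = Threshold(lower=0.0, upper=1000.0)  # upper limit set too high
too_low = Threshold(lower=0.0, upper=125.0)    # upper limit set too low

# With the limit set too high, neither spike raises an alarm.
missed = [v for v in samples if check_metric(v, too_high)]
# With the limit set too low, ordinary values raise alarms as well.
noisy = [v for v in samples if check_metric(v, too_low)]

print("alarms with high limit:", missed)
print("alarms with low limit:", noisy)
```

In this sketch the same data yields either no alarms or alarms on nearly every sample, depending solely on where the fixed limit is placed.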
Many monitoring systems of the prior art allow an operator to set the thresholds beyond which an alarm should be triggered for each metric. In complex systems that monitor thousands of metrics, however, this is often impractical since setting such thresholds is usually labor intensive and is typically prone to errors, particularly since the value of a given metric, and thus its thresholds, may be influenced by a number of other metrics. Additionally, user-specified thresholds, which must typically be fixed due to the time and effort required to determine and set the thresholds, are inappropriate for many metrics, such as metrics for systems with time varying loads or performance levels.
In an attempt to mitigate such problems, including both the problems in determining and setting large numbers of thresholds and in determining and setting time varying thresholds, some systems provide a form of dynamically-computed thresholds using simple statistical techniques, such as standard statistical process control (SPC) techniques that assume that the distribution of values of a given performance metric fits some standard statistical distribution, and measure the values of the metric against the assumed statistical distribution. For example, the values of many performance metrics at least approximately fit a Gaussian, or “normal”, distribution and many commonly used SPC methods are based upon such “normal”, or Gaussian, distributions.
Unfortunately, however, the distribution of values of many metrics does not fit such “normal” or Gaussian distributions, making the thresholds that are set using typical SPC techniques inappropriate for such metrics or systems. For example, and as illustrated in FIG. 1, many performance metrics have values that at least approximately fit Gamma distributions, which are characterized by asymmetric distributions of values about a mean, while Gaussian or normal distributions are characterized by having symmetric distributions of the values about the mean. As a consequence, and again as illustrated in FIG. 1, if SPC techniques based on normal or Gaussian distributions are used to generate threshold values for a metric having a non-symmetric distribution, such as a Gamma distribution, the upper threshold will generally be set too low when the lower threshold is set correctly, and the lower threshold will be set too high when the upper threshold is set correctly. In a further example, many performance metrics exhibit self-similar or fractal statistical distributions and, again, typical SPC techniques using normal or Gaussian distributions will generally fail to produce optimal or even useful thresholds.
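The mismatch between symmetric limits and an asymmetric distribution can be illustrated numerically. The following sketch assumes a Gamma distribution with shape 2.0 and scale 1.0, a sample size of 10,000, and limits at the mean plus or minus three standard deviations; all of these parameters are illustrative assumptions and are not taken from FIG. 1.

```python
import random
import statistics

rng = random.Random(0)  # fixed seed so that the run is repeatable

# Assumed parameters: Gamma distribution with shape 2.0 and scale 1.0.
samples = [rng.gammavariate(2.0, 1.0) for _ in range(10_000)]

# Symmetric limits of the kind a Gaussian-based SPC technique would produce.
mean = statistics.fmean(samples)
sd = statistics.stdev(samples)
lower = mean - 3.0 * sd
upper = mean + 3.0 * sd

# Fraction of samples falling outside each limit.
frac_below = sum(v < lower for v in samples) / len(samples)
frac_above = sum(v > upper for v in samples) / len(samples)

print(f"lower limit {lower:.2f}, fraction below: {frac_below:.4f}")
print(f"upper limit {upper:.2f}, fraction above: {frac_above:.4f}")
```

For truly Gaussian data each tail would contain roughly 0.135 percent of the samples. Under the assumptions above, the lower limit is negative and therefore can never be crossed by a non-negative metric, while the upper limit is crossed considerably more often than the Gaussian model predicts.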
It should be noted that some attempts of the prior art to address this problem have done so by transforming the sample data from a non-symmetric distribution to a normal or Gaussian distribution. Such attempts have not been successful, however, because the underlying assumption, that is, that processes for normal statistical distributions are valid for non-symmetric distributions, is fundamentally fallacious no matter what is done to the data, since the data is still fundamentally non-symmetric. In addition, attempts to transform non-symmetric data into normal distribution data inherently and inescapably distort the data, again resulting in fundamentally erroneous results.
In a further example of the problems recurrent in detecting metric values by threshold comparison, many performance metrics exhibit periodic patterns, such as having value ranges varying significantly according to time-of-day, day-of-week, or longer activity cycles. Thus, for example, a metric may have one range of typical values during part of the day and a substantially different set of typical values during another part of the day. Current dynamic threshold systems typically fail to address this issue.
In a still further typical problem in metric thresholding, current dynamic threshold systems typically do not incorporate metric values occurring during alarm conditions, that is, during periods when the metric value exceeds the current threshold, when adjusting the threshold value for the metric to accommodate changes in the metric value range over time. The threshold system may, as a consequence, treat a long term, persistent change in the values of a metric as a series of short alarm conditions rather than as a long term shift in the metric value range. This situation will often result in numerous false alarm conditions until the threshold is adjusted to accommodate the shift in metric value range, which often requires manual adjustment of the threshold by an operator.
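The effect of excluding alarm-condition values from threshold adjustment can be sketched as follows. The function `count_alarms`, the exponentially weighted baseline, and the parameters `alpha`, `margin` and `baseline` are hypothetical choices made for illustration and do not describe any particular prior art system.

```python
def count_alarms(values, include_alarm_samples,
                 alpha=0.1, margin=20.0, baseline=100.0):
    """Count alarms raised by a threshold band centred on a moving baseline.

    The band is baseline +/- margin. The baseline is an exponentially
    weighted average; alpha, margin and the initial baseline are assumed
    values chosen only for this illustration.
    """
    alarms = 0
    for v in values:
        in_alarm = abs(v - baseline) > margin
        if in_alarm:
            alarms += 1
        # Typical behaviour described above: skip the update during alarms.
        if include_alarm_samples or not in_alarm:
            baseline += alpha * (v - baseline)
    return alarms


# Hypothetical metric that shifts permanently from 100 to 150.
values = [100.0] * 20 + [150.0] * 100

print("alarm samples excluded:", count_alarms(values, False))
print("alarm samples included:", count_alarms(values, True))
```

When alarm samples are excluded, the baseline never moves and every sample after the shift is reported as an alarm. When they are included, the band follows the new level after a short transient. Including alarm samples has its own cost, since a genuine fault can then be absorbed into the baseline, so the sketch illustrates the trade-off rather than a recommended policy.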
A still further recurring problem in monitoring the performance of a system lies in the methods for using the information regarding system metrics to identify probable root causes of performance problems in complex, distributed, multi-tier applications, such as identifying metrics and correlations and groupings of metrics and metric behaviors that particularly reflect system performance or abnormalities. This problem is typically due to or is aggravated by the large number of metrics and combinations of metrics and metric inter-relationships that must be considered.
For example, many systems of the prior art employ linear correlation techniques to detect the correlations between and groupings of metrics that are representative of certain system performance characteristics. Linear correlation techniques, however, are sensitive to “outlier” metric values, that is, erroneous measurements of metric values or metric values that are significantly outside the expected range of values, such as “spikes” in metric values. In addition, linear correlation techniques do not reliably or accurately detect or represent highly non-linear relationships between metrics.
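Both weaknesses of linear correlation can be shown with a small worked example. The sketch below computes the Pearson linear correlation coefficient directly from its definition; the `pearson` helper and all of the data values are hypothetical and chosen only for illustration.

```python
import math


def pearson(xs, ys):
    """Pearson linear correlation coefficient of two equal-length series."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


# Two hypothetical metrics in a perfectly linear relationship.
x = [float(i) for i in range(1, 11)]
clean = [2.0 * v for v in x]

# The same data with one spike in the fifth sample.
spiked = list(clean)
spiked[4] = 500.0

# A strictly deterministic but non-linear relationship.
sym_x = [float(i) for i in range(-5, 6)]
squared = [v * v for v in sym_x]

print("clean:     ", round(pearson(x, clean), 3))
print("one spike: ", round(pearson(x, spiked), 3))
print("quadratic: ", round(pearson(sym_x, squared), 3))
```

A single spike is enough to reduce the coefficient from essentially 1 to approximately 0, and the quadratic relationship yields a coefficient of 0 even though one metric is fully determined by the other.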
Another commonly employed method for identifying the metrics associated with or particularly indicative of system performance or abnormalities is metric pair correlation. Metric pair correlation, however, requires the comparison of each possible pair of metrics, so that metric pair correlation in a large system having perhaps thousands of metrics will typically require the comparison of millions of possible pairs of metrics.
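The growth in the number of comparisons follows from the fact that n metrics form n(n-1)/2 distinct pairs. A brief sketch, using hypothetical metric counts, makes the scale concrete:

```python
import math

# Hypothetical metric counts; n metrics form n * (n - 1) / 2 distinct pairs.
pair_counts = {n: math.comb(n, 2) for n in (100, 1_000, 5_000)}

for n, pairs in pair_counts.items():
    print(f"{n:>5} metrics -> {pairs:>10,} pairs")
```

Because the pair count grows with the square of the number of metrics, a tenfold increase in monitored metrics requires roughly a hundredfold increase in comparisons.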
Yet another method commonly used in determining the cause of a problem or abnormality is the use of cluster analysis algorithms to find metrics that are associated with a key offending metric, and the goal of such cluster analysis methods is to create groups of metrics containing significantly related members. In general, cluster analysis is the process of grouping metric data into classes or clusters of metrics, wherein a cluster is a collection of data objects, such as metrics, that are similar to one another within the same cluster and dissimilar to the metrics in other clusters. In principle, therefore, there is little association or correlation between clusters of metrics, and while a metric may not belong to any cluster, it cannot belong to more than one cluster. When an operator identifies a metric related to an offending alarm metric, therefore, the identification will also be of the cluster of which the metric is a member. Because all metrics in that cluster are known and are known to be causally related to the identified metric, the totality of information regarding the metrics of the cluster may provide clues as to the cause of the alarm event that identified the initial metric.
Unfortunately, the algorithms that are typically used for cluster analysis are complex and are difficult to apply to a dynamic or adaptive system. For example, cluster analysis divides metrics into static disjoint groups, so that it is difficult to use these algorithms in situations where the relations between metrics may change. That is, changes in the metrics may also result in changes in the relationships between the metrics and thereby in the clusters. It then becomes impossible or at least impractical to track the changing correlations between metrics and, as a result, a cluster analysis based system can overlook correlations of metrics between groups and thus miss possible problem causes.
Lastly, the above discussed problems with the techniques employed by typical monitoring systems in correlating or grouping metrics for the purpose of determining system performance or problems are further compounded because these methods are difficult to apply to dynamic or adaptive systems. That is, and for example, the conditions in a web-based enterprise system may change from day to day, week to week, or month to month, and some conditions change more frequently, such as minute to minute or hour to hour. In any case, these changes will be reflected in the values of the various metrics, and possibly in the interrelations between metrics. To monitor a complex web-based enterprise system successfully, therefore, it is necessary for the monitoring system to adapt dynamically to these changes when performing tasks such as generating alarms or determining which groups of metrics are associated with each other in such a way as to assist in identifying the cause of a failure or problem. As in the case of metric pair correlation, for example, the number of operations required to dynamically update the metric thresholds to reflect the current ranges of metric values will often render this method impractical.
The systems and methods of the present invention address these and other related problems of the prior art.