1. Field of the Invention
The present invention relates to the collection, analysis, and management of system resource data in distributed or enterprise computer systems, and particularly to the modeling and analysis of system resources and prediction of system performance.
2. Description of the Related Art
The data processing resources of business organizations are increasingly taking the form of a distributed computing environment in which data and processing are dispersed over a network comprising many interconnected, heterogeneous, geographically remote computers. Such a computing environment is commonly referred to as an enterprise computing environment, or simply an enterprise. Managers of the enterprise often employ software packages known as enterprise management systems to monitor, analyze, and manage the resources of the enterprise. Enterprise management systems may provide for the collection of measurements, or metrics, concerning the resources of individual systems. For example, an enterprise management system might include a software agent on an individual computer system for the monitoring of particular resources such as CPU usage or disk access. The enterprise management agent might periodically collect metric data and write to a data repository containing historical metric data, i.e., metric data previously collected over a period of time. This metric data can be used to create models of one or more computer systems in the enterprise for modeling, analysis, and prediction of system performance. As network-based client/server models have become more popular in enterprise-wide computing infrastructures, however, the associated performance issues have become more sophisticated and complicated as well.
The increasing complexity of computer systems and inherent limitations in hardware and software are fertile ground for the effects of chaotic behavior. Chaos is the unpredictable behavior of dynamical systems. When resource utilization is low, the system limitations are avoided or not exposed, and chaotic behavior is usually not a problem. However, when utilization is moderate to high, then system limits are reached. Common limitations include hardware limitations such as memory space and disk size and software limitations such as fixed buffer sizes and string lengths. When these system limits are reached, computer systems are more likely to break down and/or behave chaotically. The impact of chaotic behavior on the performance of a computer system can be enormous. In software, problems such as infinite loops, memory leaks, network waiting time-outs, and runaway processes often cause serious performance problems and even system shutdowns. Over time, for example, applications with memory leaks eventually use up most or all of available memory. Consequently, the I/O or paging subsystem is saturated with excessive paging, and the system's perceived processing power is reduced. Hardware glitches can also cause performance degradation. For example, when a network segment failure causes traffic to be routed through other segments, utilization increases on the other segments, and chaotic behavior may arise.
Typically, computer performance modeling has used the exponential assumption to model system behavior. Recently, however, the exponential assumption has come under scrutiny. Research has shown that some performance measurements, such as process service times and network traffic, are more chaotic than had been previously assumed. For instance, many recent empirical studies have suggested that UNIX CPU process lifetimes, disk file sizes, World Wide Web (WWW) file transfer sizes, and network traffic exhibit properties consistent with heavy-tailed or power-tailed (PT) distributions rather than exponential distributions. See, for example, W. E. Leland, M. S. Taqqu, W. Willinger, and D. V. Wilson, “On the Self-Similar Nature of Ethernet Traffic (Extended Version),” IEEE/ACM Trans. Networking, Vol. 2, No. 1, pp. 1-15, 1994; M. Crovella and A. Bestavros, “Self-Similarity in World Wide Web Traffic: Evidence and Possible Causes,” in Proceedings of SIGMETRICS Conference on Measurement and Modeling of Computer Systems, 1996; M. Greiner, M. Jobmann, and L. Lipsky, “The Importance of Power-Tail Distributions for Telecommunications Traffic Models,” Operations Research, 1999. Power-tail distributions, unlike exponential distributions, exhibit very “bursty” and “chaotic” behavior. Power-tail distributions are defined in the Glossary in the Detailed Description.
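The practical difference between the two assumptions can be illustrated numerically. The following Python sketch is offered only as an illustration and is not part of the described system: it uses a Pareto distribution as a stand-in for a power-tail distribution (an assumption, since the Glossary definition is not reproduced here), and the function names, the tail index of 1.5, and the unit mean are hypothetical choices. Both distributions are given the same mean so that only the tail shape differs.

```python
import math


def exponential_tail(x, mean=1.0):
    """P(X > x) for an exponential distribution with the given mean."""
    return math.exp(-x / mean)


def pareto_tail(x, alpha=1.5, mean=1.0):
    """P(X > x) for a Pareto distribution with tail index alpha > 1.

    The Pareto form is an assumed stand-in for a power-tail distribution.
    """
    # Minimum value (scale) that gives the requested mean: mean = alpha * scale / (alpha - 1).
    scale = mean * (alpha - 1.0) / alpha
    return 1.0 if x <= scale else (scale / x) ** alpha


# Compare the probability of exceeding 1, 5, and 20 times the mean.
for x in (1, 5, 20):
    print(x, exponential_tail(x), pareto_tail(x))
```

At twenty times the mean, the exponential tail probability is many orders of magnitude smaller than the Pareto tail probability, which is why a model built on the exponential assumption can badly underestimate the frequency of very large values.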
Power-tail distributions can explain many modeling and performance problems that have been considered “exceptional” in the past. When an exponential distribution is assumed to be present, performance predictions may be overly optimistic. This could mislead capacity planners in their decision-making and adversely affect the quality of service (QoS) experienced by end users. Nevertheless, existing tools for the analysis and prediction of performance are unable to construct models that account for the significant performance ramifications of chaotic behavior. Previous research into the identification of power-tail phenomena has focused on techniques to assess a particular property of power-tail distributions. At present, there are no known “generic” and efficient tests, algorithms, or methods in the prior art to identify whether or not independent and identically distributed random variables are power-tail distributed, which is one of the causes of chaotic behavior in enterprise-wide computer systems.
For the foregoing reasons, there is a need for accurate detection of chaotic or power-tailed behavior in computer systems.
The problems outlined above are in large part solved by various embodiments of a system and method for accurately and efficiently detecting chaotic or power-tailed behavior in computer systems. In one embodiment, the system and method are used in a distributed computing environment, i.e., an enterprise. The enterprise comprises a plurality of computer systems, or nodes, which are interconnected through a network. At least one of the computer systems is a monitor computer system from which a user may monitor the nodes of the enterprise. At least one of the computer systems is an agent computer system. An agent computer system includes agent software that permits the collection of data relating to one or more metrics, i.e., measurements of system resource usage on the agent computer system.
In one embodiment, analysis and/or prediction software receives a set of data points from agent software on one or more computer systems, wherein the set of data points represents a series of metrics. The data points are assumed to be independent and identically distributed. The analysis and/or prediction software determines whether renewal power-tail behavior is present in the set of data points by performing two or more analytic tests on the set of data points and then combining the results of the analytic tests to determine an overall likelihood of power-tail or chaotic behavior.
In a preferred embodiment, three analytic tests are performed: a first test to determine whether the largest sample of a set of data points exhibits large deviations from the mean, a second test to determine whether the set of data points exhibits a high variance, and a third test to determine whether the set of the largest data points exhibits properties consistent with large values in the tail portion of the power-tail distribution. The tests detect whether or not distinctive properties of a power-tail distribution are present in the set of data points. The tests can be performed in any order, and in other embodiments, fewer than three can be performed. Each test has two possible results: successful if the test indicates a likelihood of power-tail behavior, or unsuccessful if it indicates that power-tail behavior is unlikely. The results of the first analytic test, the second analytic test, and the third analytic test are then combined and compared with one another to determine the overall likelihood of a power-tail distribution in the set of data points.
In one embodiment, the first analytic test is performed by an algorithm for determining whether the largest sample in a set of data points exhibits large deviations from the mean. The largest order statistic or an approximation thereof, i.e., the substantially largest data point of the set of data points, is determined. The probability PD that a random variable X is greater than or equal to the substantially largest data point is computed. The probability PE that a random variable X is greater than or equal to the expected value of the substantially largest order statistic from the exponential distribution is computed. An arbitrarily small tolerance factor is determined. The final step of the first algorithm is to determine if the probability PD is substantially less than or equal to the tolerance factor and the probability PD is less than or equal to the probability PE. If the answer to the final step is affirmative, then the first test is successful. If the answer is negative, then the first test is unsuccessful.
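The steps of the first test can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the claimed implementation: it assumes that both probabilities are evaluated under an exponential distribution whose mean equals the sample mean, and it uses the known result that the expected largest of N independent exponential samples is the mean multiplied by the N-th harmonic number. The function name and the default tolerance of 1e-6 are hypothetical.

```python
import math


def largest_deviation_test(data, tolerance=1e-6):
    """Return True if the largest data point deviates strongly from the mean.

    Assumes positive data and an exponential reference distribution fitted
    to the sample mean (an assumption made for this sketch).
    """
    n = len(data)
    if n == 0:
        raise ValueError("data must not be empty")
    mean = sum(data) / n
    if mean <= 0:
        raise ValueError("data must have a positive mean")

    largest = max(data)  # largest order statistic of the sample

    # Expected largest of n exponential samples: mean * (1 + 1/2 + ... + 1/n).
    harmonic = sum(1.0 / k for k in range(1, n + 1))
    expected_largest = mean * harmonic

    # Tail probabilities under the fitted exponential distribution.
    p_d = math.exp(-largest / mean)           # P(X >= observed largest value)
    p_e = math.exp(-expected_largest / mean)  # P(X >= expected largest value)

    return p_d <= tolerance and p_d <= p_e
```

For example, `largest_deviation_test([1.0] * 99 + [1000.0])` is successful because the single extreme value is far beyond what an exponential distribution with the same mean would produce, whereas a set of identical values is unsuccessful.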
In one embodiment, the second analytic test is performed by an algorithm for determining whether the set of data points exhibits a high variance. The power-tail variance index for a power-tail distribution with a power-tail index α of 2 is computed. The variance of the set of data points is computed. The final step of the second algorithm is to determine if the variance of the set of data points is greater than or equal to the power-tail variance index. If the answer to the final step is affirmative, then the second test is successful. If the answer is negative, then the second test is unsuccessful.
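The comparison at the heart of the second test can be sketched in Python as follows. Because this section does not state how the power-tail variance index is computed, the sketch takes it as an input parameter rather than deriving it; the function name, the parameter name, and the choice of the population variance are assumptions made for illustration.

```python
import statistics


def high_variance_test(data, variance_index):
    """Return True if the variance of the data meets or exceeds the index.

    variance_index is the power-tail variance index for a tail index of 2,
    supplied by the caller because its formula is not given in this section.
    """
    if len(data) < 2:
        raise ValueError("at least two data points are required")
    # Population variance of the data set (an assumed choice for this sketch).
    return statistics.pvariance(data) >= variance_index
```

For example, `high_variance_test([1, 1, 1, 100], variance_index=10)` is successful, while `high_variance_test([1, 1, 1, 1], variance_index=10)` is unsuccessful.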
In one embodiment, the third analytic test is performed by an algorithm for determining whether the set of the largest data points exhibits properties consistent with large values in the tail portion of the power-tail distribution. The set of data points is normalized such that the expected value of the set of data points is 1. As in the first algorithm, the substantially largest data point of the set of data points is determined. The power-tail index α of the set of data points is estimated. The final step of the third algorithm is to determine if the power-tail index α of said set of data points is less than 2. If the answer to the final step is affirmative, then the third test is successful. If the answer is negative, then the third test is unsuccessful.
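The steps of the third test can be sketched in Python as follows. This section does not state how the power-tail index is estimated, so the sketch substitutes the Hill estimator, a standard tail-index estimator computed from the k largest data points; it is a swapped-in technique and not necessarily the estimator of the described embodiment. The function names and the default of using the largest tenth of the data are hypothetical. Positive data values are assumed.

```python
import math


def hill_tail_index(data, k):
    """Hill estimate of the tail index from the k largest data points."""
    if not 1 <= k < len(data):
        raise ValueError("k must satisfy 1 <= k < len(data)")
    ordered = sorted(data, reverse=True)
    threshold = ordered[k]  # the (k + 1)-th largest value
    log_sum = sum(math.log(x / threshold) for x in ordered[:k])
    if log_sum <= 0:
        raise ValueError("the k largest values must exceed the threshold")
    return k / log_sum


def tail_index_test(data, k=None):
    """Return (successful, alpha), where successful means alpha < 2."""
    n = len(data)
    mean = sum(data) / n
    normalized = [x / mean for x in data]  # sample mean becomes 1
    if k is None:
        k = max(1, n // 10)  # assumed default: largest tenth of the data
    alpha = hill_tail_index(normalized, k)
    return alpha < 2.0, alpha
```

The Hill estimate is unaffected by rescaling, so the normalization step is retained here only to mirror the steps described above. The choice of k trades bias against variance and would need tuning for real metric data.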
When three tests are performed, there are eight possible outcomes of the combined tests (i.e., 2^3 outcomes). If all three tests are successful, then the analysis and/or prediction software concludes that power-tail behavior is likely. If all three tests are unsuccessful, then the analysis and/or prediction software concludes that power-tail behavior is unlikely. If the results are a combination of successful and unsuccessful (i.e., 2 successful and 1 unsuccessful or 1 successful and 2 unsuccessful), then typically more data or analysis is needed to arrive at a conclusion.
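The combination rule can be sketched in Python as follows. This is a minimal illustration rather than the claimed implementation; the function name and the three result labels are hypothetical. The sketch also enumerates all eight outcomes for three tests.

```python
from itertools import product


def combine_test_results(results):
    """Combine per-test outcomes (True = successful) into an overall verdict."""
    results = list(results)
    if not results:
        raise ValueError("at least one test result is required")
    if all(results):
        return "likely"        # every test indicates power-tail behavior
    if not any(results):
        return "unlikely"      # no test indicates power-tail behavior
    return "inconclusive"      # mixed results: more data or analysis needed


# Enumerate the eight possible outcomes of three tests.
outcomes = {combo: combine_test_results(combo)
            for combo in product([True, False], repeat=3)}
```

Of the eight outcomes, one yields "likely", one yields "unlikely", and the remaining six yield "inconclusive". Because the function accepts any number of results, it also covers embodiments in which fewer than three tests are performed.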
In response to the detection or non-detection of chaotic or power-tailed behavior of one or more computer systems or networks in the enterprise, the system and method are operable to use this information in modeling and/or analyzing the enterprise. In various embodiments, the modeling and/or analyzing may further comprise one or more of the following: displaying the detection or non-detection of the power-tail distribution to a user, predicting future performance, graphing a performance prediction, generating reports, asking a user for further data, permitting a user to modify a model of the enterprise, and altering a configuration of the enterprise in response to the detection or non-detection of the power-tail distribution.