One job that computer administrators often perform is to analyze the operation of the machines they oversee. To facilitate analysis, system software on each of these machines typically maintains a set of “performance counters.” The counters store various data relating to the operating status of the machines. Performance counters could reflect processor utilization, disk utilization, network traffic, or other aspects of the machine's operation. Machines typically maintain a few thousand counters, which indicate a wide variety of the machine's operational characteristics. The counters may be updated continually, so that, at any given point in time, the counters show the state of the system that exists at that time. Counter values may be captured recurrently (e.g., every minute or every hour). The captured counter values may be used for forensic analysis of the machine's operational health.
While counter values provide the raw data from which a machine's health theoretically can be assessed, in a real-world setting the amount of data may be too large to analyze, or even to store practicably. Many services are provided through server farms that have tens or hundreds of thousands of machines. If there are 100,000 machines in a server farm, each of which has 1,000 one-byte performance counters, then taking a snapshot of the performance counters across all 100,000 machines results in 100 megabytes of data. If the snapshot is taken once per hour, then the stored counter values amount to 2.4 gigabytes of data per day. That may not be an unmanageable amount, but once per hour might be too low a sampling rate to yield meaningful analysis. For example, a machine might experience a few two- to three-minute spikes in which processor utilization hits nearly 100% of capacity. These spikes would be of interest to an analyst since they likely reflect an impact on the performance of the machine. However, such spikes could go undetected if the sampling rate is once per hour. The sampling rate could be increased to, say, once per minute. But with the example numbers above, a once-per-minute sampling rate increases the amount of performance data collected to 144 gigabytes per day. Analyzing performance data collected at this frequency over a period of days or weeks would involve storing terabytes of data.
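The storage figures above follow from simple multiplication, which the following minimal sketch reproduces. The function name `bytes_per_day` and the use of decimal units (1 megabyte = 10**6 bytes, 1 gigabyte = 10**9 bytes) are assumptions made for illustration; the machine count, counter count, and counter size are the example numbers from the text.

```python
# Hypothetical helper for the storage arithmetic; not part of any real tool.
# Decimal units are assumed: 1 MB = 10**6 bytes, 1 GB = 10**9 bytes.

def bytes_per_day(machines, counters_per_machine, bytes_per_counter, snapshots_per_day):
    """Return the bytes stored per day for a given snapshot frequency."""
    bytes_per_snapshot = machines * counters_per_machine * bytes_per_counter
    return bytes_per_snapshot * snapshots_per_day

# Example numbers from the text: 100,000 machines, 1,000 one-byte counters each.
one_snapshot = bytes_per_day(100_000, 1_000, 1, 1)
hourly = bytes_per_day(100_000, 1_000, 1, 24)
per_minute = bytes_per_day(100_000, 1_000, 1, 24 * 60)

print(one_snapshot / 10**6, "MB per snapshot")
print(hourly / 10**9, "GB per day at one snapshot per hour")
print(per_minute / 10**9, "GB per day at one snapshot per minute")
```

At the once-per-minute rate, a single week of data already exceeds one terabyte under these assumptions.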
Storing that volume of data is problematic. However, even if such a large volume of performance data could be stored conveniently, that volume of data would be impractical to analyze in raw form. Certain kinds of abstractions, such as averages and standard deviations, are often applied to raw performance data in order to simplify analysis and to reduce the size of the data to be stored. However, these abstractions present other problems. Averages often strip away meaningful information. For example, knowing that a machine's average processor utilization over a 24-hour period is 25% does not say whether the machine is overloaded. An average of 25% utilization could mean that the machine spends all of its time with the processor at 25% utilization, which is probably a manageable load. However, the same 25% average could mean that the machine spends three-quarters of its time with its processor at 0% utilization and one-quarter of its time near 100%, in which case the machine spends one-quarter of its time in severe overload, and likely experiences performance degradation. Calculating a standard deviation may appear to address this problem by giving some sense of the distribution of the actual data relative to the average. However, a standard deviation describes data poorly when the distribution is not normal in the statistical sense (i.e., Gaussian), and many utilization scenarios on a machine are not normal.
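The 25% example can be made concrete with a small sketch. It builds two hypothetical days of once-per-minute processor utilization samples: one steady at 25%, and one that is idle for three-quarters of the day and at 100% for the remainder (the text says "near 100%"; exactly 100 is used here for simplicity). The sample series, the 90% overload threshold, and the helper `fraction_at_or_above` are all assumptions made for illustration.

```python
from statistics import mean, pstdev

MINUTES_PER_DAY = 24 * 60  # 1,440 once-per-minute samples

# Hypothetical utilization series, in percent.
steady = [25.0] * MINUTES_PER_DAY
bursty = [0.0] * (MINUTES_PER_DAY * 3 // 4) + [100.0] * (MINUTES_PER_DAY // 4)

def fraction_at_or_above(samples, threshold):
    """Return the fraction of samples at or above the given threshold."""
    return sum(1 for s in samples if s >= threshold) / len(samples)

OVERLOAD_THRESHOLD = 90.0  # assumed cutoff for "overload"

for name, samples in (("steady", steady), ("bursty", bursty)):
    print(
        name,
        "mean:", mean(samples),
        "std dev:", round(pstdev(samples), 1),
        "fraction in overload:", fraction_at_or_above(samples, OVERLOAD_THRESHOLD),
    )
```

Both series have the same 25% mean, yet only the second spends a quarter of the day in overload. The standard deviation does differ between the two, but by itself it does not reveal that the second series is concentrated at two extremes rather than spread around the mean.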
One way to simplify analysis of counter values, or other performance data, is to plot the data on a graph. However, it is difficult to glean certain types of information from a graph. For example, if a performance counter value is captured once per minute and plotted against time on a graph, it may be difficult to determine from a visual read of the graph what percentage of a day is spent idling or in overload situations. Moreover, if there are 100,000 machines and a graph is generated for each machine each day, then there are 100,000 graphs per day. In many cases, each machine would have more than one performance counter of interest, and thus there could be more than one graph per machine per day to interpret. Analyzing performance for a large number of machines (e.g., 100,000 servers) by interpreting graphs is very labor intensive, and may not be practical.