A variety of methods are available for managing data, particularly computer system performance data. These methods typically collect and store performance data and produce reports based on that data. Such performance data tracks, for example, the amount of resources available on a system, the number of CPUs in use at a particular time, and the amount of physical memory available at a particular time. In addition, such methods collect data on how those resources are utilized. For example, CPU utilization (the percentage of time during the interval that each CPU was busy or idle) is monitored, as are the run queue length (the average number of processes waiting in line to use the CPU), memory utilization (the percentage of real memory in use), and the number of CPUs in a work group. These are just a few of the parameters that need to be monitored, stored, and analyzed.
When a computer system is being troubleshot (a real-time operation), or when a system is being viewed in real time, data is typically collected every 5 to 15 seconds and displayed for the user. Data this fine-grained is often needed to diagnose a performance problem. However, when archiving data for future use, it is not practical to store a sample for every 15-second period for each collected data parameter, especially when the data is typically archived for 6 months or longer. Thus, in order to store the data in a reasonable amount of storage space, management systems typically use sampling techniques in which the metric is measured once per sampling interval and stored. The assumption is that the data being sampled does not change significantly during the sampling interval, so the value at the time of the measurement is deemed representative of the entire interval. For fast-changing systems, such as computer systems, that assumption does not hold, and such a method is ineffective.
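The storage pressure described above can be made concrete with a small calculation. The sketch below is illustrative only: the 182-day retention period is an assumed stand-in for "6 months", the 300-second archive interval is an assumption borrowed from the five-minute intervals used elsewhere in this section, and the function name is hypothetical.

```python
SECONDS_PER_DAY = 24 * 60 * 60
RETENTION_DAYS = 182  # assumption: roughly six months of archived data


def samples_retained(interval_seconds, retention_days=RETENTION_DAYS):
    """Number of stored values for one metric at a fixed collection interval."""
    return (SECONDS_PER_DAY // interval_seconds) * retention_days


fine = samples_retained(15)     # one value every 15 seconds
coarse = samples_retained(300)  # one value every five minutes (assumed)

print(f"15-second samples kept per metric: {fine}")
print(f"5-minute samples kept per metric:  {coarse}")
print(f"Reduction factor: {fine // coarse}")
```

Under these assumptions, each metric needs about a million stored values at 15-second resolution, and this count is multiplied by every parameter collected on every monitored system.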
Another solution is to average the data. Thus, if the measurement system collects 20 samples during the interval, the values of those 20 samples are averaged when archiving, allowing the management system to store only one data point for the interval. However, averaging does not work for interactive systems, where users submit queries and wait for a response that is usually obtained in a matter of seconds. The demand from such workloads varies from one minute to the next. Thus, during a five-minute interval, the computer system may be idle much of the time and completely saturated for a short time. Performance may be unacceptably slow during the brief periods of overload, yet this overload may not show up when averaged with the long idle periods occurring in the same sampling interval. In this situation, a five-minute average is not a good representation of actual system operation.
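The masking effect of averaging can be illustrated with a short sketch. The utilization values below are invented for illustration: 20 samples taken at an assumed 15-second spacing over a five-minute interval, with 18 nearly idle samples and 2 fully saturated ones.

```python
from statistics import mean

SAMPLE_SPACING_SECONDS = 15  # assumption: 20 samples span a five-minute interval

# Hypothetical CPU utilization samples (percent busy) for one interval:
# mostly idle, with a brief burst of complete saturation.
samples = [5.0] * 18 + [100.0] * 2

average = mean(samples)  # the single value an averaging archiver would store
peak = max(samples)      # the overload that the average hides
saturated_seconds = sum(
    SAMPLE_SPACING_SECONDS for s in samples if s >= 100.0
)

print(f"Archived average utilization: {average}%")
print(f"Peak utilization in interval: {peak}%")
print(f"Time spent saturated:         {saturated_seconds} s")
```

With these invented numbers, the archived value is 14.5 percent even though the system was fully saturated for 30 seconds, which is long enough for interactive users to notice slow responses.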
Another major drawback of averaging-type systems stems from a more recent change in the nature of computing systems: vendors are introducing various forms of virtual partitions or virtual machines. These systems are dynamic, allowing resources to be added or removed very quickly. Thus, in any system where performance data is stored for subsequent use, it is important to be able to drill down to small increments of time to determine resource usage.
For example, assume a virtual machine that is idle for four minutes and has only one CPU allocated to it during those four minutes. If that virtual machine becomes very busy for the final minute of a five-minute measurement interval, and an additional five CPUs are added to handle the load, what should a management system report for the number of CPUs in the server during the five-minute interval? A tool that uses sampling will report either a “1” or a “6”, depending on when the sample is taken. A system that stores the average value will report that the server had an average of 2 CPUs. None of these values is particularly useful for understanding system operation during that five-minute interval.
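The figures in this example can be reproduced with a few lines of arithmetic. The sketch below assumes, for illustration, that the CPU count is observed once per minute; the variable names are hypothetical.

```python
from statistics import mean

# CPU count in each minute of the five-minute interval:
# one CPU for four idle minutes, then six CPUs (one plus five added)
# during the busy final minute.
cpus_per_minute = [1, 1, 1, 1, 6]

sample_at_start = cpus_per_minute[0]   # sampling early in the interval
sample_at_end = cpus_per_minute[-1]    # sampling late in the interval
average_cpus = mean(cpus_per_minute)   # what an averaging system stores

print(f"Sampled at start of interval: {sample_at_start}")
print(f"Sampled at end of interval:   {sample_at_end}")
print(f"Average over interval:        {average_cpus}")
```

The average works out to (4 × 1 + 1 × 6) / 5 = 2, a CPU count the virtual machine never actually had at any point in the interval.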