1. Field of the Invention
This invention relates generally to system performance monitoring, and in particular to performance monitoring of a distributed computer network system with a massive number of nodes or consoles.
2. Description of the Related Art
The data processing resources of business organizations are increasingly taking the form of a distributed computing environment in which data and processing are dispersed over a network comprising many interconnected, heterogeneous, geographically remote computers. Such a computing environment is commonly referred to as an enterprise computing environment, or simply an enterprise. Managers of the enterprise often employ software packages known as enterprise management systems to monitor, analyze, and manage the resources of the enterprise. Enterprise management systems may provide for the collection of measurements, or metrics, concerning the resources of individual systems. For example, an enterprise management system might include a software agent on an individual computer system to monitor particular resources such as CPU usage or disk access. U.S. Pat. No. 5,655,081 discloses one example of an enterprise management system.
In a sophisticated enterprise management system, tools for analysis, modeling, planning, and prediction of system resource utilization are useful for assuring the satisfactory performance of one or more computer systems in the enterprise. Examples of such analysis and modeling tools are the “ANALYZE” and “PREDICT” components of “PATROL Perform/Predict for UNIX or Windows” or “BEST/1 for Distributed Systems” available from BMC Software, Inc. Such tools usually require the input of periodic measurements of the usage of resources such as CPUs, memories, hard disks, network bandwidth, number of files transferred, number of visitors to a particular web page, and the like. The collection of accurate performance data is therefore critical to ensuring accurate analysis and modeling.
Many modern operating systems, including “Windows NT” and UNIX, are capable of producing an enormous amount of performance data and other data concerning the state of the hardware and software of the computer system. Such data collection is a key step for any system performance analysis and prediction. The operating system or system software collects raw performance data, usually at a high frequency, stores the data in a registry of metrics, and then periodically updates the data. In most cases, metric data is not used directly, but is instead sampled from the registry. Sampling at a high frequency can consume substantial system resources such as CPU cycles, storage space, and I/O bandwidth. Therefore, it is impractical to sample the data at a high frequency. On the other hand, infrequent sampling cannot capture the complete system state: for example, significant short-lived events and/or processes can be missed altogether. Infrequent sampling may therefore distort a model of a system's performance. The degree to which the sampled data reliably reflects the raw data determines the usefulness of the performance model for system capacity planning. The degree of reliability also determines the usefulness of the performance statistics presented to system managers by performance tools.
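The effect of infrequent sampling on short-lived events can be illustrated with a minimal sketch. The metric values, the 60-second window, and the helper names `sample` and `mean` are hypothetical and chosen only for illustration; they are not taken from any particular operating system or monitoring product.

```python
# Hypothetical raw metric: one value per second for 60 seconds.
# A short-lived burst occupies seconds 31 through 33; the rest is idle.
raw = [0] * 60
for t in (31, 32, 33):
    raw[t] = 100


def sample(series, interval):
    """Return every `interval`-th value, starting at index 0."""
    return series[::interval]


def mean(values):
    """Arithmetic mean of a non-empty list of numbers."""
    return sum(values) / len(values)


fine = sample(raw, 1)     # sampled every second
coarse = sample(raw, 10)  # sampled every 10 seconds (indices 0, 10, ..., 50)

print("1 s sampling:  max =", max(fine), "mean =", mean(fine))
print("10 s sampling: max =", max(coarse), "mean =", mean(coarse))
```

In this sketch the 10-second sampler never lands on the burst, so both the peak and the mean it reports are zero, whereas the 1-second sampler sees the burst at the cost of ten times as many samples.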
Sensitivity to sampling frequency varies among data types. Performance data can be classified into three categories: cumulative, transient, and constant. Cumulative data is data that accumulates over time. For example, a system CPU time counter may collect the total number of seconds that a processor has spent in system state since system boot. With transient data, old data is replaced by new data. For example, the amount of free memory is a transient metric that is updated periodically to reflect the amount of memory not in use. For transient metrics, the only way to find even approximate means, variances, or standard deviations is to sample periodically. The third type of performance data, constant data, does not change over the measurement interval or lifetime of the event. For example, system configuration information, process ID, CPU model type, and process start time are generally constant values.
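The difference between cumulative and transient metrics can be sketched as follows. The workload, the free-memory figures, and the function names `mean_from_counter` and `mean_from_samples` are assumptions made for illustration only. The sketch shows that the mean of a cumulative counter over an interval can be recovered from two readings, while the mean of a transient metric can only be estimated from the samples actually taken.

```python
# Hypothetical once-per-second view of one processor over 60 seconds:
# busy[t] is 1 if the CPU was in system state during second t, else 0.
busy = [1 if 20 <= t < 35 else 0 for t in range(60)]

# Cumulative metric: total seconds spent in system state since t = 0.
cpu_counter = [0]
for b in busy:
    cpu_counter.append(cpu_counter[-1] + b)

# Transient metric: free memory in MB, which dips while the CPU is busy.
free_mem = [512 - 400 * b for b in busy]


def mean_from_counter(counter, start, end):
    """Average utilization over [start, end) from two counter readings."""
    return (counter[end] - counter[start]) / (end - start)


def mean_from_samples(series, interval):
    """Estimate the mean of a transient metric from periodic samples."""
    samples = series[::interval]
    return sum(samples) / len(samples)


true_mean_free = sum(free_mem) / len(free_mem)

print("CPU utilization from two readings:", mean_from_counter(cpu_counter, 0, 60))
print("True mean free memory (MB):       ", true_mean_free)
print("Estimate, 30 s sampling (MB):     ", mean_from_samples(free_mem, 30))
```

Here two counter readings give the exact average utilization for the interval, but the 30-second sampler misestimates mean free memory because it sees only two instants of a value that was overwritten in between. The sketch covers means only; estimating the variance of either kind of metric still depends on how often it is sampled.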
Of the three data types, transient performance metrics are the most sensitive to variations in the sampling interval and are therefore the most likely to be characterized by uncertainty. For example, with infrequent sampling, some state changes may be missed completely. However, cumulative data may also be rendered uncertain by infrequent sampling, especially with regard to the calculation of the variation of such a metric. Clearly then, uncertainty of data caused by infrequent sampling can cause serious problems in performance modeling. A related patent application titled “Enterprise Management System and Method Which Include Statistical Recreation of System Resource Usage for More Accurate Monitoring, Prediction and Performance Workload Characterization,” Ser. No. 09/287,601, discloses a system and method that meets the need for more accurate and efficient monitoring and prediction of computer system performance.
Even when sampling frequencies are reduced, the performance data collected by system monitors can still be enormous. Traditional performance monitoring methods and/or tools display performance metric values at a rate similar to the rate at which they are sampled. To accurately monitor the hardware and software of a computer system, many different metrics are sampled, collected, stored and/or reported. When a computer network system or enterprise comprises only a few nodes, the aggregation of the monitoring data from each of the few nodes may not be a problem. But as the system grows, the total volume of performance data collected from the computers or nodes increases in proportion to the number of nodes. Pushing or pulling such a large quantity of data across a network for displaying or reporting becomes impractical or even impossible when hundreds or even thousands of nodes are managed from a few nodes or consoles. Therefore, it is desirable to have a method or system to further reduce the growth of data quantity in order to maintain the ability to monitor the performance of each node.
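The proportional growth described above can be made concrete with a back-of-the-envelope sketch. Every figure here (200 metrics per node, 16 bytes per sample, a 10-second sampling interval) and the function name `console_bytes_per_second` are hypothetical values chosen for illustration, not measurements from any real system.

```python
def console_bytes_per_second(nodes, metrics_per_node, bytes_per_sample, interval_s):
    """Raw monitoring traffic arriving at a console, in bytes per second.

    Assumes every node reports every metric once per sampling interval,
    with no aggregation or compression (an illustrative simplification).
    """
    return nodes * metrics_per_node * bytes_per_sample / interval_s


small = console_bytes_per_second(5, 200, 16, 10)
large = console_bytes_per_second(2000, 200, 16, 10)

print("5 nodes:    ", small, "bytes/s")
print("2000 nodes: ", large, "bytes/s")
print("growth:     ", large / small, "x")
```

Under these assumptions the traffic scales linearly with the node count, so a 400-fold increase in nodes produces a 400-fold increase in the data a console must receive and display.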