Various approaches to monitoring the performance of a computing system or program exist. For example, the run-time performance of a program may be monitored by sampling the execution state of the program at regular intervals. At each sampling interval, aspects of the execution state (e.g., the program counter, memory utilization, or the function call stack) are recorded. In another approach, the run-time performance of a program may be monitored by instrumenting the program code. Instrumenting the program code involves inserting, manually or automatically (e.g., by a compiler), code that records information about the program state during execution. A related approach provides for the registration (possibly at run time) of callbacks that are executed in response to the occurrence of particular events (e.g., exceptions, object creation, function calls). Such callbacks can then record information about the execution state of the program.
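By way of illustration only, the instrumentation and callback approaches described above might be sketched in Python as follows. The names `instrumented`, `on_event`, `work`, `run_monitored`, `timings`, and `call_counts` are hypothetical and are introduced solely for this example. The sketch uses the standard-library function `sys.setprofile` as one possible mechanism for registering a callback at run time; it is not presented as any particular monitoring system's implementation.

```python
import sys
import time
from collections import Counter
from functools import wraps

# Hypothetical in-memory stores for recorded measurements.
timings = {}             # function name -> list of elapsed seconds
call_counts = Counter()  # function name -> number of observed Python calls


def instrumented(func):
    """Instrumentation: wrap func with inserted code that records its run time."""
    @wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            return func(*args, **kwargs)
        finally:
            elapsed = time.perf_counter() - start
            timings.setdefault(func.__name__, []).append(elapsed)
    return wrapper


def on_event(frame, event, arg):
    """Callback: invoked by the interpreter for each profiling event."""
    if event == "call":  # a Python-level function call is starting
        call_counts[frame.f_code.co_name] += 1


@instrumented
def work(n):
    """Example workload: sum of squares below n."""
    total = 0
    for i in range(n):
        total += i * i
    return total


def run_monitored(n):
    """Register the callback at run time, run the workload, then unregister."""
    sys.setprofile(on_event)
    try:
        return work(n)
    finally:
        sys.setprofile(None)
```

Calling `run_monitored(10)` runs `work` once with both mechanisms active: the wrapper appends one elapsed-time entry to `timings["work"]`, and the registered callback counts the call in `call_counts["work"]`. A sampling-based monitor would instead inspect the execution state periodically from a timer or a separate thread, which is omitted here for brevity.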
Monitoring the performance of distributed systems presents special challenges. First, monitoring the performance of even a single system or program can result in the generation of large volumes of performance-related data. This problem is exacerbated when the performance of many computing systems is monitored over a long period of time.
Furthermore, performance monitoring in the distributed systems context gives rise to a need to centralize and/or aggregate stored performance-related data. In one approach, each computing system stores its performance-related data locally, and the data is later retrieved and aggregated for analysis. Because the data becomes available for analysis only after it has been retrieved, this approach does not allow for substantially real-time analysis of the performance of the monitored systems, individually or as a whole.
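As a minimal sketch of the store-locally-then-aggregate approach just described, assume that each system's stored data has already been retrieved as a list of (metric, value) pairs. The function name `aggregate`, the record layout, and the choice of a per-metric average are assumptions made for this example only.

```python
from collections import defaultdict
from statistics import mean


def aggregate(per_host_records):
    """Merge records retrieved from each host into one average per metric.

    per_host_records maps a host name to the list of (metric, value)
    pairs that the host had stored locally (hypothetical layout).
    """
    merged = defaultdict(list)
    for records in per_host_records.values():
        for metric, value in records:
            merged[metric].append(value)
    return {metric: mean(values) for metric, values in merged.items()}
```

The aggregation can run only after every host's data has been collected, which is why results produced this way lag behind the events they describe.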
In another approach, each computing system transmits its performance-related data to a central monitoring system. This approach does not scale well. In particular, the central monitoring system becomes a bottleneck for network and storage utilization. Nor is this approach robust. If the disk or other storage device used by the central monitoring system fails, performance-related data from all of the monitored systems may be lost.