As computers become more complex and powerful, monitoring the overall “health” of a computer becomes a greater concern, particularly when problems occur and the causes of those problems need to be identified and resolved. For this reason, a number of techniques have been developed for collecting information, often referred to as performance metrics, relating to the state of a computer during its operation.
For example, one manner of collecting performance metrics relies upon counters and/or timers that are instrumented into a running system and that provide real-time feedback about the number, type and performance of various processes running in a computer and the resources being utilized by those processes. Counters and timers, however, are usually directed to collecting specific pieces of information, and do not provide a comprehensive set of information about the overall performance of a computer or any of its components. Thus, while counters and timers can be useful in identifying problem areas that need to be investigated, they typically do not provide the level of detail needed to solve most problems.
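By way of illustration only, counter and timer instrumentation of the kind described above might be sketched as follows. The `Metrics` class, the `handle_request` function, and the metric names are hypothetical and are not taken from any particular system.

```python
import time
from collections import defaultdict
from contextlib import contextmanager


class Metrics:
    """Hypothetical in-process registry of counters and timers."""

    def __init__(self):
        self.counters = defaultdict(int)    # name -> number of occurrences
        self.timers = defaultdict(float)    # name -> accumulated seconds

    def count(self, name, amount=1):
        self.counters[name] += amount

    @contextmanager
    def timed(self, name):
        # Accumulate the elapsed wall-clock time of the enclosed block.
        start = time.perf_counter()
        try:
            yield
        finally:
            self.timers[name] += time.perf_counter() - start


metrics = Metrics()


def handle_request(payload):
    """Assumed unit of work instrumented with one counter and one timer."""
    metrics.count("requests")
    with metrics.timed("handle_request"):
        return sum(payload)
```

Such a sketch yields only aggregate values (how many requests, how much total time), which is consistent with the observation that counters and timers can point to a problem area without supplying the detail needed to resolve it.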
For this reason, many computers rely on a system tracing facility, which records a historical collection of “events” that occur within a computer. These events are usually generated by explicit calls from the component software to the system tracing facility, and a user often has the ability to select only certain types of events to trace. The amount of data collected by a system tracing facility is often exceptionally large, and the recorded events typically must be analyzed after collection has ended, often using relatively sophisticated database query and analysis techniques. Due to the complexity and volume of data, the total elapsed time from starting a trace to generating detailed reports can be significant, e.g., a number of hours, which precludes any generation of results in near real-time.
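As a rough sketch only, the explicit-call and event-selection behavior of a system tracing facility could be modeled as shown below. The `TraceFacility` class, its method names, and the event types are assumptions made for illustration; an actual facility would typically write to persistent storage rather than to an in-memory list.

```python
import time


class TraceFacility:
    """Hypothetical system-wide tracer; components call record() explicitly."""

    def __init__(self):
        self.enabled_types = set()   # event types the user selected to trace
        self.events = []             # historical record, grows without bound

    def enable(self, *event_types):
        self.enabled_types.update(event_types)

    def record(self, event_type, component, **data):
        # Events of types that were not selected are dropped at the call site.
        if event_type in self.enabled_types:
            self.events.append((time.time(), event_type, component, data))

    def query(self, event_type=None, component=None):
        """Post-collection analysis: filter the recorded events."""
        return [e for e in self.events
                if (event_type is None or e[1] == event_type)
                and (component is None or e[2] == component)]


trace = TraceFacility()
trace.enable("io")                            # trace only "io" events
trace.record("io", "disk", size=4096)         # recorded
trace.record("lock", "scheduler", held=True)  # ignored: type not enabled
```

Because the event list is unbounded and is examined only through after-the-fact queries, the sketch also suggests why analysis of a large trace tends to lag well behind collection.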
Another technique that may be used for gathering performance metrics relies on “flight recorders.” A flight recorder is typically a simplified, high-performance version of a system tracing facility that is dedicated to a specific software component in a computer. Because of its simplified nature, a flight recorder is much more likely to be able to provide near real-time information on a problem.
A flight recorder, as compared to a system tracing facility, generally collects information regarding a relatively small number of events, and often the events are at a comparatively higher level in the hierarchy of the computer system. In addition, the data collected by a flight recorder is typically buffered only on a temporary basis, and is not permanently stored. Control of a flight recorder is typically implemented by the component being monitored, and as a result, little consistency exists between flight recorders for different components in terms of format, content, size, enablement, and data extraction mechanisms.
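The temporary buffering described above is commonly realized as a fixed-size ring buffer in which the oldest entries are overwritten by the newest. The following is a minimal sketch under that assumption; the `FlightRecorder` class, its capacity, and the event contents are hypothetical.

```python
from collections import deque


class FlightRecorder:
    """Hypothetical per-component recorder backed by a bounded ring buffer."""

    def __init__(self, capacity=4):
        # A deque with maxlen silently discards the oldest entry when full.
        self._buffer = deque(maxlen=capacity)
        self.running = False

    def start(self):
        self.running = True

    def stop(self):
        self.running = False

    def record(self, event):
        if self.running:
            self._buffer.append(event)

    def extract(self):
        """Return a copy of the buffered events without stopping the recorder."""
        return list(self._buffer)


rec = FlightRecorder(capacity=3)
rec.start()
for i in range(5):
    rec.record({"seq": i})
recent = rec.extract()   # only the three most recent events remain
```

Since each component owns its recorder in this model, the format of `event` and the extraction call are whatever that component chooses, which mirrors the lack of consistency noted above.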
The general manner in which a flight recorder is typically used is as follows. When it is projected that a performance problem is likely to occur in the near future (e.g., within minutes or hours), flight recorders for any suspected components may be started. Then, when a problem in component X is detected (e.g., from counter and/or timer metrics), the component X flight recorder data may be extracted and analyzed to diagnose the problem. This extracted data is available in near real-time and can be used to take action and drive other decisions in the process of problem determination. Moreover, the flight recorders often continue to run, thus enabling extraction and analysis of data to be repeated as necessary.
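The usage pattern just described may be sketched, purely for illustration, as a loop in which a counter metric triggers extraction from a running recorder. The error threshold, the simulated failure pattern, and all names below are assumptions and do not reflect any particular component.

```python
from collections import deque

recorder = deque(maxlen=3)   # hypothetical flight recorder for component X
error_count = 0              # hypothetical counter metric for component X
ERROR_THRESHOLD = 2          # assumed trigger for "problem detected"
snapshots = []               # extracted data, available for analysis


def component_x_step(i):
    """Simulated unit of work; every odd-numbered step is assumed to fail."""
    global error_count
    failed = (i % 2 == 1)
    recorder.append({"step": i, "failed": failed})
    if failed:
        error_count += 1


for i in range(8):
    component_x_step(i)
    if error_count >= ERROR_THRESHOLD:
        # Problem detected from the counter: extract the recent events.
        snapshots.append(list(recorder))
        # The recorder keeps running, so extraction can be repeated later.
        error_count = 0
```

In this sketch the extraction is triggered twice, each time capturing the three most recent events leading up to the detected problem.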
Particularly in complex systems like mainframes and other multi-user servers, the software resident in such systems has been thoroughly instrumented to provide a number of performance metric collection capabilities, with instrumentation of the lower levels of software being primarily accomplished through the use of counters/timers and system tracing facilities. However, with the ever-increasing demand for immediate problem identification and resolution in enterprise computing environments, there is an increasing need for the types of capabilities provided by flight recorders.
On the other hand, flight recorders are less prevalent in many systems, and often require the efforts of the developers of each component of interest to properly integrate a flight recorder into the component. Furthermore, the data collected by a flight recorder is often redundant with respect to that collected by a system tracing facility, and as a result, the flight recorder in many ways needlessly increases system overhead and decreases system performance.
Therefore, a substantial need exists in the art for a manner of improving the collection of performance metrics in a computer, and in particular, for a manner of collecting more comprehensive data for use in diagnosing problems in a running computer in near real-time.