As computers become more complex and powerful, monitoring the overall “health” of a computer becomes a greater concern, particularly when problems occur and the causes of those problems need to be identified and resolved. For this reason, a number of techniques have been developed for collecting information, often referred to as performance metrics, relating to the state of a computer during its operation.
For example, one manner of collecting performance metrics relies upon counters and/or timers that are instrumented into a running system and that provide real-time feedback about the number, type and performance of various processes running in a computer and the resources being utilized by those processes. Counters and timers, however, are usually directed to collecting specific pieces of information, and do not provide a comprehensive set of information about the overall performance of a computer or any of its components. Thus, while counters and timers can be useful in identifying problem areas that need to be investigated, they typically do not provide the level of detail needed to solve most problems.
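By way of illustration only, the counter and timer instrumentation described above might be sketched as follows. The `Metrics` class, its method names, and the `handle_request` routine are hypothetical names chosen for this sketch and do not correspond to any particular system.

```python
import time
from collections import defaultdict
from contextlib import contextmanager


class Metrics:
    """Hypothetical holder for named counters and cumulative timers."""

    def __init__(self):
        self.counters = defaultdict(int)    # name -> number of occurrences
        self.timers = defaultdict(float)    # name -> total elapsed seconds

    def count(self, name, amount=1):
        # Increment a named counter.
        self.counters[name] += amount

    @contextmanager
    def timed(self, name):
        # Accumulate the wall-clock time spent inside the 'with' block.
        start = time.perf_counter()
        try:
            yield
        finally:
            self.timers[name] += time.perf_counter() - start


metrics = Metrics()


def handle_request(payload):
    # Instrumented routine (assumed for illustration): one counter, one timer.
    metrics.count("requests")
    with metrics.timed("handle_request"):
        return payload.upper()
```

Such a sketch reports how often and how long a routine ran, but, consistent with the limitation noted above, it records nothing about what happened inside any individual call.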
For this reason, many computers rely on a system tracing facility, which records a historical collection of “events” that occur within a computer. These events are usually generated by explicit calls from the component software to the system tracing facility, and a user can often select only certain types of events to trace. The amount of data collected by a system tracing facility is frequently very large, so the recorded events must be analyzed after the collection has ended, often using relatively sophisticated database query and analysis techniques. Due to the complexity and volume of data, the total elapsed time from starting a trace to generating detailed reports can be significant, e.g., a number of hours, which precludes generating results in near real-time.
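As a simplified illustration, a system tracing facility with selectable event types and after-the-fact analysis might be sketched as follows. The `TraceFacility` class, the event field names, and the `events_by_component` helper are assumptions made for this sketch; an actual facility would typically persist far larger volumes of data and rely on database tooling for analysis.

```python
import time


class TraceFacility:
    """Hypothetical system-wide trace log with user-selected event types."""

    def __init__(self, enabled_types):
        self.enabled_types = set(enabled_types)
        self.events = []  # historical collection, analyzed after collection ends

    def trace(self, event_type, component, data=None):
        # Explicit call made by component software; disabled types are dropped.
        if event_type in self.enabled_types:
            self.events.append({
                "time": time.time(),
                "type": event_type,
                "component": component,
                "data": data,
            })


def events_by_component(events):
    # Post-collection analysis step: count recorded events per component.
    summary = {}
    for event in events:
        summary[event["component"]] = summary.get(event["component"], 0) + 1
    return summary
```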
Another technique that may be used for gathering performance metrics relies on “flight recorders.” A flight recorder is typically a simplified, high performance version of a system tracing facility that is dedicated to a specific software component in a computer. The simplified nature typically means that a flight recorder is much more likely to be able to provide near real-time information on a problem.
A flight recorder, as compared to a system tracing facility, generally collects information regarding a relatively small number of events, and often the events are at a comparatively higher level in the hierarchy of the computer system. For example, a component may be instrumented to call the flight recorder at entry and/or exit points of routines, at the beginning and/or completion of certain operations, etc. In addition, the data collected by a flight recorder, which is usually organized into “trace points,” is buffered only on a temporary basis, and is not permanently stored. Control of a flight recorder is typically implemented by the component being monitored. Much in the same manner as an aircraft flight recorder, a flight recorder logs trace points on a continuous basis such that, after a problem is detected, the flight recorder's log can be reviewed to assist in reconstructing the problem and the potential cause(s) thereof.
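One possible sketch of such a flight recorder is shown below, assuming a fixed-capacity in-memory buffer in which the oldest trace points are overwritten as new ones arrive. The `FlightRecorder` class, the `traced` decorator, and the `read_block` routine are hypothetical names introduced for illustration only.

```python
import time
from collections import deque
from functools import wraps


class FlightRecorder:
    """Hypothetical per-component recorder holding only recent trace points."""

    def __init__(self, capacity=1024):
        # A bounded deque discards the oldest entry once capacity is reached,
        # so trace points are buffered temporarily rather than stored permanently.
        self._buffer = deque(maxlen=capacity)

    def log(self, point, detail=None):
        self._buffer.append((time.monotonic(), point, detail))

    def snapshot(self):
        # Copy of the current log, oldest trace point first.
        return list(self._buffer)


def traced(recorder):
    """Instrument a routine to log trace points at its entry and exit."""
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            recorder.log(f"{func.__name__}:entry")
            try:
                return func(*args, **kwargs)
            finally:
                recorder.log(f"{func.__name__}:exit")
        return wrapper
    return decorator


recorder = FlightRecorder(capacity=4)


@traced(recorder)
def read_block(n):
    # Stand-in for a routine of the monitored component.
    return n * 2
```

The small capacity used here is chosen only to make the overwrite behavior easy to observe; the log always reflects the most recent activity of the component.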
The general manner in which a flight recorder is typically used is as follows. When a performance problem is expected to occur in the near future (e.g., within minutes or hours), flight recorders for any suspected components may be started. Then, when a problem in component X is detected (e.g., from counter and/or timer metrics), the component X flight recorder data may be extracted and analyzed to diagnose the problem. This extracted data is available in near real-time and can be used to take action and drive other decisions in the process of problem determination. Moreover, the flight recorders often continue to run, thus enabling extraction and analysis of data to be repeated as necessary.
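The usage pattern described above might be sketched as follows, assuming that problems are flagged when a per-component error counter reaches a threshold. The `ComponentRecorder` class, the `check_and_extract` function, and the threshold test are hypothetical and serve only to illustrate the start, detect, extract, and repeat sequence.

```python
from collections import deque


class ComponentRecorder:
    """Hypothetical flight recorder that can be started on demand."""

    def __init__(self, capacity=256):
        self.running = False
        self._buffer = deque(maxlen=capacity)

    def start(self):
        self.running = True

    def log(self, point):
        # Trace points are kept only while the recorder is running.
        if self.running:
            self._buffer.append(point)

    def extract(self):
        # Extraction copies the log; the recorder keeps running afterwards.
        return list(self._buffer)


def check_and_extract(recorders, error_counts, threshold):
    # For each component whose error counter reached the (assumed) threshold,
    # pull that component's flight recorder data for analysis.
    return {
        name: recorders[name].extract()
        for name, count in error_counts.items()
        if count >= threshold and name in recorders
    }
```

Because extraction does not stop a recorder in this sketch, the same check can be repeated later to obtain newer trace points.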
In other instances, flight recorders may be configured to run anytime a computer is operational, thereby providing an on-going log of events that can be evaluated at a later time to reconstruct any problems encountered during operation.
Conventional flight recorders, however, are passive in nature, and are generally limited to logging trace points that are only later analyzed in the event of a problem. The actual detection of errors as they occur, on the other hand, is beyond the scope of conventional flight recorders. Real-time error detection may be left to other logic in a computer, such as watchdog timers and exception handlers that halt execution when problems are detected. Otherwise, computers can become non-responsive and require a reboot, at which time the log of a flight recorder can be analyzed to reconstruct the error.
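For context, the watchdog timer mentioned above might be sketched in simplified form as follows. The `Watchdog` class, its `kick` and `expired` methods, and the injectable `clock` parameter are assumptions made for this sketch; hardware and operating system watchdogs differ considerably in their details.

```python
import time


class Watchdog:
    """Hypothetical watchdog: expires if not kicked within the timeout."""

    def __init__(self, timeout_s, clock=time.monotonic):
        self._timeout_s = timeout_s
        self._clock = clock            # injectable so behavior can be tested
        self._last_kick = clock()

    def kick(self):
        # Called periodically by the monitored logic to show it is still alive.
        self._last_kick = self._clock()

    def expired(self):
        # True once more than timeout_s has passed since the last kick.
        return self._clock() - self._last_kick > self._timeout_s
```

A supervisor that observes `expired()` returning true would then halt or restart the monitored logic, which is the point at which a flight recorder's log becomes useful for reconstructing the error.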
In some instances, however, real-time error detection may be slow to detect errors in an operational computer or one of its components. For example, in a complex multi-user computer such as a server, some of the sub-systems may experience errors that are not readily detected by conventional error detection techniques. As but one example, a removable media sub-system that provides an interface for removable storage devices, e.g., for the purposes of system backups, may experience an error and become non-responsive, but due to the relatively low frequency of use, the error may not be detected for hours, typically not until another request attempts to access the sub-system.
Therefore, a substantial need continues to exist in the art for a manner of improving error detection in a computer and/or a computer's sub-systems or components.