The present invention relates to computers and, more particularly, to computers that use event logs for diagnostic purposes. The present invention provides an improved event log that better uses limited capacity to retain events of greatest interest.
Much of modern progress is associated with the increasing prevalence of computer and network technology in society. Due to the complexity of computers and networks, errors are inevitable. Accordingly, diagnostic tools are continuously developed to address these errors. In the case of personal computers, an encountered error can be reported to a user when it occurs. For example, a message can be presented to the user, perhaps suggesting corrective action.
On the other hand, computers used as network and Internet servers are not typically attended by users. Accordingly, detected errors are typically compiled by a service processor that logs the errors (and other significant events) in an event log. When the computer is serviced, e.g., after some problem or failure, the event log can be examined to help determine the cause of the problem or failure.
One standard form of event log stores events until the log capacity is filled. Once full, it stops accepting events for logging. In other words, it favors old events over recent events for retention. Such an event log works well for error events that trigger a cascade of other error events. For example, an error event associated with the failure of a network port would be followed by a large number of detections of a failed network port. Error events that are reported after the log is full are discarded. However, the error events associated with the original failure are retained.
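By way of illustration only, the fill-and-stop behavior described above can be sketched as follows. The class name `FillAndStopLog`, the capacity of three entries, and the event strings are hypothetical and are not taken from any particular implementation.

```python
class FillAndStopLog:
    """Hypothetical fixed-capacity event log that stops accepting events once full."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.events = []

    def record(self, event):
        # Accept the event only while free space remains.
        if len(self.events) < self.capacity:
            self.events.append(event)
            return True
        # The log is full, so the new event is discarded.
        return False


log = FillAndStopLog(capacity=3)
for event in ["port failure", "port down #1", "port down #2", "port down #3"]:
    log.record(event)

# The original failure is retained; the last follow-on event is dropped.
print(log.events)
```

In this sketch the triggering event survives the cascade, but any event arriving after the third entry is lost, including one that might be unrelated to the cascade.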
However, a log may already be full before the original error of interest occurs, in which case that error is not retained and is not available for diagnosis. Also, in many cases, late-reported errors are of interest. For example, in the event of a computer failure (e.g., a “hang”), the last error or last few errors are typically of greatest interest. A system that makes these errors the least likely to be retained is therefore not optimal.
The likelihood of overflows can be reduced by using event logs with greater capacity. However, integrated-circuit real estate is limited, so it is not practical to use an event log that is large enough to store all possible events of interest. Also, no matter how great the log capacity, the problem of overflow must be addressed. One seemingly cost-effective approach to increasing effective log capacity is to archive logged events prior to a reset of the event log. Thus, as the event log approaches its capacity limits, its contents are written to hard disk; the log can then be emptied, ready to accept new events. However, archiving normally requires software running on a user processor, and there is no guarantee that such software will be installed or properly maintained by the user. Another approach is to reset without archiving, but there is always a risk that the reset will delete events of interest.
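The archive-then-reset approach can be sketched, under simplifying assumptions, as follows. The class name `ArchivingLog` is hypothetical, an in-memory list stands in for the hard disk, and the archive step is triggered when the log becomes full rather than as it approaches capacity.

```python
class ArchivingLog:
    """Hypothetical event log that archives its contents and resets when full."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.events = []
        self.archive = []  # assumption: stands in for storage on hard disk

    def record(self, event):
        self.events.append(event)
        if len(self.events) == self.capacity:
            # Copy the logged events to the archive, then empty the log.
            self.archive.extend(self.events)
            self.events.clear()


log = ArchivingLog(capacity=3)
for i in range(5):
    log.record(f"event {i}")

print(log.archive)  # first three events, moved out of the log
print(log.events)   # two most recent events, still in the log
```

If the archive step were omitted and the log were simply cleared, the first three events in this sketch would be lost, which is the risk noted above for a reset without archiving.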
In a prior-art approach implemented by Hewlett-Packard Company in an HP9000 K-Class design used in some of its servers, a “circular” event log is used. Once the event log is full, new events overwrite the oldest events in the log. This works well for errors that hang a computer, but not so well for events that trigger a cascade of error events. If there are enough follow-on events, the trigger event(s) could be overwritten and thus unavailable for diagnosis. To make it less likely that critical events will be discarded, a second circular log that retains only “severe” errors is used. This can be wasteful, as many errors are entered into both logs. Furthermore, there is still the possibility of the severe event log being filled when a cascade of errors occurs. What is needed is an improved event log system.
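The shortcoming of the dual circular-log arrangement described above can be illustrated with the following sketch. It is not the HP9000 K-Class implementation; the names `CircularLog` and `log_event`, the log capacities, and the event strings are assumptions chosen only to show the overwrite behavior.

```python
from collections import deque


class CircularLog:
    """Hypothetical circular event log: new events overwrite the oldest when full."""

    def __init__(self, capacity):
        # A deque with maxlen silently drops its oldest entry on overflow.
        self.events = deque(maxlen=capacity)

    def record(self, event):
        self.events.append(event)


main_log = CircularLog(capacity=3)    # receives every event
severe_log = CircularLog(capacity=2)  # receives only severe events


def log_event(name, severe):
    main_log.record(name)
    if severe:
        # Severe events are entered into both logs.
        severe_log.record(name)


log_event("trigger: port failure", severe=True)
for i in range(1, 4):
    log_event(f"follow-on #{i}", severe=True)

print(list(main_log.events))
print(list(severe_log.events))
```

In this sketch a cascade of severe follow-on events pushes the trigger event out of both logs, and every severe event is stored twice while it remains in the logs.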