1. Technical Field
This invention relates to the field of computer systems and in particular to an apparatus, method and program for recording diagnostic trace information in computer systems.
2. Background
The concept of recording diagnostic trace information in computer systems is well known in the art. Conventionally, trace tables are data buffers that are filled, as a program executes, with time-stamped status information about what the program is doing. In computer system and application environments, trace tables are a key instrument for problem determination during development and in field support.
Presently, trace tables in many computer systems are stored in non-persistent memory. In certain system environments, memory size is very limited and is primarily reserved for customer application data.
The use of memory for the trace tables is limited by the cost of memory and its physical size. This limits the amount of memory usable for trace table data. This trace table size limitation leads to the trace tables wrapping, sometimes within seconds depending on system and load conditions. This can cause valuable trace information to be lost as a problem develops over a given period. The loss of such trace information slows or even prevents problem determination. Furthermore, when comprehensive program tracing is required, the likelihood of the vital trace information being overwritten is increased.
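The wrapping behavior described above can be illustrated with a minimal sketch of a fixed-size trace table. The names TraceEntry and TraceTable, the capacity of four entries, and the message strings are hypothetical and chosen only for illustration; they are not taken from any particular system.

```python
import time
from dataclasses import dataclass


@dataclass
class TraceEntry:
    timestamp: float
    message: str


class TraceTable:
    """Fixed-size trace buffer that wraps when full (illustrative only)."""

    def __init__(self, capacity):
        if capacity <= 0:
            raise ValueError("capacity must be positive")
        self._slots = [None] * capacity
        self._next = 0      # index of the slot the next entry will occupy
        self.written = 0    # total number of entries ever recorded

    def record(self, message, timestamp=None):
        # Once the table is full, this overwrites the oldest entry.
        ts = time.time() if timestamp is None else timestamp
        self._slots[self._next] = TraceEntry(ts, message)
        self._next = (self._next + 1) % len(self._slots)
        self.written += 1

    def entries(self):
        """Return the surviving entries, oldest first."""
        if self.written < len(self._slots):
            return self._slots[:self.written]
        return self._slots[self._next:] + self._slots[:self._next]


# Assumed scenario: one significant entry followed by routine traffic.
table = TraceTable(capacity=4)
table.record("root cause: checksum mismatch", timestamp=0.0)
for i in range(1, 6):
    table.record(f"routine event {i}", timestamp=float(i))

surviving = [entry.message for entry in table.entries()]
```

In this sketch, six entries are written to a table with four slots, so the first two entries, including the one describing the original fault, are no longer present when the table is examined.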
One conventional solution, when analysis is frustrated because key information in the trace table has been overwritten by more recent entries, is to retry the failed task in various alternative ways while stopping the code closer to the event that caused the problem. This has the disadvantage that it is time consuming and requires recreating the problem, which is not always possible.
A further proposal for improvement is the use of various “levels” of interest of trace entries. For example, one might assign trace events to three levels, in increasing perceived importance: Information, Warning and Error. Each might be preserved in different storage areas, and thus mere informational message entries would be less likely to overwrite more important error message entries. The disadvantage here is that even this degree of separation is not always sufficient, as a “root cause” error trace entry can be overwritten quite rapidly by the cascade of entries corresponding to further errors that it has caused in dependent or subsequent operations performed by the system.
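The level-based proposal and its limitation can be sketched as follows, assuming one small wrapping buffer per level. The names Level and LeveledTrace, the per-level capacity of three entries, and the message strings are hypothetical; time stamps are omitted for brevity.

```python
from collections import deque
from enum import Enum


class Level(Enum):
    INFORMATION = 1
    WARNING = 2
    ERROR = 3


class LeveledTrace:
    """One wrapping buffer per level (illustrative only)."""

    def __init__(self, capacity_per_level):
        # A deque with maxlen discards its oldest item when a new one
        # is appended to a full buffer.
        self._tables = {
            level: deque(maxlen=capacity_per_level) for level in Level
        }

    def record(self, level, message):
        self._tables[level].append(message)

    def entries(self, level):
        return list(self._tables[level])


ROOT = "root cause: disk write failed"
trace = LeveledTrace(capacity_per_level=3)
trace.record(Level.ERROR, ROOT)

# A flood of informational entries does not touch the error buffer.
for i in range(1, 101):
    trace.record(Level.INFORMATION, f"heartbeat {i}")
root_survives_info_flood = ROOT in trace.entries(Level.ERROR)

# A cascade of consequent errors shares the error buffer with the root cause.
for i in range(1, 4):
    trace.record(Level.ERROR, f"dependent operation {i} failed")
root_survives_cascade = ROOT in trace.entries(Level.ERROR)
```

In this sketch the root cause entry survives one hundred informational entries, but is displaced by only three consequent error entries, because entries of the same level compete for the same storage area.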
It would thus be desirable to provide a method and apparatus for recording trace information in computer systems in such a manner that records for rarely-occurring events most likely to be indicative of the true root cause of a problem may be retained longer than records for events less likely to be indicative of the root cause of a problem.