Historically, most software has been designed under the assumption that it will never fail, and therefore has little or no error detection capability designed into it. When a software error or failure does occur, usually either the computer operating system detects the error or the computer operator cancels execution of the software program because it is not producing the correct results.
When a software error or failure occurs, the usual response is to capture the entire storage area used by the software program. However, because the underlying error occurs before the program failure surfaces, the key data required to determine the cause of the error or failure is often lost or overlaid by that time, and it is not obvious how the path of execution reached the point where the error finally surfaced. Most program developers therefore use a process called tracing to isolate a software error.
To use tracing, trace points must be placed within the problem program in order to sample data along the path of execution. The problem is recreated, if possible, and data from the trace points are collected. Unfortunately, tracing has several undesirable side effects, including the following: 1) tracing requires a branch to a trace routine every time a trace point is encountered, often resulting in a significant impact not only on the problem program's performance but also on other programs executing on the same system; 2) tracing requires large data sets to contain the volumes of data generated by trace points; 3) a programmer who uses tracing to capture diagnostic data invariably ends up sifting through large amounts of data, most of which has no bearing on the cause of the error; 4) if the software problem was caused by a timing problem between two programs (e.g., two networking programs communicating with each other), the trace can slow program execution to the point where most timing problems cannot be recreated; 5) after the program problem is fixed, all the trace points must be removed and the program recompiled before it can be released commercially.
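The trace-point mechanism described above can be pictured with a minimal sketch. It is an illustration only, not taken from any of the systems discussed here: the names `trace_point`, `TRACE_LOG`, `average` and `run_with_trace` are hypothetical, and an in-memory list stands in for the trace data set. Every trace point branches to the trace routine unconditionally, so data accumulates whether or not an error ever occurs.

```python
import time

# Hypothetical in-memory stand-in for the trace data set.
TRACE_LOG = []


def trace_point(label, **data):
    """Trace routine: records a sample every time a trace point is reached,
    whether or not an error has occurred."""
    TRACE_LOG.append((time.perf_counter(), label, dict(data)))


def average(values):
    """Problem program instrumented with trace points (illustrative only)."""
    trace_point("average.entry", count=len(values))
    total = 0
    for index, value in enumerate(values):
        total += value
        # One extra call per loop iteration: this is the performance cost.
        trace_point("average.loop", index=index, total=total)
    result = total / len(values)  # raises ZeroDivisionError for an empty list
    trace_point("average.exit", result=result)
    return result


def run_with_trace(values):
    """Recreate the problem and return (result or exception, collected trace)."""
    TRACE_LOG.clear()
    try:
        return average(values), list(TRACE_LOG)
    except ZeroDivisionError as exc:
        return exc, list(TRACE_LOG)
```

In this sketch a successful three-element run already produces five trace records, none of which relate to any error, while the failing run on an empty list produces a single record. The trace volume grows with the amount of work done, not with the amount of diagnostic value.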
After all the data is collected, the cause of the program error is determined, and the fix is generated and tested, another problem still faces software support personnel. If the same problem occurs in another copy of the same software executing on another processor, how can it quickly be recognized as a known problem, so that resources are not wasted trying to resolve problems that have already been resolved? This is another drawback of current methodology. When attempting to determine whether a software problem has already been discovered, reported, and/or fixed, software support personnel rely on a problem description from the person who encountered the error or failure. However, different people will describe the same problem with different problem symptoms, making it difficult, if not impossible, to identify an already-known problem and match it with the existing solution. A software programmer may spend several hours or days reviewing diagnostic data for a software problem only to find later that the software problem had been reported and resolved at an earlier time.
Typical of the prior art are U.S. Pat. Nos. 4,802,165, 3,551,659 and 3,937,938. U.S. Pat. No. 4,802,165 issued to Ream discloses a method and apparatus for debugging computer programs by using macro calls inserted in the program at various locations. The system records trace outputs from the program under test with the trace output being generated unconditionally, that is, whether or not an error has occurred. U.S. Pat. No. 3,551,659 issued to Forsythe discloses a method for debugging computer programs by testing the program at various test points and automatically recording error status information on an output device. Each chosen test point causes the invoking of an unconditional transfer to an error checking routine. U.S. Pat. No. 3,937,938 issued to Matthews discloses a method for assisting in the detection of errors and the debugging of programs by permanently storing in a memory structure the sixteen most recently executed program instructions upon the occurrence of a program malfunction.
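The approach of retaining the sixteen most recently executed instructions can be pictured in software as a fixed-depth ring buffer that is frozen when a malfunction occurs. The sketch below is an analogy only: the cited patent describes a memory structure, and the class `ExecutionHistory`, its methods, and the string instruction labels are hypothetical names introduced for illustration.

```python
from collections import deque

HISTORY_DEPTH = 16  # depth taken from the description above


class ExecutionHistory:
    """Keeps only the most recent entries; older ones are overwritten."""

    def __init__(self, depth=HISTORY_DEPTH):
        # A deque with maxlen discards the oldest entry once it is full.
        self._recent = deque(maxlen=depth)
        self.frozen = None

    def record(self, instruction):
        """Called for every executed instruction (hypothetical hook)."""
        self._recent.append(instruction)

    def on_malfunction(self):
        """Preserve a snapshot of the history at the moment of the malfunction."""
        self.frozen = tuple(self._recent)
        return self.frozen
```

Used this way, only the tail of the execution path survives: if forty instructions are recorded before the malfunction, the snapshot holds the last sixteen and nothing earlier, which illustrates why an error that occurred well before the failure surfaced may no longer be visible.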
In summary, the existing process for software error correction embraces a methodology which waits for the damage caused by a software error to surface. Then, data trace points are inserted, the problem is recreated, and the execution path of the problem program is followed while large amounts of data are collected, hopefully catching the data that will determine what went wrong.