This invention relates to a method for identifying errors in a programmed digital computer and for correcting the identified errors. In particular, this invention relates to a method for monitoring the instructions and data that cause errors, for analyzing the monitored instructions and data to predict errors, and for preventing future errors from occurring, for example by inserting corrective software.
MICROSOFT Corporation's Dr. Watson is a debugging tool that logs information regarding internal operations of the operating system "WINDOWS" into a failure report. Dr. Watson logs the information after any application software (typically called just "application") encounters an error that MICROSOFT calls an "unrecoverable application error (UAE)." See, for example, "An Annotated Dr. Watson Log File," KB:Windows SDK KBase, Microsoft Development Library, MICROSOFT Corporation, One Microsoft Way, Redmond, Wash.; "Postmortem Debugging," Matt Pietrek, Dr. Dobb's Journal, September 1992; and "Exception Handlers and Windows Applications," Joseph Hlavaty, Dr. Dobb's Journal, September 1994; all of which are incorporated by reference herein in their entirety.
Briefly, a Dr. Watson failure report contains information on (1) the name of the application that failed, (2) the error encountered, such as "Exceed Segment Bounds (Read)," (3) the address of the instruction at which the failure occurred, (4) the instruction that caused the failure, (5) the contents of various registers, such as CPU registers, the instruction pointer (also called "program counter"), stack pointer, base pointer, code segment selector, stack segment selector, data segment selector, extra segment selector, 32-bit registers and flag bits (e.g. Overflow bit, Direction bit, Sign bit, Zero bit, Carry bit, Interrupt bit, Auxcarry bit and Parity bit), (6) WINDOWS installation and environment information, (7) stack frame information, such as disassembled instructions surrounding the failed instruction and several levels of nested function calls leading to the failed instruction, (8) the names of all tasks running when the failure occurred, and (9) the user's response typed into a "Dr. Watson's Clues" dialog box.
MICROSOFT Corporation recommends that a user exit WINDOWS after a UAE occurs and, if exiting is not possible, restart the personal computer. See "The DrWatson and MSD Diagnostics," KB:Windows 3.x KBase, Microsoft Development Library, MICROSOFT Corporation, One Microsoft Way, Redmond, Wash., also incorporated by reference herein in its entirety. MICROSOFT Corporation further recommends that after a UAE occurs, the user run MICROSOFT DIAGNOSTICS (MSD), which identifies system configuration information, such as the BIOS, video card type, manufacturer, installed processor(s), I/O port status, operating system version, environment settings, attached hardware devices, and additional software running concurrently with MSD. Id. All of these actions can result in the loss of valuable data, as well as of valuable time before the user can resume using the application.
MICROSOFT Corporation also recommends that after logging several UAEs, the user send the log to MICROSOFT Corporation, although MICROSOFT Corporation cannot respond to log contributors. Id. Therefore, the user receives no assistance in identifying the problem that caused the UAE or in fixing the application to avoid that particular UAE in the future. Moreover, Dr. Watson appears to log only an application's UAE failures, and cannot be used for debugging other errors, such as errors in the operating system or errors in hardware.
Errors in hardware can be debugged using a built-in "debug port" of the type present in INTEL's P6 (also called "Pentium Pro") microprocessor. INTEL recommends the P6's debug port as an aid for designing a system board on which the CPU is mounted. See, for example, "Intel equips its P6 with test and debug features," Electronic Engineering Times, Oct. 16, 1995, n870, pages 1-2, which is incorporated by reference herein in its entirety.
Briefly, the P6 debug port is typically connected to an "in-target probe" (ITP) via a 30-pin connector, and allows access to the boundary-scan (JTAG) and built-in-self-test (BIST) structures on the P6 microprocessor. Through an ITP such as ICE-16, available from, for example, American Arium, Tustin, Calif., board designers can control program execution, set break points, and monitor the P6's accesses to registers, memory and input-output devices.
However, a typical user has neither access to an ITP nor the expertise needed to use one. Therefore, the user is still unable to identify the problem that causes a UAE and unable to fix the application to avoid known UAEs in the future.
In accordance with the invention, a central processing unit (CPU) repeatedly interrupts execution of software to save the CPU state, i.e. the contents of various storage elements internal to the CPU, until an error occurs during the execution. On occurrence of the error, the CPU once again saves state and only then passes control to a handler in the software for handling the error. Each time, the CPU state is saved at memory locations different from those used the previous time, so that a sequence of CPU states has been saved by the time control passes to the handler. The storage elements whose contents are saved can be of two types: (1) accessible and (2) inaccessible to the executing software, such as an operating system or an application. Moreover, the above-described state saving steps can be implemented, in different embodiments of the invention, in hardware (e.g. as a state machine) or in software (e.g. in the basic-input-output-system (BIOS), in an operating system, as a device driver, or as a utility). In one specific embodiment, the state saving steps are implemented in a computer process by use of x86 instructions.
1 The x86 instructions are instructions executable by microprocessors compatible with those in the 8086, 80286, 80386, 80486, Pentium and Pentium Pro (P6) families of microprocessors available from Intel Corporation, Santa Clara, Calif.
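To make the save-until-error sequence concrete, the following is a minimal Python sketch, not the x86 embodiment. It assumes a hypothetical toy CPU with one visible register (`acc`), an instruction pointer, and a `retired` counter standing in for state the software cannot read. A save every `save_interval` retired instructions substitutes for a periodic interrupt, and a division by zero stands in for "an error." Every name here (`ToyCpu`, `CpuState`, `save_interval`) is illustrative only.

```python
from dataclasses import dataclass, replace
from typing import Callable, List, Sequence, Tuple


@dataclass(frozen=True)
class CpuState:
    ip: int       # instruction pointer (program counter), visible to software
    acc: int      # a single accumulator register, visible to software
    retired: int  # stand-in for internal state the software cannot read


Instr = Tuple[str, int]                               # (opcode, operand)
Handler = Callable[[Exception, List[CpuState]], object]


class ToyCpu:
    """Hypothetical CPU model that snapshots its state until an error occurs."""

    def __init__(self, program: Sequence[Instr], save_interval: int, handler: Handler):
        self.program = list(program)
        self.save_interval = save_interval
        self.handler = handler
        self.state = CpuState(ip=0, acc=0, retired=0)
        self.saved: List[CpuState] = []  # slot k holds the k-th saved state

    def _save(self) -> None:
        # Appending puts each snapshot in a new slot rather than overwriting
        # the previous one, so a sequence of states accumulates.
        self.saved.append(self.state)

    def _execute(self, op: str, arg: int) -> int:
        if op == "add":
            return self.state.acc + arg
        if op == "div":
            return self.state.acc // arg  # arg == 0 raises ZeroDivisionError
        raise ValueError(f"unknown op {op!r}")

    def run(self):
        while self.state.ip < len(self.program):
            if self.state.retired % self.save_interval == 0:
                self._save()  # periodic "interrupt" that saves state
            op, arg = self.program[self.state.ip]
            try:
                acc = self._execute(op, arg)
            except ZeroDivisionError as err:
                self._save()  # save state once more on the error...
                return self.handler(err, list(self.saved))  # ...then hand off
            self.state = replace(self.state, ip=self.state.ip + 1, acc=acc,
                                 retired=self.state.retired + 1)
        return None  # program finished without error
```

For example, running `[("add", 10), ("add", 5), ("div", 3), ("add", 1), ("div", 0)]` with `save_interval=2` delivers four saved states to the handler. The last state points at the faulting instruction (index 4), which is the sequence an off-line tool would replay.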
In one embodiment, errors are debugged off-line in a development system, for example by using an in-circuit emulator to load the saved CPU states sequentially into the development system, thereby recreating the error condition. If the saved CPU states are too far apart to locate the source of the error, the CPU states can be saved more frequently, e.g. after shorter time periods, on every jump instruction, on every input-output instruction, on every function-call instruction, or on some combination of these events, depending on one or more flags. The flags can be set, for example, in a configuration file that is checked at the startup of the computer process. The sequence of saved CPU states allows recreation of error conditions that could not be recreated in the prior art. Moreover, the CPU states are saved transparently to the software, thereby allowing recreation of errors in an operating system as well as errors arising from interaction between the operating system and an application, neither of which was possible in the prior art.
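The flag-driven choice of when to save state can be sketched as follows. This assumes a hypothetical `key=value` configuration format (the key names `save_on_timer`, `save_on_jump`, `save_on_io` and `save_on_call` are invented for illustration) and a deliberately small, hypothetical mapping of x86 mnemonics to event classes. The real decision would be made on decoded instructions, not text.

```python
from enum import Flag


class SaveOn(Flag):
    """Events on which the CPU state may be saved (combinable bit flags)."""
    NONE = 0
    TIMER = 1
    JUMP = 2
    IO = 4
    CALL = 8


# Hypothetical configuration keys; a value of "1" enables the trigger.
_KEYS = {
    "save_on_timer": SaveOn.TIMER,
    "save_on_jump": SaveOn.JUMP,
    "save_on_io": SaveOn.IO,
    "save_on_call": SaveOn.CALL,
}

# Hypothetical, incomplete classification of a few x86 mnemonics.
_CLASS = {
    "jmp": SaveOn.JUMP, "jz": SaveOn.JUMP, "jnz": SaveOn.JUMP,
    "in": SaveOn.IO, "out": SaveOn.IO,
    "call": SaveOn.CALL,
}


def load_flags(config_text: str) -> SaveOn:
    """Parse the configuration text read at process startup into trigger flags."""
    flags = SaveOn.NONE
    for raw in config_text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line:
            continue
        key, _, value = line.partition("=")
        trigger = _KEYS.get(key.strip())
        if trigger is not None and value.strip() == "1":
            flags |= trigger
    return flags


def should_save(mnemonic: str, timer_expired: bool, flags: SaveOn) -> bool:
    """Decide whether to save the CPU state before executing this instruction."""
    if timer_expired and bool(flags & SaveOn.TIMER):
        return True
    return bool(_CLASS.get(mnemonic.lower(), SaveOn.NONE) & flags)
```

With `save_on_jump=1` and `save_on_call=1` set, a `jz` or `call` triggers a save while `out` does not. Using bit flags lets any combination of triggers be enabled without additional code paths, which matches the "some combination of these events" wording above.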
In accordance with the invention, an error can also be debugged proactively by a computer process, even before the error occurs, by use of a number of known-to-be-erroneous instructions and fix instructions corresponding to the known-to-be-erroneous instructions. In one embodiment, the CPU compares instructions to be executed with each of the known-to-be-erroneous instructions and, on finding a match, injects the corresponding fix instructions into the to-be-executed instructions. In this embodiment, these proactive error debugging steps are optionally executed by the state saving process, depending on a flag that is set or cleared, for example, in a configuration file. In another embodiment, the proactive error debugging steps are implemented in a different process that executes independently of the state saving process, i.e. one that does not save CPU states.
Therefore, well-known errors, e.g. the 80286 jump bug or the PENTIUM arithmetic bug, are easily avoided, for example by inserting a no-op instruction before a jump instruction or by replacing one arithmetic instruction with another arithmetic instruction. Such proactive debugging allows a user to continue using, for example, a defective PENTIUM or defective software without encountering any known errors. Moreover, if an error has not yet been debugged, the handler can add the erroneous instruction to the known-to-be-erroneous instructions, with a corresponding temporary-fix instruction to gracefully terminate the application, e.g. if the erroneous instruction is known to crash (e.g. "freeze") the CPU. Such graceful termination of the application allows the CPU to continue executing other software that may be of value to the user, e.g. eliminating the need to reboot the operating system as otherwise required in the prior art.
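The match-and-inject step, together with the handler recording a temporary fix for an undebugged error, might be sketched as follows. This is an illustration under stated assumptions, not the patented implementation. Instructions are modelled as assembly text rather than machine code. The table may be keyed either by a mnemonic or by an exact instruction, and the placeholder `{orig}` re-emits the matched instruction. The `jmp` entry simply mirrors the no-op-before-jump example above and is not a verified 80286 workaround. `graceful_terminate` is a hypothetical routine name, not a WINDOWS API.

```python
from typing import Dict, List, Sequence

# Known-to-be-erroneous instruction (mnemonic or exact text) -> fix instructions.
# "{orig}" inside a fix is replaced by the original instruction text.
FixTable = Dict[str, List[str]]

# Hypothetical routine that ends the faulting application without freezing the CPU.
TERMINATE = "call graceful_terminate"


def _mnemonic(instr: str) -> str:
    parts = instr.split(maxsplit=1)
    return parts[0].lower() if parts else ""


def inject_fixes(to_execute: Sequence[str], fixes: FixTable,
                 enabled: bool = True) -> List[str]:
    """Compare each to-be-executed instruction with the table and splice in fixes."""
    if not enabled:  # e.g. the flag is cleared in the configuration file
        return list(to_execute)
    out: List[str] = []
    for instr in to_execute:
        # An exact-instruction entry takes precedence over a mnemonic-wide entry.
        fix = fixes.get(instr, fixes.get(_mnemonic(instr)))
        if fix is None:
            out.append(instr)
        else:
            out.extend(step.replace("{orig}", instr) for step in fix)
    return out


def record_undebugged_error(fixes: FixTable, faulting: str, crashes_cpu: bool) -> bool:
    """Error-handler hook: add a temporary fix that terminates the app gracefully.

    Returns True if a new entry was added to the table.
    """
    if not crashes_cpu or faulting in fixes:
        return False
    fixes[faulting] = [TERMINATE]
    return True
```

For example, with `fixes = {"jmp": ["nop", "{orig}"]}`, the stream `["mov ax, 1", "jmp done"]` becomes `["mov ax, 1", "nop", "jmp done"]`. After the handler records a crashing instruction, later occurrences of that exact instruction are replaced by the termination call, so the CPU keeps running other software.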