1. Field of the Invention
The present invention pertains to the field of computer systems. More particularly, this invention pertains to the field of recovering from computer system malfunctions.
2. Background of the Related Art
For many years, computer system manufacturers, computer component manufacturers, and computer users have been concerned with detecting and recovering from computer system malfunctions. There are many reasons why a computer system might malfunction, including memory data corruption, data corruption related to fixed disks or removable media, operating system errors, component errors, components overheating, applications or operating systems performing illegal instructions with respect to the processor, incompatibility between various hardware and software system components, etc.
Some of these types of malfunctions have been effectively dealt with by prior systems. For example, memory data corruption can be handled by parity detection and/or error correcting code (ECC). Illegal instructions can be trapped by the processor and in many cases handled either within the processor or by the operating system. Other malfunctions may result in system "hangs." A system is "hanged" when it is no longer able to respond to user inputs and/or is not able to respond to system events including, but not limited to, incoming network traffic, etc. Some malfunctions that can result in system hangs include operating systems or hardware components entering unknown or indeterminate states, causing the operating system or hardware component to cease normal operation. In these cases, the computer user must restart the computer. Restarting the computer after a system hang can cause problems such as data loss and corruption.
Some prior computer systems have included timers known as "watchdog" timers. In a typical watchdog timer implementation, a processor periodically resets a timer so that, under normal operation, the timer never reaches a certain value. If the timer ever reaches that value, the computer system is reset. This solution makes no attempt to cure the malfunction other than the drastic action of resetting the computer system. Resetting the computer system may result in the same problems mentioned above with regard to a user restarting a computer, including data loss and corruption.
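The watchdog behavior described above can be illustrated with a small software simulation. This is only a sketch under stated assumptions: the class name WatchdogTimer, the function run, and the parameters limit, kick_every, and hang_at are hypothetical and are not taken from any particular prior system.

```python
class WatchdogTimer:
    """Counts ticks; requests a system reset if the count reaches `limit`."""

    def __init__(self, limit):
        self.limit = limit            # hypothetical expiry value
        self.count = 0
        self.reset_requested = False

    def kick(self):
        """Called periodically by the processor during normal operation."""
        self.count = 0

    def tick(self):
        """Advance the timer by one unit; request a reset on expiry."""
        if self.reset_requested:
            return
        self.count += 1
        if self.count >= self.limit:
            self.reset_requested = True


def run(ticks, kick_every, limit, hang_at=None):
    """Simulate `ticks` timer units.

    The processor kicks the timer every `kick_every` units until it hangs
    at tick `hang_at` (None means it never hangs). Returns the tick at
    which the reset was requested, or None if no reset occurred.
    """
    wd = WatchdogTimer(limit)
    for t in range(1, ticks + 1):
        hung = hang_at is not None and t >= hang_at
        if not hung and t % kick_every == 0:
            wd.kick()
        wd.tick()
        if wd.reset_requested:
            return t
    return None


if __name__ == "__main__":
    print(run(100, kick_every=3, limit=5))               # healthy system
    print(run(100, kick_every=3, limit=5, hang_at=10))   # system hangs at tick 10
```

In this sketch the only response to expiry is the reset request itself, which mirrors the limitation noted above: nothing is tried before the system is reset.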
Separate error checking processors have been included in computer systems in order to detect and attempt to recover from system hangs. This solution has the disadvantage of being costly. The computer user benefits from less costly computer systems. Therefore, a lower cost method and apparatus for detecting and recovering from computer system malfunctions is desirable.
A method for recovering from a computer system lockup condition is disclosed. In one embodiment of the method, an interrupt is generated to the computer system's operating system notifying the operating system of the lockup condition. An operating system interrupt handler is then executed. The interrupt handler performs at least one step to attempt to cure the lockup condition. If the interrupt handler fails to cure the lockup condition, the interrupt is regenerated to the operating system notifying the operating system of the lockup condition. The interrupt handler is then re-executed in response to the regeneration of the interrupt, with the interrupt handler performing a further step in attempting to cure the lockup condition.
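The escalating flow described above, in which each regeneration of the interrupt causes the handler to perform a further step, might be modeled in software roughly as follows. This is a hedged sketch, not the disclosed implementation: the names LockupRecovery, recover, and is_locked_up, the representation of the system as a dictionary of faults, and the two example recovery steps are all hypothetical, since the text does not name specific steps.

```python
class LockupRecovery:
    """Models an interrupt handler that performs one further step per invocation."""

    def __init__(self, steps):
        self.steps = steps      # ordered recovery steps (hypothetical)
        self.next_step = 0      # index of the step to run on the next interrupt

    def has_more_steps(self):
        return self.next_step < len(self.steps)

    def handle_interrupt(self, system):
        """Run the next recovery step, then advance to the following one."""
        step = self.steps[self.next_step]
        self.next_step += 1
        step(system)


def is_locked_up(system):
    # Assumption: the system is locked up while any fault remains.
    return bool(system["faults"])


def recover(system, recovery):
    """Regenerate the interrupt while the lockup persists and steps remain.

    Returns (interrupts_generated, cured).
    """
    interrupts = 0
    while is_locked_up(system) and recovery.has_more_steps():
        interrupts += 1                      # interrupt (re)generated
        recovery.handle_interrupt(system)    # handler (re)executed
    return interrupts, not is_locked_up(system)


# Hypothetical recovery steps, ordered from least to most intrusive.
def terminate_stuck_application(system):
    system["faults"].discard("stuck_app")


def reinitialize_driver(system):
    system["faults"].discard("hung_driver")


if __name__ == "__main__":
    steps = [terminate_stuck_application, reinitialize_driver]
    system = {"faults": {"stuck_app", "hung_driver"}}
    print(recover(system, LockupRecovery(steps)))
```

Ordering the steps from least to most intrusive is a design choice of this sketch; it lets a mild remedy succeed before a more drastic one is attempted.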