1. Field of the Invention
The present invention pertains to the field of computer systems. More particularly, this invention pertains to the field of detecting and recovering from computer system malfunctions.
2. Background of the Related Art
For many years, computer system manufacturers, computer component manufacturers, and computer users have been concerned with detecting and recovering from computer system malfunctions. There are many reasons why a computer system might malfunction, including memory data corruption, data corruption related to fixed disks or removable media, operating system errors, component errors, components overheating, applications or operating systems executing instructions that are illegal for the processor, incompatibility between various hardware and software system components, etc.
Some of these types of malfunctions have been effectively dealt with by prior systems. For example, memory data corruption can be handled by parity detection and/or error-correcting code (ECC). Illegal instructions can be trapped by the processor and in many cases handled either within the processor or by the operating system. Other malfunctions may result in system "hangs." A system is "hung" when it is no longer able to respond to user inputs. Malfunctions that can result in system hangs include an operating system or hardware component entering an unknown or indeterminate state, causing that operating system or hardware component to cease normal operation. In these cases, the computer user must restart the computer. Restarting the computer after a system hang can cause problems such as data loss and corruption.
Some prior computer systems have included timers known as "watchdog" timers. In a typical watchdog timer implementation, a processor periodically resets a timer so that, under normal operation, the timer never reaches a predetermined value. If the timer ever reaches the predetermined value, the computer system is reset. This solution makes no attempt to cure the malfunction other than taking the drastic action of resetting the computer system. Resetting the computer system may result in the same problems mentioned above with regard to a user restarting a computer, including data loss and corruption.
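For illustration only, the watchdog behavior described above can be sketched as a small software simulation. This is a minimal sketch under stated assumptions, not an implementation taken from any particular prior system: the names `WatchdogTimer`, `kick`, `tick`, and `simulate`, the tick-based timing, and the parameter values are all hypothetical.

```python
class WatchdogTimer:
    """Hypothetical model of a watchdog timer that counts up toward a limit."""

    def __init__(self, limit):
        self.limit = limit  # predetermined value that triggers a system reset
        self.count = 0      # current timer value

    def kick(self):
        """Called periodically by a healthy processor to reset the timer."""
        self.count = 0

    def tick(self):
        """Advance the timer by one step; return True if the limit is reached."""
        self.count += 1
        return self.count >= self.limit


def simulate(limit, total_ticks, hang_at=None, kick_every=2):
    """Run the model and return the tick at which the watchdog would reset
    the system, or None if the limit is never reached.

    hang_at is the (assumed) tick at which the processor hangs and stops
    resetting the timer; None means the processor never hangs.
    """
    wd = WatchdogTimer(limit)
    for t in range(total_ticks):
        healthy = hang_at is None or t < hang_at
        if healthy and t % kick_every == 0:
            wd.kick()  # normal operation: the timer is reset before it expires
        if wd.tick():
            return t   # timer reached the limit: the only response is a reset
    return None


if __name__ == "__main__":
    print(simulate(limit=5, total_ticks=20))              # processor never hangs
    print(simulate(limit=5, total_ticks=20, hang_at=10))  # processor hangs at tick 10
```

In this sketch, the only response to an expired timer is the reset itself; nothing in the model attempts to diagnose or cure the underlying malfunction, which is the limitation noted above.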
Separate error-checking processors have been included in computer systems in order to detect and attempt to recover from system hangs. This solution has the disadvantage of being costly, and the computer user benefits from less costly computer systems. Therefore, a lower cost method and apparatus for detecting and recovering from computer system malfunctions is desirable.