1. Technical Field
The present invention relates to a method and system for fault detection in general, and in particular to a method and system for detecting faults in a processor. Still more particularly, the present invention relates to a method and system for handling detected faults in a processor to improve reliability of a data-processing system.
2. Description of the Prior Art
As personal computers and workstations are utilized to perform more and more substantial applications that were formerly reserved for mainframes, system availability and data integrity become increasingly important for these "smaller" computers. However, expensive fault-tolerant techniques and elaborate internal-checking hardware are seldom available in these "smaller" computers because of cost.
In the prior art, a technique known as lock-step duplexing is utilized to assure data integrity in lower-priced computers. With lock-step duplexing, two processing elements are utilized for fault detection; when a mismatch is found between the two processing elements, the computer system immediately comes to a halt. In certain respects, this is a very safe methodology because it assumes that every error is permanent. At the same time, the associated cost of this methodology can be very high because each outage usually entails a long downtime. This is particularly true when the majority of errors that occur in the field are transient in nature, which makes such a methodology appear overly conservative.
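By way of illustration only, the halt-on-mismatch behavior described above may be sketched as follows. The names SystemHalt, lockstep_execute, healthy_pe, and faulty_pe are hypothetical and do not appear in the prior art; each processing element is modeled as a plain function, and the fault is assumed to be a single-bit flip in the result.

```python
class SystemHalt(Exception):
    """Signals that the duplexed system has stopped on a mismatch."""


def lockstep_execute(pe_a, pe_b, operand):
    """Run one operation on both processing elements and compare the results."""
    result_a = pe_a(operand)
    result_b = pe_b(operand)
    if result_a != result_b:
        # No retry: any mismatch is treated as if the fault were permanent.
        raise SystemHalt(f"mismatch: {result_a!r} != {result_b!r}")
    return result_a


def healthy_pe(x):
    """Hypothetical fault-free processing element."""
    return x + 1


def faulty_pe(x):
    """Hypothetical processing element with a single-bit flip in its result."""
    return (x + 1) ^ 0x4
```

In this sketch, a transient fault and a permanent fault are indistinguishable: either one raises SystemHalt on the first mismatch, which corresponds to the long outage discussed above.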
As an improvement, some lock-step duplexing systems are enhanced by utilizing a "retry." More specifically, if there is a mismatch, the operation is retried on both processing elements and the result comparison is performed again. The computer system is halted only if a second mismatch occurs. Accordingly, the technique of lock-step duplexing with retry can be utilized for both the detection of faults and the recovery from transient errors. Due to the high occurrence rate of transient errors, lock-step duplexing systems with retry tend to have higher system availability than lock-step duplexing systems without retry. Still, there is a concern about data integrity exposures in all systems that are based on the lock-step duplexing technique. Such concern stems from common-mode errors.
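The retry enhancement may be sketched, under stated assumptions, as follows. The names make_pe and lockstep_with_retry are hypothetical, as is the max_attempts parameter; a transient fault is modeled as a processing element that corrupts only its first few results and then recovers.

```python
class SystemHalt(Exception):
    """Signals that the duplexed system has stopped after repeated mismatches."""


def make_pe(bad_runs=0):
    """Build a hypothetical processing element that corrupts its first
    `bad_runs` results and behaves correctly afterwards."""
    remaining = [bad_runs]

    def pe(x):
        correct = x + 1
        if remaining[0] > 0:
            remaining[0] -= 1
            return correct ^ 0x4  # assumed single-bit flip
        return correct

    return pe


def lockstep_with_retry(pe_a, pe_b, operand, max_attempts=2):
    """Compare both results; on a mismatch, retry both elements once."""
    for _ in range(max_attempts):
        result_a = pe_a(operand)
        result_b = pe_b(operand)
        if result_a == result_b:
            return result_a
    # A second consecutive mismatch is treated as a permanent fault.
    raise SystemHalt("mismatch persisted after retry")
```

With a single transient corruption, for example lockstep_with_retry(make_pe(), make_pe(bad_runs=1), 41), the retry succeeds and the system continues; a fault that persists across both attempts still halts the system.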
Common-mode errors (either permanent or transient), which may occur in any peripheral component of the computer system, such as memory, bus, etc., can potentially feed both lock-stepped processing elements with the same bad data and cause a data integrity violation without being detected. Consequently, it would be desirable to provide an improved and yet reasonably economical method for the detection, reporting, and recovery of transient errors in a computer system.
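The common-mode exposure described above may be illustrated with a minimal sketch. The shared memory is modeled as a dictionary, and the names processing_element and lockstep_compare, the address 0x10, and the single-bit corruption are all assumptions made for illustration only.

```python
def processing_element(memory, address):
    """Hypothetical processing element: load an operand and add one."""
    return memory[address] + 1


def lockstep_compare(memory, address):
    """Run both lock-stepped elements against the same shared memory.

    Returns (results_match, result_of_first_element).
    """
    result_a = processing_element(memory, address)
    result_b = processing_element(memory, address)
    return result_a == result_b, result_a


# Fault-free case: both elements read 41 and agree on 42.
memory = {0x10: 41}
good = lockstep_compare(memory, 0x10)

# Common-mode error: the shared memory word is corrupted before either
# element reads it, so both elements compute the same wrong result.
memory[0x10] ^= 0x8
bad = lockstep_compare(memory, 0x10)
```

In the corrupted case the comparison still reports a match, because both processing elements were fed the same bad data; the mismatch check alone therefore cannot expose the error.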