A. Field of the Invention
The present invention relates to error recovery in computer systems. More particularly, the present invention relates to recovery from processing errors caused by AC or timing-dependent defects.
B. Related Art
The Unscheduled Incident Repair Action (UIRA) is perhaps the single most important Reliability, Availability and Serviceability (RAS) characteristic. UIRAs are caused by a non-recoverable failure in a critical hardware function, which results in the need to bring a customer's system down for repair at an unscheduled time. Circuit failures causing UIRAs can be either AC or DC in nature. DC defects are solid failures which occur whenever a defective circuit is used. AC defects are typically timing-dependent and show up only when a timing margin in a logic path is exceeded.
Self-test mechanisms that can distinguish AC defects from DC defects are known in the art. For example, in cases where logic fails a self-test at a first clock speed, it is known in the art to rerun the self-test at a lower clock speed to determine whether the failure was caused by an AC defect or a DC defect. If the self-test passes at the lower clock speed, the failure is identified as having been caused by an AC defect. If the self-test does not pass at the lower clock speed, the failure is identified as having been caused by a DC defect. An article entitled "SELF-TEST AC ISOLATION" (IBM Technical Disclosure Bulletin Vol. 28, No. 1, June 1985, pp. 49-51) describes a method to identify the initiating clock pulse of an AC failure, to identify the capturing clock pulse, to identify the capturing storage elements, and to extract the hardware states just prior to and just after the failure for further diagnosis.
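The rerun-at-lower-speed classification described above can be illustrated with a minimal sketch. It is not taken from any of the cited references. The names `classify_defect`, `Defect`, and the three logic models, as well as the clock values, are hypothetical and chosen only for illustration; the self-test is modeled as a function that returns True when the logic passes at a given clock speed.

```python
from enum import Enum
from typing import Callable


class Defect(Enum):
    NONE = "none"  # self-test passed at the first clock speed
    AC = "ac"      # failed at the first speed, passed at the lower speed
    DC = "dc"      # failed at both speeds


def classify_defect(
    self_test: Callable[[float], bool],
    first_mhz: float,
    lower_mhz: float,
) -> Defect:
    """Classify a failure by rerunning the self-test at a lower clock speed."""
    if self_test(first_mhz):
        return Defect.NONE
    if self_test(lower_mhz):
        return Defect.AC
    return Defect.DC


# Hypothetical logic models (assumptions, for illustration only).
def timing_limited_logic(clock_mhz: float) -> bool:
    # Passes only while the assumed timing margin (80 MHz) is not exceeded.
    return clock_mhz <= 80.0


def solid_fault_logic(clock_mhz: float) -> bool:
    # Fails whenever the defective circuit is used, regardless of speed.
    return False


def healthy_logic(clock_mhz: float) -> bool:
    return True
```

Under these assumptions, calling `classify_defect(timing_limited_logic, 100.0, 50.0)` yields `Defect.AC`, while the same call with `solid_fault_logic` yields `Defect.DC`.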
While the above test methods provide a means for distinguishing AC defects from DC defects and for fault isolation within a test fixture environment, they do not solve the problem of providing dynamic error recovery or fault tolerance from processing errors caused by AC defects.
Prior art computer systems have been provided with a variety of mechanisms for recovering from processing errors. For example, U.S. Pat. No. 4,912,707 to Kogge et al. discloses the use of a checkpoint retry mechanism which enables the retry of instruction sequences for segments of recently executed code, in response to detection of an error since the passage of a current checkpoint. Another example of an instruction retry mechanism is disclosed in U.S. Pat. No. 4,044,337 to Hicks et al.
While such prior art retry mechanisms provide a good means for recovery from soft errors (errors occurring because of electrical noise or other randomly occurring sources which result in non-reproducible fault syndromes), they do not provide recovery from solid or hard errors caused by AC defects (i.e., timing errors which are recurring and consistently reproducible).
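The limitation noted above can be illustrated with a simplified sketch of checkpoint retry. It does not reproduce the mechanisms of the cited patents. The function `run_with_retry`, the fault models, and the use of `RuntimeError` to stand in for a detected hardware error are all hypothetical. The sketch assumes a code segment that operates on a small state dictionary and a retry that restores the state saved at the checkpoint.

```python
from typing import Callable, Dict

State = Dict[str, int]


def run_with_retry(
    segment: Callable[[State], None],
    state: State,
    max_retries: int = 3,
) -> bool:
    """Run a segment; on a detected error, restore the checkpoint and retry."""
    checkpoint = dict(state)  # snapshot taken at the checkpoint
    for _ in range(max_retries + 1):
        try:
            segment(state)
            return True  # segment completed without a detected error
        except RuntimeError:
            # Roll back any partial updates before the next attempt.
            state.clear()
            state.update(checkpoint)
    return False  # error persisted through every retry


class SoftErrorSegment:
    """Hypothetical segment hit once by a random, non-reproducible error."""

    def __init__(self) -> None:
        self.fired = False

    def __call__(self, state: State) -> None:
        state["acc"] += 1  # partial update before the error is detected
        if not self.fired:
            self.fired = True
            raise RuntimeError("soft error detected")


def recurring_timing_error_segment(state: State) -> None:
    """Hypothetical segment that fails the same way on every execution."""
    state["acc"] += 1
    raise RuntimeError("timing error detected")
```

In this model the soft error is cleared by a single retry, whereas the recurring timing error exhausts every retry because re-executing under the same conditions reproduces the same fault.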
Another prior art mechanism for handling processing errors involves the use of redundant processing elements. In such systems, identical instruction streams are processed in parallel by two or more processing elements. When an unrecoverable error is detected in one of the processing elements, it is taken off-line and the other processing element continues to process the instruction stream. One advantage of such redundant processor schemes is that they can handle both "soft" and "solid" or "hard" errors. The disadvantage of such schemes is that providing duplicate processing elements to increase "fault tolerance" significantly increases the cost of the system in terms of parts and manufacture.
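The redundant scheme described above can be illustrated with a minimal sketch. The class `ProcessingElement`, the function `run_redundant`, and the `fails_from` parameter are hypothetical and model only the behavior stated above: identical instruction streams run on each element, and an element that reports an unrecoverable error is taken off-line while the other continues.

```python
from typing import List, Optional


class ProcessingElement:
    """Hypothetical processing element that accumulates operands."""

    def __init__(self, name: str, fails_from: Optional[int] = None) -> None:
        self.name = name
        self.online = True
        self.acc = 0
        # Assumed fault model: unrecoverable error from this instruction index on.
        self.fails_from = fails_from

    def execute(self, index: int, operand: int) -> bool:
        """Execute one instruction; return False on an unrecoverable error."""
        if self.fails_from is not None and index >= self.fails_from:
            return False
        self.acc += operand
        return True


def run_redundant(
    elements: List[ProcessingElement],
    stream: List[int],
) -> Optional[int]:
    """Feed the same stream to every on-line element; drop elements that fail."""
    for index, operand in enumerate(stream):
        for pe in elements:
            if pe.online and not pe.execute(index, operand):
                pe.online = False  # take the failing element off-line
    survivors = [pe for pe in elements if pe.online]
    return survivors[0].acc if survivors else None
```

The sketch also makes the cost visible: every instruction is executed once per element, so the fault tolerance is paid for with duplicated hardware.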
Thus, what is needed is an inexpensive mechanism to enable an otherwise conventional computer system to dynamically recover from AC defects.