1. Technical Field
The present invention relates to a fault-tolerant computer system and, more particularly, to the use of shadow registers for instruction rollback in a fault-tolerant data processing system.
2. Background Art
Data processing systems have historically suffered from both soft errors and hard errors. Soft errors are usually defined as those deviations from the expected output which occur because of electrical noise or other randomly occurring sources and which result in nonreproducible fault syndromes. Hard errors are typically associated with electrical or mechanical component failures producing errors which are reproducible. Many arrangements for fault-tolerant data processing systems have been developed in the prior art. A typical example of a fault-tolerant system is the provision of two or more identical data processing elements which operate on the same instruction stream and whose outputs are compared with one another. When a difference is detected in the outputs of a pair of data processing elements, it can be inferred that either a soft error or a hard error has occurred. Typically in the prior art, the data processors are then restarted and the instruction stream is executed in a stepwise manner until the error is detected again. If no error occurs, then the initial error is determined to be a soft error. If the error is repeated when stepping through the instruction stream, then the instruction at which the error occurs can be identified. This prior art approach to the retry of instructions after the detection of a fault is a lengthy one. The prior art has not found a suitably efficient or fast technique for the retry of instructions after fault detection. An example of a checkpoint retry mechanism is described in U.S. Pat. No. 4,912,707 to Kogge et al., Mar. 27, 1990, commonly assigned to IBM Corporation and whose teachings are incorporated herein by reference.
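By way of illustration only, the prior art compare-and-retry procedure described above may be sketched as follows. The sketch assumes that each instruction can be modeled as a function from one integer state value to the next; the names run_compared and classify_fault are hypothetical and do not appear in any cited reference.

```python
from typing import Callable, Optional, Sequence, Tuple

# Assumption: an instruction is modeled as a function on a single integer state.
Instruction = Callable[[int], int]


def run_compared(stream_a: Sequence[Instruction],
                 stream_b: Sequence[Instruction],
                 state: int = 0) -> Optional[int]:
    """Step two processing elements through the same instruction stream.

    Returns the index of the first instruction whose outputs differ,
    or None if the outputs agree for the whole stream.
    """
    a = b = state
    for index, (instr_a, instr_b) in enumerate(zip(stream_a, stream_b)):
        a, b = instr_a(a), instr_b(b)
        if a != b:
            return index
    return None


def classify_fault(stream_a: Sequence[Instruction],
                   stream_b: Sequence[Instruction],
                   state: int = 0) -> Tuple[str, Optional[int]]:
    """Classify a miscompare as soft or hard by restarting and re-stepping."""
    if run_compared(stream_a, stream_b, state) is None:
        return ("none", None)
    # Restart from the beginning and step through the stream again.
    retry = run_compared(stream_a, stream_b, state)
    if retry is None:
        return ("soft", None)   # error did not reproduce
    return ("hard", retry)      # error reproduced at this instruction
```

Because the retry restarts from the beginning of the stream, its cost grows with the length of the stream, which corresponds to the lengthy retry noted above.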
It is known that, to permit circumvention of errors, at least one valid starting point must exist at all times during the normal operation of a data processing system. This valid starting point enables the recovery routine which follows a fault to return to the object program and therefore maintain the functional integrity of the data processing system.
To roll back a central processing unit to the last known good state which preceded a detected error, the traditional method relies heavily on software. The software task includes storing the processing unit's state in a dedicated memory location periodically throughout normal instruction processing. This technique is prohibitively time consuming for many applications. One obvious alternative to the software solution is to store the processing unit's state in dedicated hardware. The drawback to this approach is the large amount of hardware required. What is needed is an improved method and apparatus to provide fault-tolerant checkpoint and retry within a data processing system.
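By way of illustration only, the traditional software checkpoint technique described above may be sketched as follows. The class CheckpointedCPU, its interval parameter, and its register names are hypothetical. The sketch does not model error detection itself and assumes that any fault is detected before the next checkpoint is stored.

```python
import copy
from typing import Callable, Dict

Registers = Dict[str, int]


class CheckpointedCPU:
    """Software model of periodic state saving with rollback (illustrative)."""

    def __init__(self, registers: Registers, interval: int) -> None:
        self.registers = dict(registers)
        self.interval = interval          # instructions between checkpoints
        # A valid starting point exists from the outset.
        self.checkpoint = copy.deepcopy(self.registers)
        self.checkpoint_index = 0         # instruction count at the checkpoint
        self.executed = 0
        self.saves = 0                    # number of full state copies made

    def step(self, instruction: Callable[[Registers], None]) -> None:
        """Execute one instruction; periodically store the full state."""
        instruction(self.registers)
        self.executed += 1
        if self.executed % self.interval == 0:
            # Copying the entire state is the recurring cost of this method.
            self.checkpoint = copy.deepcopy(self.registers)
            self.checkpoint_index = self.executed
            self.saves += 1

    def rollback(self) -> int:
        """Restore the last stored state; return the index to resume from."""
        self.registers = copy.deepcopy(self.checkpoint)
        self.executed = self.checkpoint_index
        return self.checkpoint_index
```

In this sketch every checkpoint copies the complete register state, so a shorter interval reduces the work lost on rollback but increases the time spent saving state during normal processing.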