1. Field of the Invention
This invention relates to data processing systems having provision for error detection and recovery. More particularly, this invention relates to data processing systems which recover from errors by returning to and executing from a checkpoint in the instruction stream.
2. Related Art
The occurrence of random errors in data processing systems can have a substantial adverse effect on reliability and performance. A random or "soft" error is a deviation from the expected output or result which is not due to a continuing fault in the system. These random errors are generally characterized by non-reproducible fault syndromes caused by electrical noise or other randomly occurring sources. In contrast, "hard" errors are typically associated with electrical or mechanical component failures resulting in errors which are reproducible.
In early systems, random errors were handled by resetting the system and re-executing the affected program from the beginning. As programs became longer, this solution became unacceptable. Thus, in later systems, programs were segmented into parts such that errors could be handled by rerunning the program only from the beginning of the part in which the error occurred. As data processing systems and operating systems became more complex (due to, for example, multi-tasking) and time dependent, even this later approach became inadequate.
U.S. Pat. No. 4,901,233 to Liptay discloses a system which can execute instructions concurrently and out-of-sequence. A large number of instructions can be in some state of execution at any given time. While instructions can be executed out-of-sequence in such systems, the results of most instructions are not made visible to the rest of the system until the instruction is "architecturally completed". Completion of the instruction occurs when the execution of the instruction is finished, all exceptions associated with the instruction have been tested for and reported, and the instruction is the next one to logically complete (i.e. instructions complete in their logical program sequence). The challenges of recovering a Central Processor (CP) based on such a design from an error include the following:
1) determining which of the many instructions currently in process were affected by the error;
2) determining a known good state before the error occurred and returning the CP to that state; and
3) modifying the state of the machine in a manner consistent with the logical program sequence while retrying the instructions.
FIG. 1 shows six hypothetical instructions in various states of execution. Each division on this chart represents a clock cycle or machine cycle. Execution of one machine instruction can take one to many machine cycles depending on what function is to be accomplished, availability of data and resources, and other factors. The instructions are numbered in the order in which they were programmed, i.e. Instruction 0 is followed by Instruction 1 which is followed by Instruction 2 etc. Instruction 0 begins execution in cycle 1 and completes in cycle 3. Instruction 1 begins in cycle 2 but does not finish execution until cycle 6. Although Instructions 2 and 3 finish before Instruction 1, they cannot complete until Instruction 1 completes in order for the program to execute in its logically programmed order.
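By way of illustration only, the in-order completion rule described for FIG. 1 can be sketched as follows. The function name `completion_cycles` is hypothetical, the finish cycles for Instructions 2 through 5 are assumed values (the text gives only those for Instructions 0 and 1), and the sketch further assumes that several instructions may complete in the same cycle.

```python
def completion_cycles(finish_cycles):
    """Return the cycle in which each instruction architecturally completes.

    finish_cycles[i] is the cycle in which instruction i finishes executing.
    An instruction completes no earlier than it finishes and no earlier than
    its predecessor completes (assumption: several instructions may complete
    in the same cycle).
    """
    completed = []
    previous = 0
    for finish in finish_cycles:
        previous = max(finish, previous)
        completed.append(previous)
    return completed


# Instructions 0 and 1 finish in cycles 3 and 6, as in the text.
# The finish cycles for Instructions 2 to 5 are assumed for illustration.
finish = [3, 6, 4, 5, 7, 8]
print(completion_cycles(finish))  # [3, 6, 6, 6, 7, 8]
```

Under these assumed timings, Instructions 2 and 3 finish in cycles 4 and 5 but are held until cycle 6, when Instruction 1 completes.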
U.S. Pat. No. 4,044,337 to Hicks et al. entitled "INSTRUCTION RETRY MECHANISM FOR A DATA PROCESSING SYSTEM" and assigned to the International Business Machines Corporation describes an error recovery mechanism that establishes a checkpoint at the beginning of the execution of a machine instruction. Since instructions are executed out-of-sequence in the system described in Liptay's patent, application of this technique to this system would require checkpointing up to the maximum number of instructions capable of being outstanding at one time and sorting out the order to reexecute the instructions. The amount of checkpoint information to accomplish this would be large and the probability that the checkpoint information itself was damaged by the error would therefore increase, lowering the overall ability of the processor to recover from the error.
A further complication is introduced by the potential of instructions executed out-of-sequence to be cancelled if a previous, yet unfinished instruction results in a change in the assumed instruction sequence. This could be due to a branch resolving differently than was expected or an instruction encountering an exception or an asynchronous interruption (for example an I/O, external, or repressible machine check interrupt or an operator request) becoming pending or enabled during some point in the processing. This would also have to be sorted out during the retry process to determine which checkpoints to save and which to discard.
Given that in Liptay's processor design some instructions may have already finished but not yet completed, another approach could be to save the results of the finished instructions and apply the algorithm of Hicks et al. to only those instructions not yet finished.
In FIG. 1, if an error occurs in cycle 5, instruction 4 has begun, instruction 3 is just finishing and instruction 2 is finished although not yet complete. Since the error could have potentially affected one, some, all or none of the instructions in process, instructions 1, 3, and 4 need to be redone. Although instruction 2 has finished, trying to preserve its results has the following problems:
1) the logic to determine which instructions are finished and then sort out which need to be redone is prohibitively complex;
2) changes in the assumed instruction sequence would still have to be sorted out, discarding or saving checkpoints and finished results in this case instead of only checkpoints; and
3) any soft error in the logic which retains the finished status of an instruction would make the instruction unretriable since this information would be required to retry successfully.
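The bookkeeping that such an approach would require can be sketched, by way of illustration only, as a classification of each instruction at the cycle in which the error occurs. The function name `states_at` and the start and finish cycles shown are assumptions chosen to agree with the description of FIG. 1 above; they are not taken from the figure itself.

```python
def states_at(error_cycle, timings):
    """Classify each instruction at the cycle in which an error occurs.

    timings is a list of (start_cycle, finish_cycle) pairs in program order.
    Completion is in order: an instruction completes no earlier than it
    finishes and no earlier than its predecessor completes.
    """
    states = []
    prev_complete = 0
    for start, finish in timings:
        complete = max(finish, prev_complete)
        prev_complete = complete
        if start > error_cycle:
            state = "not started"
        elif complete < error_cycle:
            state = "completed"
        elif finish < error_cycle:
            state = "finished, not complete"
        else:
            state = "in process"
        states.append(state)
    return states


# Assumed (start, finish) cycles for Instructions 0 to 5.
timings = [(1, 3), (2, 6), (3, 4), (4, 5), (5, 7), (6, 8)]
for number, state in enumerate(states_at(5, timings)):
    print(f"Instruction {number}: {state}")
```

With these assumed timings and an error in cycle 5, Instructions 1, 3, and 4 are classified as in process and Instruction 2 as finished but not complete, which matches the case discussed above. Any such classification depends on the finished status being intact, which is the weakness noted in item 3.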
Additionally, some operations executed by processors designed to meet the specifications described in the Enterprise Systems Architecture/390 Principles of Operation (ESA/390 POP) require or are implemented such that results are made visible to and possibly cause a reaction in other system facilities as part of their execution and before they complete. For example, in some computer systems, communication between the I/O subsystem and the Central Processor is accomplished via interlocked updates to data areas in storage known as control blocks.
An interlocked update consists of first fetching a particular part of the control block called a lock byte using a hardware interlocked fetch. A hardware interlocked fetch, also called obtaining a hardlock, is a mechanism which ensures that the requestor of such a fetch has exclusive rights to the data and no other requestors may store into or fetch that data until the interlock has been released. Once the hardlock is obtained, an identifying tag known as a softlock is placed into this byte, the lock byte is stored and the hardlock is released. Any other elements wanting to use the data covered by this softlock will see the tag in the lock byte and will not use the data until the lock byte is cleared. Both obtaining the softlock and releasing it require updating storage before the instruction has completed. Additionally, once the processor has control of the softlock, it will proceed to update other areas in the control blocks, thereby initiating the desired I/O operation and changing the state of the I/O Subsystem. If an error occurs during one of these I/O instructions, storage has been altered and the effects have potentially been seen by other system elements. The altered storage location also may have caused a reaction in another processing element. Thus, simply backing up the processor to its state before or at the start of that operation will not nullify the effects of the execution even though the instruction did not complete. Special handling of these types of operations is therefore required when a soft error is encountered. It is desired, however, that the number of special cases to be handled be minimized to reduce the number of algorithms to be developed, maintained and tested.
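The hardlock and softlock sequence described above can be modeled in software, purely as an illustrative sketch. The class `ControlBlock`, the lock byte offset, the value used for a cleared lock byte, and the tags are all hypothetical, and a `threading.Lock` merely stands in for the hardware interlock.

```python
import threading

LOCK_BYTE = 0  # hypothetical offset of the lock byte within the control block
FREE = 0x00    # hypothetical value of a cleared lock byte


class ControlBlock:
    def __init__(self, size=8):
        self.storage = bytearray(size)
        self._hardlock = threading.Lock()  # stands in for the hardware interlock

    def acquire_softlock(self, tag):
        """Try to place a nonzero tag in the lock byte; return True on success."""
        with self._hardlock:                     # interlocked fetch (hardlock)
            if self.storage[LOCK_BYTE] != FREE:  # another element holds the softlock
                return False
            self.storage[LOCK_BYTE] = tag        # store the softlock tag
            return True                          # hardlock released on exit

    def release_softlock(self, tag):
        """Clear the lock byte if it holds this element's tag."""
        with self._hardlock:
            if self.storage[LOCK_BYTE] != tag:
                raise RuntimeError("softlock not held by this element")
            self.storage[LOCK_BYTE] = FREE


CP_TAG, IO_TAG = 0x01, 0x02  # hypothetical tags for two system elements
block = ControlBlock()
print(block.acquire_softlock(CP_TAG))  # True: the CP now holds the softlock
block.storage[1] = 0xAB                # update made under the softlock
print(block.acquire_softlock(IO_TAG))  # False: the I/O element sees the tag
block.release_softlock(CP_TAG)
```

In this model, the tag and the byte written under the softlock are already present in storage while the operation is still in progress, so restoring only the processor state would not undo them.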
Two other examples of such operations which update system facilities before they have completed are certain implementations of Page In and Page Out to Expanded Store. During execution of these instructions, a page of data is moved into or out of main storage, respectively. For some implementations, paging occurs before the instruction completes. The data which is moved can thus be seen by processing elements other than the Central Processor which executed the instruction. If an error occurs while one of these instructions is executing, the results cannot be undone by backing up the processor to its state before the execution of the instruction.
As an additional aid in error recovery, it is known in the art to provide a computer system with a checkpoint register. Such a facility is described in an article entitled "CHECKPOINT REGISTER" (IBM Technical Disclosure Bulletin, Vol. 27, No. 4A, September 1984, pp. 2231-2232). The checkpoint register, located in the instruction element of a central processor, can be set to any value by the Central Processor microcode via an emit field. The checkpoint register is reset at the beginning (successful first cycle) of each instruction. The register can be tested by CP microcode for retry and detailed error reporting.
When an error in the CP occurs, the CP clocks are stopped. A maintenance and support service facility (MSSF) then scans the registers and triggers in the CP including the checkpoint register. The MSSF tests an error trigger which denotes checkpoint register parity error. If this trigger is on, the MSSF considers the error unretriable, a system damage (SD) indication is set, and the CP is set to the checkstop state. The MSSF then checks the Checkpoint register. A value of 64 through 255 indicates system damage. The MSSF then tests the non-retriable triggers in the CP. If retriable, the system damage will only be reported when the retry threshold is reached and a machine check interrupt is taken.
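The order of the MSSF tests described above can be sketched as follows, for illustration only. The function name `mssf_checks`, its parameters, and the returned dictionary are hypothetical, and the sketch does not model retry counting or the retry threshold.

```python
def mssf_checks(parity_error_trigger, checkpoint_value, non_retriable_trigger):
    """Apply the MSSF tests in the order given in the text.

    Returns a dictionary of hypothetical flags; this is an illustrative
    model, not the MSSF's actual interface.
    """
    if parity_error_trigger:
        # The checkpoint register itself is suspect: the error is treated as
        # unretriable, system damage is set, and the CP is checkstopped.
        return {"system_damage": True, "retriable": False, "checkstop": True}

    # A checkpoint register value of 64 through 255 indicates system damage.
    system_damage = 64 <= checkpoint_value <= 255

    # The non-retriable triggers in the CP are tested last.
    retriable = not non_retriable_trigger
    return {"system_damage": system_damage, "retriable": retriable, "checkstop": False}


print(mssf_checks(False, 12, False))
print(mssf_checks(False, 200, False))
print(mssf_checks(True, 12, False))
```

Where the sketch returns both `system_damage` and `retriable` as true, the text indicates that system damage is reported only once the retry threshold is reached and a machine check interrupt is taken.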
The value contained in the Checkpoint register tells the MSSF what type of operation was in progress in the CP when the error occurred, as well as whether or not special action is required by the MSSF to return the CP to a logically consistent state, or that the CP cannot be brought back to a consistent state except by a system reset (i.e. a "System Damage" condition exists).
In summary, the Checkpoint Register is a means of communicating to the MSSF the internal state of the CP for purposes of error recovery (checkpoint retry), and error reporting to the program (machine check interruption).
While the above described "prior art" Checkpoint Register provides many advantages in terms of error recovery, it does not provide for crisp transitions synchronized with the completion of checkpoints. In the above described register, data that is entered into the register is immediately available to the MSSF. In systems such as that described in U.S. Pat. No. 4,901,233 to Liptay, however, where checkpoint transitions are marked by instruction completion, the data presently in the register may be obsolete or premature. Thus, a mechanism is needed to ensure that the data provided to the MSSF is neither obsolete nor premature.