This invention relates to retry mechanisms for digital computers and digital data processors for enabling such computers and processors to automatically and correctly recover from intermittent and transient hardware-generated errors.
Various methods have been heretofore proposed for enabling digital data processors to recover from intermittent and transient errors. One previously proposed approach is to stop the data processing completely upon detection of an error and to restart from the beginning. This restarting is accomplished by reloading the program by means of an initial program load. Another known method uses a programmed retry technique called "checkpointing". In this method, every program is written to incorporate retry provisions which include insertion of checkpoints within a computer program and instructions for saving all system data and control information at each checkpoint until the next checkpoint is reached. When an error occurs, the system is returned under program control to its condition at the last checkpoint. A further known approach makes use of error detection circuits which monitor the execution of instructions. Each instruction has a threshold point after which execution may not be retried because the source data has been changed. If the error occurs before the threshold point has been reached for the particular instruction being executed, the processor is immediately stopped and a retry of that instruction is carried out. If, however, the threshold has been passed, no retry is possible.
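By way of illustration only, the programmed checkpointing technique described above can be sketched in a few lines of Python. The names `CheckpointedMachine`, `TransientError` and `run_with_retry` are hypothetical and do not appear in any of the cited art; the sketch assumes that the entire system state fits in one object that can be copied, and that a checkpoint is taken before every step.

```python
import copy


class TransientError(Exception):
    """Hypothetical stand-in for an intermittent hardware-generated error."""


class CheckpointedMachine:
    """Toy machine whose entire state is saved at each checkpoint."""

    def __init__(self, state):
        self.state = state
        self._saved = copy.deepcopy(state)

    def checkpoint(self):
        # Save all data and control information until the next checkpoint.
        self._saved = copy.deepcopy(self.state)

    def rollback(self):
        # Return the machine to its condition at the last checkpoint.
        self.state = copy.deepcopy(self._saved)


def run_with_retry(machine, steps, max_retries=3):
    """Run each step; on a TransientError, roll back and try the step again."""
    for step in steps:
        machine.checkpoint()
        for _attempt in range(max_retries + 1):
            try:
                step(machine.state)
                break  # step completed without error
            except TransientError:
                machine.rollback()
        else:
            # Every attempt failed, so the error is treated as not transient.
            raise RuntimeError("step failed after all retries")
```

Because the rollback discards any partial changes made by the failed attempt, a step that modifies its own input can still be retried safely. The cost is that the full state is copied at every checkpoint.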
U.S. Pat. No. 3,533,065, granted on Oct. 6, 1970 to McGilvary et al, entitled "Data Processing System Execution Retry Control" and assigned to International Business Machines Corporation, describes the use of buffers to hold information which will be necessary to reset the data processor to a previous correct state for performing a retry after occurrence of an error. Preservation of the information necessary for retry is controlled by additional fields that are added to all of the microwords in the control section which controls execution of the programmed instructions.
U.S. Pat. No. 3,533,082, granted on Oct. 6, 1970 to Schnabel et al, entitled "Instruction Retry Apparatus Including Means for Restoring the Original Contents of Altered Source Operands" and assigned to International Business Machines Corporation, describes a mechanism which achieves retry of instructions by providing a back-up store for storing all source operands which may be destroyed when results are stored back to memory. When an error is detected, the control unit returns the system to the beginning of the instruction to be retried. When the point is reached during the retry at which the source operands are to be addressed, a storage data change indicator is interrogated and, if any source operand was changed during the prior erroneous execution attempt, the source operands are taken from the back-up store and set into their correct locations in the main store.
U.S. Pat. No. 3,736,566, granted on May 29, 1973 to Anderson et al, entitled "Central Processing Unit with Hardware Controlled Checkpoint and Retry Facilities" and assigned to International Business Machines Corporation, describes the use of temporary storage registers associated with the main store, the general purpose registers and the floating point registers for storing copies of data operands before they are modified during the processing of instructions. If the data processing system must recover to some known condition, the temporary copies can be used to restore the contents of the main storage locations, general purpose registers and floating point registers to reflect the values of the operands at the time of the known condition.
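The idea of preserving operands before they are overwritten, described above, can be illustrated with a minimal Python sketch. It is a software analogy under stated assumptions and not the circuitry of either cited patent: the class `MainStore`, its methods, and the single `changed` flag (standing in for a storage data change indicator) are all hypothetical names chosen for this example.

```python
class MainStore:
    """Toy main store with a back-up store for operands that get overwritten."""

    def __init__(self, cells):
        self.cells = list(cells)
        self.backup = {}       # address -> value held before the first write
        self.changed = False   # stand-in for a storage data change indicator

    def begin_instruction(self):
        # A new instruction starts with an empty back-up store.
        self.backup.clear()
        self.changed = False

    def write(self, addr, value):
        # Save the original contents once, before the first modification.
        if addr not in self.backup:
            self.backup[addr] = self.cells[addr]
        self.cells[addr] = value
        self.changed = True

    def restore(self):
        # Before a retry, put back any operand that the failed attempt altered.
        if self.changed:
            for addr, original in self.backup.items():
                self.cells[addr] = original
        self.begin_instruction()


def add_in_place(store, dst, src):
    """Hypothetical instruction that destroys its own source: dst = dst + src."""
    store.write(dst, store.cells[dst] + store.cells[src])
```

Without the restore step, retrying `add_in_place` after a failure that occurred past the write would add the source operand twice. With it, the retry starts from the original operand values.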
U.S. Pat. No. 3,786,430, granted on Jan. 15, 1974 to Hajdu et al, entitled "Data Processing System Including a Small Auxiliary Processor for Overcoming the Effects of Faulty Hardware" and assigned to International Business Machines Corporation, describes the use of a relatively small auxiliary error processor associated with a relatively large main processor, with the error processor being operative in the case of an error to intercept the function of the main processor and to compute a correct partial or final result which is then returned to the main processor. Thus, the auxiliary error processor simulates the function of a malfunctioning unit of the main processor for enabling the main processor to produce the correct results and to continue its processing of data.
Copending U.S. patent application Ser. No. 486,043, filed on July 5, 1974, now U.S. Pat. No. 4,020,466, issued Apr. 26, 1977, in the names of Cordi and Edson, entitled "A Memory Hierarchy System with Journaling and Copyback" and assigned to International Business Machines Corporation, describes a hierarchical memory system having two memory units on each level of the hierarchy. One of the units, called the data store, contains all the data at that level of the memory. The other unit, called the copyback store, contains all the changes that have been made in the data at that level of the memory. The data store and the copyback store on each level are on two different power supply systems so that if either power supply system fails, a complete record of the data is nevertheless available. A journal is maintained at each level for listing the addresses of the changed data in the order in which they were entered into the copyback store. This journal listing enables the changed data in the copyback store to be transferred to the next lower level in the memory hierarchy in the order of its age, with the oldest changed data being transferred first.
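The journaling and copyback arrangement described above lends itself to a short Python sketch. The class `MemoryLevel` and its methods are hypothetical names for this illustration. The sketch further assumes, which the description above does not state, that a second write to an address already awaiting copyback keeps its original journal position and merely updates the pending value.

```python
from collections import deque


class MemoryLevel:
    """Toy model of one level of a hierarchy with a data store, a copyback
    store, and a journal of changed addresses."""

    def __init__(self, lower=None):
        self.data_store = {}      # all data held at this level
        self.copyback_store = {}  # changes not yet passed to the lower level
        self.journal = deque()    # addresses in order of entry (oldest first)
        self.lower = lower        # next lower level, if any

    def write(self, addr, value):
        self.data_store[addr] = value
        if addr not in self.copyback_store:
            # Assumption: only the first pending change to an address is journaled.
            self.journal.append(addr)
        self.copyback_store[addr] = value

    def copy_back_oldest(self):
        """Transfer the oldest changed entry to the next lower level."""
        if not self.journal:
            return None
        addr = self.journal.popleft()
        value = self.copyback_store.pop(addr)
        if self.lower is not None:
            self.lower.write(addr, value)
        return addr
```

In this sketch the journal is what fixes the transfer order: draining it from the front sends the oldest changed data down first, as in the arrangement described above. The separate power supplies for the two stores are outside the scope of the sketch.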
The mechanisms described in the above-cited patents and patent application provide generally improved performance in accomplishing their desired objectives in their particular operating environments. There is, however, room for further improvement in the retrying of erroneously executed or failed instructions, and it is a primary purpose of the present invention to provide a goodly measure of such further improvement.
For a better understanding of the present invention and its various advantages and features, reference is made to the following description taken in connection with the accompanying drawings, the scope of the invention being pointed out in the appended claims.