1. Field of the Invention
The invention relates generally to computing systems. The invention relates more specifically to a method for recovering transparently and automatically from different types of error conditions which may develop inside a digital computer.
2a. Cross Reference to Related Applications
The following copending U.S. patent application(s) is/are assigned to the assignee of the present application, is/are related to the present application and its/their disclosures is/are incorporated herein by reference:
(A) Ser. No. 07/813,891 filed Dec. 23, 1991, by Christopher Y. Satterlee et al, and entitled, IMPROVED METHOD AND APPARATUS FOR LOCATING SOURCE OF ERROR IN HIGH-SPEED SYNCHRONOUS SYSTEMS;
(B) Ser. No. 07/670,289, entitled "SCANNABLE SYSTEM WITH ADDRESSABLE SCAN RESET GROUPS", by Robert Edwards et al, filed Mar. 15, 1991; and
(C) Ser. No. 07/814,389, entitled "METHOD AND APPARATUS FOR MAINTAINING DETERMINISTIC BEHAVIOR IN A FIRST SYNCHRONOUS SYSTEM WHICH RESPONDS TO INPUTS FROM NONSYNCHRONOUS SECOND SYSTEM", by James Millar et al, filed Dec. 26, 1991.
2b. Cross Reference to Related Patents
The following U.S. patent(s) is/are assigned to the assignee of the present application, is/are related to the present application and its/their disclosures is/are incorporated herein by reference:
(A) U.S. Pat. No. 3,840,861, DATA PROCESSING SYSTEM HAVING AN INSTRUCTION PIPELINE FOR CONCURRENTLY PROCESSING A PLURALITY OF INSTRUCTIONS, issued to Amdahl et al, Oct. 8, 1974;
(B) U.S. Pat. No. 3,931,611, PROGRAM EVENT RECORDER AND DATA PROCESSING SYSTEM, issued to Grant et al, Jan. 6, 1976;
(C) U.S. Pat. No. 4,244,019, DATA PROCESSING SYSTEM INCLUDING A PROGRAM-EXECUTING SECONDARY SYSTEM CONTROLLING A PROGRAM-EXECUTING PRIMARY SYSTEM, issued to Anderson et al, Jan. 6, 1981;
(D) U.S. Pat. No. 4,661,953, ERROR TRACKING APPARATUS IN A DATA PROCESSING SYSTEM, issued to Venkatesh et al, Apr. 28, 1987;
(E) U.S. Pat. No. 4,679,195, ERROR TRACKING APPARATUS IN A DATA PROCESSING SYSTEM, issued to Dewey, Jul. 7, 1987;
(F) U.S. Pat. No. 4,685,058, TWO-STAGE PIPELINED EXECUTION UNIT AND CONTROL STORES, issued to Lee et al, Aug. 4, 1987;
(G) U.S. Pat. No. 4,752,907, INTEGRATED CIRCUIT SCANNING APPARATUS HAVING SCANNING DATA LINES FOR CONNECTING SELECTED DATA LOCATIONS TO AN I/O TERMINAL, issued to Si et al, Jun. 21, 1988;
(H) U.S. Pat. No. 4,802,088, METHOD AND APPARATUS FOR PERFORMING A PSEUDO BRANCH IN A MICROWORD CONTROLLED COMPUTER SYSTEM, issued to Rawlinson et al, Jan. 31, 1989;
(I) U.S. Pat. No. 4,819,166, MULTI-MODE SCAN APPARATUS, issued to Si et al, Apr. 4, 1989; and
(J) U.S. Pat. No. 4,855,947, MICROPROGRAMMABLE PIPELINE INTERLOCKS BASED ON THE VALIDITY OF PIPELINE STATES, issued to Zmyslowski et al, Aug. 8, 1989.
3. Description of the Related Art
The terms "automatic error recovery" and "fault tolerant" are used here to refer to the ability of a certain class of computers to automatically correct internal error conditions and to continue operations at near-top-speed. When such computers are employed and correctable error conditions occur, end users receive correct computational results without ever being aware that the error conditions occurred or that the computer corrected them by itself. (It should be noted that not all errors are self-correctable by the machine.)
One previous form of automatic error recovery is referred to as "checkpoint recovery". This type of recovery is found, by way of example, in the IBM 3080 family of mainframe computers.
In checkpoint recovery, operations within the computer are periodically halted at prescheduled "checkpoints." A back-up copy of the machine state is made at each checkpoint to safeguard against the possibility that an error condition will develop before the next checkpoint.
When the computer is first turned on, system clocks are turned off, a master reset is applied and all parts of the computer are tested to make sure they are error free. A snapshot of the entire machine state is taken and saved in memory. This snapshot is defined as the last-known error-free machine state.
System clocks are then turned on for a brief period of time (e.g., 1,000,000 clock cycles), allowing a burst of operations to take place within the computer. When a first post-reset checkpoint is reached, clocks are again halted and a search is conducted for raised error flags. If no errors are found to have occurred within the brief run, a snapshot of the new machine state is taken and preserved as the last-known error-free state. Operations are allowed to resume until a second post-reset checkpoint is reached. The process repeats as long as there are no errors.
If any errors are found to have occurred within the last run, the computer is reloaded with its last-known error-free state and the run is tried again. If the error does not reappear in the retry, as is common with many types of "soft" errors (e.g., noise induced or alpha particle induced errors), users of the computer are left unaware that an error ever occurred. Such recovery is referred to as end-user transparent.
If the error does not go away after a predetermined number of retries, a "machine-check" flag is raised and operations are halted to await high-level intervention (correction by the system operator). This condition is undesirable because the error condition is made very apparent to end-users. Their terminals become nonresponsive and they quickly realize that the computer has been brought "down" by some sort of defect. If shut-downs occur too often, end-users begin to lose faith in the reliability of the machine.
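By way of illustration only, the checkpoint cycle described above may be sketched in software as follows. The sketch is a simplified model and not the mechanism of any particular machine: the names Machine, MachineCheck, run_with_checkpoints, run_burst and max_retries are hypothetical, the machine state is reduced to a small dictionary, and a single error_flag stands in for the raised error flags that are searched at each checkpoint.

```python
import copy
from dataclasses import dataclass, field
from typing import Callable, Dict


class MachineCheck(Exception):
    """Hypothetical signal: retries exhausted, operator intervention needed."""


@dataclass
class Machine:
    # Hypothetical stand-in for the entire machine state.
    state: Dict[str, int] = field(default_factory=lambda: {"cycles": 0})
    error_flag: bool = False


def run_with_checkpoints(machine: Machine,
                         run_burst: Callable[[Machine], None],
                         num_bursts: int,
                         max_retries: int = 3) -> int:
    """Run num_bursts bursts with a checkpoint after each one.

    run_burst advances the machine by one burst and may raise error_flag.
    Returns the total number of retries that were needed.
    """
    # Snapshot taken after the initial reset: last-known error-free state.
    snapshot = copy.deepcopy(machine.state)
    total_retries = 0
    for _ in range(num_bursts):
        retries = 0
        while True:
            machine.error_flag = False
            run_burst(machine)                 # clocks on for one burst
            if not machine.error_flag:         # checkpoint: no errors found
                snapshot = copy.deepcopy(machine.state)
                break
            # Error found: reload the last-known error-free state.
            machine.state = copy.deepcopy(snapshot)
            if retries == max_retries:
                raise MachineCheck(
                    "error persisted after %d retries" % max_retries)
            retries += 1
            total_retries += 1
    return total_retries
```

In this model a soft error that disappears on retry is absorbed inside the loop and is never seen by the caller, which corresponds to end-user transparent recovery. An error that persists through max_retries retries surfaces as MachineCheck, which corresponds to the non-transparent machine-check condition.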
An important feature of checkpoint recovery is that it automatically corrects all sorts of soft errors. Special software does not have to be written for figuring out where in the machine each error occurred or what instruction was being executed when the error arose. Checkpoint recovery inherently provides coverage for all instructions and all errors that are correctable by way of retry. This is an advantageous property of checkpoint recovery.
Unfortunately, checkpoint recovery also comes with a major disadvantage. It inherently slows the computational speed of the computer. This is so because the computer halts at every checkpoint and waits for a snapshot of its machine state to be taken. Overall system performance suffers.
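As a rough illustration of this cost, suppose each run between checkpoints lasts N clock cycles and the machine is halted for the equivalent of S clock cycles while each snapshot is taken. The symbols N and S and the value chosen for S below are assumptions made only for this example; they are not measurements of any machine. The fraction of time spent halted is then:

```latex
\[
  f_{\text{halted}} = \frac{S}{N + S}
\]
```

With N = 1,000,000 cycles (the run length used in the example above) and an assumed S = 100,000 cycles, the fraction is 100,000 / 1,100,000, or roughly 9% of total time. This simple estimate ignores the time spent searching for raised error flags and any time lost to retries.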
A second form of automatic error recovery has been developed to overcome the performance shortfalls of checkpoint recovery. The second type of recovery is commonly referred to as "instruction retry". It may be found, by way of example, in mainframe computers belonging to the IBM 3090 family.
Instruction retry focuses on the stream of instructions that were most-recently executed by the computer's central processor unit (CPU). The start of each instruction execution is used as a marker for identifying the point in time where the machine first entered an error-infected state.
As instructions stream through the CPU, a record of the most-recently executed instructions is maintained. When an error is detected, it is associated with a particular one of the instructions held in that record. The state of the computer is stepped back to where it was just before the particular instruction was fetched and executed. The stream of subsequent machine operations is then retried.
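By way of illustration only, the instruction retry approach may be sketched in software as follows. The sketch is a simplified model under stated assumptions: the names InstructionRetryCPU, step_back, fault_at and add are hypothetical, the CPU state is reduced to one accumulator register, the program is straight-line (no branches), and a single soft error is injected at a chosen instruction and is assumed to be correctly associated with that instruction.

```python
from collections import deque
from typing import Callable, Deque, Dict, List, Optional, Tuple

Registers = Dict[str, int]
Instruction = Callable[[Registers], None]


class InstructionRetryCPU:
    def __init__(self, history_depth: int = 4) -> None:
        self.regs: Registers = {"acc": 0}
        # Record of the most-recently executed instructions, each stored
        # with the register state as it was just before execution.
        self.history: Deque[Tuple[int, Registers]] = deque(maxlen=history_depth)

    def step_back(self, faulty_pc: int) -> int:
        """Restore the state saved just before instruction faulty_pc."""
        while self.history:
            pc, saved = self.history.pop()
            if pc == faulty_pc:
                self.regs = dict(saved)
                return pc
        raise LookupError("instruction is no longer in the retry record")

    def execute(self, program: List[Instruction],
                fault_at: Optional[int] = None) -> int:
        """Run the program; inject one soft error after instruction fault_at.

        Returns the number of retries that were performed.
        """
        pc = 0
        retries = 0
        pending_fault = fault_at
        while pc < len(program):
            self.history.append((pc, dict(self.regs)))
            program[pc](self.regs)
            if pending_fault == pc:
                pending_fault = None          # soft error: does not recur
                self.regs["acc"] ^= 0xFF      # model corrupted state
                pc = self.step_back(pc)       # step back to instruction start
                retries += 1
                continue                      # retry from that instruction
            pc += 1
        return retries


def add(n: int) -> Instruction:
    """Hypothetical instruction: add n to the accumulator."""
    def op(regs: Registers) -> None:
        regs["acc"] += n
    return op
```

For example, executing the program [add(1), add(2), add(3)] with fault_at=1 performs one retry and still leaves the accumulator at 6. The sketch works only because the injected error is tied to a known instruction; an error that cannot be associated with any entry in the record gives step_back nothing meaningful to restore.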
A major drawback to the instruction retry approach is that not every error can be readily associated with a particular instruction. If, for example, an error occurs in the circuitry that is responsible for maintaining cache-to-mainstore coherency, and the error does not arise from an action initiated by a recent CPU instruction, there is no CPU instruction which can be specifically associated with the timing of the cache-coherency error. Also, modern machines have pipelined architectures wherein the execution flows of multiple instructions are moving down the pipeline at the same time. The instruction retry approach has to determine which of the concurrently executing instructions is the one that is to be retried. Stepping back and retrying an arbitrarily selected CPU instruction will not correct a cache-coherency error. It will merely slow down the CPU. The system is eventually forced to take a non-transparent machine check for each cache-to-mainstore related error after it is realized that numerous retries of the last CPU instruction do not clear the cache error. Similarly, a non-transparent machine check is eventually taken for all other errors that are not logically associable with a specific instruction. The end result is that users lose access to the machine even for errors which in theory should be self-correctable by the machine.
There is yet another drawback to the instruction retry approach. Specialized hardware is often necessary for resetting or stepping the machine state back out of each peculiar type of partially-executed or fully-executed instruction to the state it was in at the very start of that instruction. Due to cost and other considerations, machine designers tend to take short cuts and build step-back/retry capabilities into the machine only for the more commonly used instructions. Retry coverage is thus provided for only a small fraction (e.g., 15%) of all instructions which may be executed on the machine. Non-transparent machine checks must therefore be taken for all errors not associable with this small fraction of instructions, which is a disadvantage.
Moreover, there is a growing trend in the industry to improve computational speed through the use of parallelism. Parallel processors may be operating on data stored within a shared memory. If an error condition is detected within the shared memory, the conventional instruction-retry approach is left with the dilemma of not knowing which instruction of which parallel processor is to be retried.