Reliability of computer systems is a critical issue in such fields as telephone switching offices, air traffic control, nuclear power plants, and the like. The central processing unit (or "CPU", as it is known in the art) must be highly reliable in these applications; one "fault" or error in the instruction execution stream can cause damage far beyond the troubles that such faults cause in, for example, personal computers. Therefore, such applications require CPUs with fault recovery systems that can recover to a known state when faults occur.
Current fault recovery techniques assume a single execution stage CPU, that is, a CPU that executes one instruction at a time. FIG. 1 illustrates the typical instruction flow in a single execution stage CPU. In a single execution stage CPU, instruction N is executed during time period X. Next, instruction N+1 is executed during time period X+1. Instruction N+2 is executed during time period X+2, etc.
FIG. 2 illustrates how this serial instruction stream is interrupted by the occurrence of a fault, and by the subsequent fault recovery operation. During execution of instruction N in time period X, a fault occurs at time Y. Upon recognition of the fault, the CPU executes its fault recovery operation as the next operation, in time period X+1. When the fault recovery operation is completed, instruction N is executed again in time period X+2, and then instruction N+1 in time period X+3, and so forth.
One assumption made in the above scenario is that the single execution stage CPU detects and isolates faults with minimal latency, so that propagation of collateral damage resulting from the fault is also assumed to be minimized, and further that the execution of instruction N+1 is postponed until instruction N has executed successfully. One technique used in the art is to keep a log that records the operations and effects of the instruction stream as the instructions are being executed in the CPU. The fault log thus captures the state of the CPU while it was executing the instruction that incurred the fault. The fault log information is then used by the fault recovery operations to reconstruct deterministically the flow of execution that led up to the fault.
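The fault log technique described above can be sketched as follows. This is only a minimal illustration, assuming that the logged "operations and effects" are register writes and that recovery restores the values held before the faulting instruction began; the names `FaultLog`, `execute`, and the register names are hypothetical and are not taken from any particular CPU of the art.

```python
class FaultLog:
    """Records each register write so a faulting instruction can be undone."""

    def __init__(self):
        # Each entry is (instruction_id, register, old_value, new_value).
        self.entries = []

    def record(self, instruction_id, register, old_value, new_value):
        self.entries.append((instruction_id, register, old_value, new_value))

    def undo(self, instruction_id, registers):
        """Restore every register written by instruction_id, newest write first."""
        for entry_id, register, old_value, _new_value in reversed(self.entries):
            if entry_id == instruction_id:
                registers[register] = old_value
        # Drop the undone entries so the instruction can be logged afresh on retry.
        self.entries = [e for e in self.entries if e[0] != instruction_id]


def execute(instruction_id, writes, registers, log):
    """Apply the writes of one instruction, logging each effect before it lands."""
    for register, value in writes:
        log.record(instruction_id, register, registers[register], value)
        registers[register] = value


registers = {"r1": 1, "r2": 2}
log = FaultLog()
execute("N", [("r1", 10), ("r2", 20)], registers, log)  # suppose N faults here
log.undo("N", registers)  # recovery returns to the state before N
print(registers)  # {'r1': 1, 'r2': 2}
```

Because only one instruction is in execution at a time, a single log of this kind is sufficient in the single execution stage case: every entry written since the last successful instruction belongs to the faulting instruction.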
A further assumption made for a single execution stage CPU, and illustrated in FIG. 3, is that faults appearing in different parts of the instruction stream are separated in time and are thereby independent; thus, each fault can be safely dealt with separately. This is illustrated in FIG. 3 by the execution of instruction N during time period X. A first fault occurs at time Y. The fault recovery operation occurs during time period X+1. Instruction N is then repeated during time period X+2, and then instruction N+1 is executed during time period X+3. During the execution of instruction N+1, a second fault occurs at time Z. The second fault recovery operation occurs during time period X+4. Instruction N+1 is then repeated during time period X+5, and then instruction N+2 is executed during time period X+6. Thus, the requirement of minimal detection and isolation latency applies equally to each independent fault in each time-separated stage of execution. Likewise, each invocation of the fault recovery operation can safely assume that the single fault log has recorded the state of the CPU for each of the time-separated, independent faults. The fault recovery operation follows the same procedure to reconstruct deterministically the flow of execution that led up to each of these faults.
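The time-separated sequence described for FIG. 3 can be reproduced with a small timeline sketch. It assumes, for illustration only, that every execution attempt and every fault recovery operation occupies exactly one time period; the function `serial_timeline` and its `faults` argument are hypothetical names.

```python
def serial_timeline(instructions, faults):
    """Return (period_offset, label) pairs for a single execution stage CPU.

    `faults` maps an instruction's position in the stream to the number of
    times its execution faults before it finally succeeds.
    """
    timeline = []
    period = 0  # offset from time period X
    for index, name in enumerate(instructions):
        for _ in range(faults.get(index, 0)):
            timeline.append((period, f"{name} (fault)"))
            period += 1
            # Recovery runs as the very next operation; nothing else executes.
            timeline.append((period, "fault recovery"))
            period += 1
        timeline.append((period, name))
        period += 1
    return timeline


# One fault in N and one fault in N+1, as in the FIG. 3 description.
for offset, label in serial_timeline(["N", "N+1", "N+2"], {0: 1, 1: 1}):
    print(f"X+{offset}: {label}")
# X+0: N (fault)
# X+1: fault recovery
# X+2: N
# X+3: N+1 (fault)
# X+4: fault recovery
# X+5: N+1
# X+6: N+2
```

Each fault is fully handled before any later instruction starts, which is what allows the two faults to be treated as independent.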
A further requirement of computer systems used in telephony, power plants, etc., is speed. More and more critical tasks are being performed on such computer systems, thus requiring faster and more powerful CPUs. One solution to the speed problem is to use parallel processing CPUs, which can perform several operations on the instruction stream simultaneously. FIG. 4 illustrates the instruction flow in a CPU that supports multiple instructions in multiple execution stages during an overlapping time interval (known in the art as "pipelined"). In actual work accomplished, the operations executed in FIG. 4 are equivalent to the operations executed in FIG. 1. However, the elapsed time necessary to complete the tasks is reduced. FIG. 4 illustrates that, when instruction N finishes its execution in the first stage, it advances to the second execution stage and releases the first execution stage for use by instruction N+1. While an instruction is not complete until all tasks assigned to it are complete, the instruction's occupancy of an execution stage is terminated when the task performed by that stage is completed. A stage that has completed its task is then available for the next instruction. Thus, each stage may operate on a different instruction simultaneously.
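The stage occupancy described for the pipelined CPU can be sketched as a schedule table. This is an illustrative model only: it assumes every stage takes one cycle, that instructions advance in order without stalls, and that there are three stages; `pipeline_schedule` is a hypothetical name.

```python
def pipeline_schedule(instructions, stage_count):
    """Return one row per cycle listing the instruction in each stage (None if idle)."""
    total_cycles = len(instructions) + stage_count - 1
    schedule = []
    for cycle in range(total_cycles):
        row = []
        for stage in range(stage_count):
            # The instruction that entered the pipeline `stage` cycles ago
            # now occupies this stage.
            index = cycle - stage
            in_range = 0 <= index < len(instructions)
            row.append(instructions[index] if in_range else None)
        schedule.append(row)
    return schedule


schedule = pipeline_schedule(["N", "N+1", "N+2"], stage_count=3)
for cycle, row in enumerate(schedule):
    print(cycle, row)
# 0 ['N', None, None]
# 1 ['N+1', 'N', None]
# 2 ['N+2', 'N+1', 'N']
# 3 [None, 'N+2', 'N+1']
# 4 [None, None, 'N+2']
```

Under these assumptions, three instructions finish in 5 cycles, whereas passing each instruction through all three stages before starting the next would take 3 × 3 = 9 cycles. The row for cycle 2 shows all three stages working on different instructions at once.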
Fault recovery of a pipelined CPU is more complex, because the execution of instruction N+1 (and perhaps even N+2) may or may not be postponed by the occurrence of a fault in instruction N. This scenario is illustrated in FIG. 5. If instruction N encounters a fault at time Y and then executes the fault recovery operation, and if instructions N+1 and N+2 are not postponed by this fault, then there is a strong possibility that instructions N+1 and/or N+2 (or later instructions) will hide the damage caused by the fault in instruction N, because damaged data from instruction N is likely to be used in instruction N+1 and/or N+2. Further, these later instructions may also have faults of their own.
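The way later instructions can consume or hide damaged data may be illustrated with a simple taint-tracking sketch. It is a simplification under stated assumptions: each instruction is reduced to the sets of registers it reads and writes, effects are applied in program order, and `in_flight` is the number of later instructions that were not postponed by the fault. The function name and the example program are hypothetical.

```python
def collateral_damage(program, faulting_index, in_flight):
    """Return (indices of later instructions that used damaged data,
    registers still holding damaged data when recovery begins).

    `program` is a list of (reads, writes) pairs of register-name sets.
    """
    # Everything written by the faulting instruction is treated as damaged.
    tainted = set(program[faulting_index][1])
    damaged = []
    last = min(len(program), faulting_index + 1 + in_flight)
    for index in range(faulting_index + 1, last):
        reads, writes = program[index]
        if tainted & set(reads):
            # Damaged input: the instruction's outputs become damaged too.
            damaged.append(index)
            tainted |= set(writes)
        else:
            # A clean overwrite replaces (and thereby hides) a damaged value.
            tainted -= set(writes)
    return damaged, tainted


program = [
    ({"r0"}, {"r1"}),  # N: faults while writing r1
    ({"r1"}, {"r2"}),  # N+1: reads damaged r1, writes r2
    ({"r3"}, {"r1"}),  # N+2: overwrites r1 with clean data
]
print(collateral_damage(program, faulting_index=0, in_flight=2))
# ([1], {'r2'})
```

In this example the register that instruction N damaged (r1) looks clean again by the time recovery runs, while the damage has moved into r2 through instruction N+1. A recovery operation that reconstructs only the state of instruction N would not see it.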
In FIG. 5, which is an illustration of the current art, fault recovery operations reconstruct the flow of instruction N leading up to the first fault, but not the states of the instructions following this fault and prior to the actual execution of the fault recovery operation itself. Thus, a significant amount of collateral damage may be missed. Such an approach is taken because it is assumed that damage following the initial fault and prior to the actual execution of the fault recovery operation is coincidental and of low frequency. This assumption also implies that, if collateral damage is severe enough to lead to system-affecting incidents, it will be detected and corrected by more drastic fault recovery steps when and if such collateral damage occurs. Such fault recovery is not acceptable for multiple execution stages in mission-critical applications, such as those listed above.
Therefore, a problem in the art is that there is no CPU that can provide both the speed of multiple execution stages and the security of complete fault recovery found in single execution stage systems of the prior art.