The present invention generally relates to data processing, and more particularly, to fault recovery in a data processing system.
A popular design for central processing units is reduced instruction set computer (RISC) processors using a pipeline architecture. With pipeline architecture, the tasks performed by a processor are broken down into a sequence of functional units referred to as stages or pipeline stages. Each functional unit receives one or more inputs from the previous stage and produces one or more outputs which may then be used by the subsequent stage. Thus, one stage's output is usually the next stage's input. Consequently, all of the stages are able to work in parallel on different, although typically sequential, instructions in order to provide greater throughput.
Typical stages of a RISC processor pipeline include instruction fetch, register fetch, arithmetic execution, and write-back to registers. In order to improve performance, the pipeline receives a continuous stream of instructions fetched from sequential locations in memory using addresses that are typically stored in a program counter or other suitable device. When several instructions are being concurrently processed in the pipeline and each pipeline stage is performing its designated task, a single instruction can be executed approximately every clock cycle. This design offers greater efficiency than other architectures, such as complex instruction set computer (CISC) architectures, which require more than one clock cycle to execute an instruction.
Because of its many advantages, only a few of which are discussed above, the RISC architecture enjoys a wide variety of applications including those with safety critical implications such as health care, transportation, military, space, and some manufacturing environments.
The increased reliance on RISC processor-based automated data processing systems in safety critical applications raises the need for such systems to be dependable, that is, to perform their expected task(s) correctly with a high degree of confidence. Design for dependability is one of the many drivers that define the specifications of the RISC processor-based system. Fault avoidance, removal, and tolerance are three approaches that improve system dependability. Fault avoidance is usually achieved by the processes and methods used to generate the design of the system, such as adherence to proven design and development processes or the use of formal methods to validate the correctness of a design. Fault removal is usually achieved by extensive system testing. A fault is removed once it is discovered during system test. Fault tolerance is achieved by incorporating features in the design that enable continued correct system operation in spite of the occurrence of a fault.
A fault may be permanent or transient. A permanent fault is one that causes the RISC processor's behavior to permanently deviate from its specifications, and typically requires human intervention to ameliorate its effect. A transient fault, on the other hand, causes the behavior deviation for a limited time period. The processor typically resumes its behavior as specified once the cause of the fault disappears and the effect of the fault is removed from the system.
Transient faults are typically caused by an event in the processor's physical environment. For example, in an industrial application, many transient faults are due to the electrically noisy environment, where equipment switching causes voltage spikes that impact the processor's power supply and thus cause a transient fault in the microelectronics circuitry that makes up the processor. A single event upset (SEU) is yet another cause of transient faults in the microelectronics circuitry of a RISC processor. SEUs are usually caused by a natural or man-made radiation particle that, while traveling through space, changes the state of a processor by altering its memory content, such as a bit in one of its data or control registers. Once the radiation particle passes through the circuitry, it no longer affects the microelectronics device. In either of these cases, as well as others, the transient fault may cause the processor to exhibit an error in its processing. In safety critical applications, declaring a processor as permanently failed due to a transient fault may not be a suitable course of action for reasons such as the lack of spare processors to continue operation. This is particularly true in space applications where processors are expected to operate for an extended time period to justify the cost of the mission. Thus, given the existence of transient faults in certain computing environments, it is desirable to be able to detect and recover from transient faults as quickly and as efficiently as possible so that the performance of the processor is not significantly hampered or degraded.
The impact on the performance of a processor from a transient fault depends upon the overhead associated with recovery. Two factors which largely control the overhead of transient fault recovery are: (1) the time spent to continually gather the data necessary in anticipation of recovering from a transient fault, and (2) the actual recovery time, i.e., the time it takes the processor to remove the effect of the fault from its memory and to be ready to resume correct operation. Following are discussions of several techniques used for transient fault recovery.
A relatively common technique for transient fault recovery is checkpoint retry, in which the current state data of a program is saved in a memory cache at various points in the execution of the program code. These points are referred to as checkpoints. Checkpoints are taken at the software level, where the program is modified to permit the capture of checkpoints and the rollback to a suitable checkpoint during recovery. Typically, only the values of program variables that changed since the last checkpoint are stored at the next checkpoint. When an error is detected, the program state is restored (also referred to as rolled back) to the last checkpoint that preceded the error in the instruction stream. The amount of rollback necessary to reach the nearest checkpoint is called the rollback distance. The rollback distance may be measured by the number of instructions whose effect must be nullified to reach the nearest checkpoint. Execution is resumed from the checkpoint once the program state is restored from the data stored at the checkpoint. A drawback to this technique is the complexity of the code necessary to allow the data to be gathered at each checkpoint. Another drawback is the relatively high overhead on system performance. The performance overhead of checkpoint retry is largely due to the overhead required for storing the data associated with all the instructions between consecutive checkpoints. This same data is also restored during an actual recovery which, likewise, is time consuming. Further, if more frequent checkpoints are used in order to reduce the amount of data which must be stored at every checkpoint and then restored in case of an error, then more of the processor's time is spent performing error checking. In computing applications requiring control based on precise time intervals, the recovery time spent error checking and rolling back to a checkpoint can be difficult to determine a priori.
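By way of illustration only, the checkpoint retry technique described above may be sketched in software as follows. The sketch is written in Python and is a simplified model, not an implementation taken from any particular system; the class name `CheckpointedProgram` and its methods are hypothetical. It assumes that program state is a set of named variables and that only variables written since the last checkpoint are copied when the next checkpoint is taken.

```python
class CheckpointedProgram:
    """Hypothetical model of software checkpoint retry over named variables."""

    def __init__(self, initial):
        self.vars = dict(initial)    # live program state
        self.saved = dict(initial)   # state as of the last checkpoint
        self.dirty = set()           # variables written since the last checkpoint

    def write(self, name, value):
        self.vars[name] = value
        self.dirty.add(name)

    def checkpoint(self):
        # Store only the variables that changed since the last checkpoint.
        for name in self.dirty:
            self.saved[name] = self.vars[name]
        self.dirty.clear()

    def rollback(self):
        # Nullify the effect of every write made since the last checkpoint.
        for name in self.dirty:
            if name in self.saved:
                self.vars[name] = self.saved[name]
            else:
                del self.vars[name]  # variable did not exist at the checkpoint
        self.dirty.clear()


prog = CheckpointedProgram({"x": 1, "y": 2})
prog.write("x", 10)
prog.checkpoint()      # x = 10 is now part of the saved state
prog.write("x", 99)    # suppose an error is detected after these writes
prog.write("z", 5)
prog.rollback()
print(prog.vars)       # {'x': 10, 'y': 2}
```

In this model the work done at each checkpoint and at each rollback grows with the number of variables written between checkpoints, which reflects the storage and restoration overhead discussed above.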
Finally, in environments where the processor's next task depends upon changes in its physical environment, such as the firing of a jet to correct a spacecraft's attitude or the reaction to a change in the state of a stage in a manufacturing assembly line, recovery times must be bounded to prevent the processor from reacting to a set of environmental conditions that does not truly reflect the processor's physical environment. The analysis and determination of proper recovery time bounds is very difficult.
Another recovery technique referred to as instruction retry is a variation on the checkpoint retry scheme in that the rollback distance is reduced to one instruction. In essence, a checkpoint is obtained prior to the execution of an instruction. The output of the processor is checked for correctness after the execution of the instruction. A detected error causes a checkpoint retry. The processor's state is restored to that which it was prior to the execution of the instruction, and the instruction is fetched again from the instruction memory for a re-execution. This approach minimizes the amount of data saved at every checkpoint by saving the values of variables that would be changed through the execution of the next instruction. The error detection mechanism is typically a comparator that compares the processor's outputs to those of a redundant processor in a master/checker (or duplicate) configuration. The processor's internal memory and register devices that would be affected by the upcoming instruction execution must also be saved by the checkpoint in order to restore the processor's execution environment correctly after rollback. While the recovery time for this technique is very short, its performance overhead is high. The program state data must be saved prior to the execution of an instruction. The output of every instruction is validated. The program's state is restored once recovery is initiated in reaction to a detected error. Establishing a checkpoint and validating every instruction reduces the processor's throughput.
Another recovery technique is to use a hardware-based checkpoint retry mechanism, also referred to as a micro-rollback mechanism. This technique is similar to the checkpoint retry techniques discussed above except that additional hardware is added for automating the storage of state information and data within the processor. Consequently, a disadvantage to a hardware checkpoint retry mechanism is that it utilizes valuable on-chip space for the additional hardware required, essentially denying its use to enhance the processor's functionality and deliver maximum performance. Further, if the error latency is high, more hardware is required to implement the micro-rollback mechanism because more processor data and state information is used and modified after the execution of the instruction in which an error occurs, but before the error is detected. Hardware-based checkpoint retry mechanisms are further described in numerous publicly available writings such as, for example, Y. Tamir et al., "The Implementation and Application of Micro Rollback in Fault-Tolerant VLSI Systems," 18th Fault-Tolerant Computing Symposium, Tokyo, Japan, pp. 234-239, June 1988, where micro rollback is applied to a RISC processor.
The checkpoint retry technique and all its variants implement a recovery strategy that commits the results of one or more computational steps or instructions and then reacts by rolling back their effect once an error is detected.
Forward error recovery is another recovery technique that does not rely on restoring the state of a program to one of its previous states, as captured by a checkpoint. Forward error recovery techniques reset the program state to a predetermined initial state based on the kind and location of an error in the code. This technique reduces the performance overhead due to the absence of checkpoint and restoration activities. However, it does introduce uncertainty in the robustness of recovery. The risk is that resetting the program state may not be appropriate for recovering from the particular error. It is very difficult to determine the proper reset data in reaction to every possible error, unless the system is trivially simple. Recovery is typically managed at the program level.
Therefore, a heretofore unresolved need existed in the industry for a recovery system and method that provides improved recovery from transient faults, such as in a pipelined RISC processor, with minimal performance and hardware overhead.
It is therefore an object of the present invention to provide improved transient fault recovery.
It is another object of the present invention to provide improved transient fault recovery by committing the results of a computation after it is verified to be correct.
It is yet another object of the present invention to reduce the recovery time from transient faults in a reduced instruction set computer (RISC) processor.
It is yet another object of the present invention to provide transient fault recovery systems and methods that can be easily integrated in a RISC processor with minimum hardware and performance penalty.
These and other objects of the present invention are provided by a transient fault recovery system. Processor state changes based on the execution of an instruction are not committed until the instruction is validated. The processor state data related to an instruction, i.e., the instruction state data, are saved as the instruction moves through the pipeline stages. The instruction state data must contain all data necessary to enable the re-execution of the instruction beginning with the first pipeline stage, i.e., the instruction fetch stage in the RISC processor. If an error is detected in the execution of an instruction, then a copy of that instruction is retrieved using the address used to fetch the instruction previously, i.e., the instruction fetch address. The pipeline's upstream stages, i.e., the pipeline stages that are processing instructions that have been fetched subsequent to the instruction that is found to be in error, are purged. The processor is then restarted after its state is reset to a state where the execution can resume without the effect of the transient fault.
The verification of the correct execution, i.e., the validation, of an instruction is performed by one or more error detection mechanisms at every pipeline stage by using appropriate error detection techniques. For example, a simple parity check may be sufficient to validate the correct processing of the instruction by the instruction fetch stage. Parity check may also be sufficient to validate the instruction at the output of the register fetch stage. The output of the execution stage may also be validated by using a master/checker (or duplicate) configuration of the arithmetic transform operators, e.g., the arithmetic logic unit (ALU), and comparing the results. Alternatively, a checker with reduced functionality may be used to minimize the amount of hardware needed and provide an acceptable level of error detection. The amount of acceptable error detection logic at each pipeline stage depends on many factors, including the desired level of fault coverage and the amount of physical space available on the microcircuit to incorporate the logic. Detecting an error during validation will terminate the current execution of the instruction and will not commit any of its results.
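As a minimal sketch of the simple parity check mentioned above, the following Python fragment computes a single even-parity bit over an instruction word and uses it to validate the word at the output of a stage. The function names and the example instruction word are illustrative assumptions and do not come from the description.

```python
def parity_bit(word: int) -> int:
    """Return 1 if the non-negative word has an odd number of set bits, else 0."""
    return bin(word).count("1") & 1


def is_valid(word: int, stored_parity: int) -> bool:
    """Validate a word against the parity bit that was stored alongside it."""
    return parity_bit(word) == stored_parity


instr = 0x8C410004                    # arbitrary 32-bit instruction word
stored = parity_bit(instr)            # parity generated when the word is written

upset = instr ^ (1 << 7)              # single bit flipped, e.g., by an SEU
double_upset = instr ^ 0b11           # two bits flipped

print(is_valid(instr, stored))        # True
print(is_valid(upset, stored))        # False: the error is detected
print(is_valid(double_upset, stored)) # True: the error escapes detection
```

A single parity bit detects any odd number of flipped bits but not an even number, which is one reason the acceptable amount of error detection logic at each stage depends on the desired level of fault coverage.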
The validation at each pipeline stage may take place within that stage, if it can be accomplished within the remaining part of the clock period used by the stage. For example, if the processor's clock period is N units of time and part of this period, say M units of time where M < N, is consumed by the stage to generate its output(s), then the remaining units of time, i.e., N − M, can be used to validate these outputs. However, if the number of remaining units of time, i.e., N − M, is not sufficient for the validation, then enhancing the RISC pipeline with additional pipeline stages dedicated to validating the output of every stage may be necessary. In the worst case, these validation stages may be introduced between the instruction fetch and register fetch stages, between the register fetch and the execution stages, and/or between the execution and write-back stages.
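The timing condition above can be made concrete with a short Python sketch. All delay values are invented for illustration, and the function name is hypothetical; the sketch merely checks, for each of the first three stages of a four-stage pipeline, whether validation fits within the slack N − M or whether a dedicated validation stage would be needed.

```python
def needs_validation_stage(clock_period, stage_delay, validation_delay):
    """Return True if validation does not fit in the slack N - M of the stage."""
    if stage_delay >= clock_period:
        raise ValueError("stage delay M must be less than clock period N")
    slack = clock_period - stage_delay          # N - M
    return validation_delay > slack


N = 10  # clock period, in arbitrary time units (assumed value)

# stage name -> (M, validation time); assumed values for illustration only
stages = {
    "instruction fetch": (4, 2),
    "register fetch": (5, 2),
    "execution": (9, 3),
}

extra = [name for name, (m, v) in stages.items()
         if needs_validation_stage(N, m, v)]
pipeline_depth = 4 + len(extra)

print(extra)           # ['execution']
print(pipeline_depth)  # 5
```

With these assumed numbers only the execution stage lacks sufficient slack, so the four-stage pipeline grows to five stages; if all three stages lacked slack, it would grow to seven.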
In an embodiment of this invention, only the execution stage may need to be followed by a dedicated validation stage. This is due to the complex logic used by the arithmetic transform operators within the execution stage. A substantial part of the processor's clock period is consumed by this logic to generate results. This is unlike the instruction and register fetch stages where sufficient time would typically be left for the validation within the processor's clock period. Naturally, selecting a clock period that is optimal between the number of pipeline stages and the ability to validate a stage's output within the processor's clock period is critical to determining the processor's effective throughput. A sufficiently long processor clock period allocated to that stage eliminates the need for an additional validation stage, but may reduce the processor's throughput. A sufficiently short processor clock period would, in the worst case, require the addition of a validation stage after the instruction fetch, register fetch, and/or execution stages. The pipeline grows longer as the number of its stages increases. For example, a four-stage pipeline may grow to as many as seven stages. Therefore, the processor's effective bandwidth may be reduced. This is because of the effect of the execution of branch instructions on the pipeline's stages, as can be recognized by those skilled in the art.
In particular, a data processing system for re-executing an instruction if an error is detected in the prior execution of that instruction comprises the program counter value used to fetch the instruction, i.e., the instruction fetch address. It also includes a memory device referred to herein as the pipeline history cache, which contains the instruction fetch address. The data processing system may also include circuitry to detect errors such as parity code checkers or arithmetic transform operators (e.g., the Arithmetic Logic Unit (ALU)). Alternatively, the arithmetic transform operator circuitry of the data processing system may contain logic for at least partially executing the instruction.
In an embodiment of this invention, the error detection technique used in the execution stage relies on using master/checker configurations of the arithmetic transform operators where a comparator is necessary to validate the instruction prior to committing its results through the write-back stage. The comparator may compare the results of the executed instruction with reference results from the checker. The master/checker pair may be included in the chip's circuitry but would be located far apart to prevent common mode failures. It is noted that the checker may be another processor, provided the processor's clock period is sufficiently long to permit the data from the master and checker to travel to the comparator and for the comparator to produce the result. Error recovery logic is also necessary to reset the program counter to the instruction's fetch address stored in the pipeline history cache. As such, the instruction under execution, at the time an error is detected, can again be fetched and processed through the pipeline's stages.
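The master/checker comparison described above may be modeled in software as in the following Python sketch. It is an assumption-laden illustration: the `alu` function supports only three operations on 32-bit values, and `make_upset_alu` is a hypothetical helper that simulates a transient fault by flipping one bit of the first result it produces.

```python
MASK32 = 0xFFFFFFFF


def alu(op, a, b):
    """Minimal 32-bit arithmetic transform operator (illustrative subset)."""
    if op == "add":
        return (a + b) & MASK32
    if op == "sub":
        return (a - b) & MASK32
    if op == "and":
        return a & b
    raise ValueError(f"unsupported op: {op}")


def make_upset_alu(flip_bit):
    """Return an ALU whose first result only has one bit flipped."""
    pending = [True]

    def faulty(op, a, b):
        result = alu(op, a, b)
        if pending[0]:
            pending[0] = False
            return result ^ (1 << flip_bit)   # simulated transient fault
        return result

    return faulty


def compare(master_result, checker_result):
    """Comparator: results may be committed only if both copies agree."""
    return master_result == checker_result


master = make_upset_alu(flip_bit=3)
checker = alu

first_ok = compare(master("add", 5, 7), checker("add", 5, 7))
retry_ok = compare(master("add", 5, 7), checker("add", 5, 7))
print(first_ok, retry_ok)   # False True
```

The comparator reports only that the two copies disagree, not which copy is in error, which is why recovery proceeds by re-fetching and re-executing the instruction rather than by selecting one of the two results.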
In accordance with an aspect of the present invention, the transient fault recovery system may abort further execution of the instruction if an error is detected when validating the execution of the instruction. Further, the transient fault recovery system aborts the execution of all instructions in the pipeline subsequent to the instruction where an error is detected.
In yet another aspect of the present invention, a method of error recovery in a data processing system having a memory device, such as a pipeline history cache typically formed by a first-in, first-out (FIFO) cache, comprises the steps of fetching an instruction using an instruction fetch address, storing the instruction fetch address in the memory device, at least partially executing the instruction, and validating the execution of the instruction prior to implementing a state change based upon the execution of the instruction. Moreover, if an error is detected in the execution of the instruction, then the method may also comprise the steps of fetching the instruction utilizing the instruction fetch address stored in the memory device and re-executing the fetched instruction. The step of validating the execution of the instruction may comprise the step of comparing the results from the execution of the instruction with reference results from, for instance, a redundant microcircuit.
In addition, the method of the present invention may also advantageously include the step of aborting the execution of the instruction if an error is detected in the step of validating the execution of the instruction. Similarly, the method may include the step of aborting the execution of instructions subsequent to the current instruction in the pipeline if an error is detected in the step of validating the execution of the current instruction.
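The method steps summarized above may be illustrated with the following Python sketch, which is a behavioral model under stated assumptions rather than a description of the actual circuitry. The function `run_pipeline`, its parameters, and the retry limit `max_retries` are hypothetical; the FIFO here holds each in-flight instruction's fetch address together with its uncommitted result, a simplification made for the simulation. On a validation error the faulty instruction's result is discarded, younger in-flight instructions are purged, and the program counter is reset to the stored fetch address.

```python
from collections import deque


def run_pipeline(program, execute, validate, depth=4, max_retries=3):
    """Commit each result only after validation; returns (committed, retries)."""
    history = deque()   # FIFO of (fetch address, uncommitted result)
    committed = []      # results committed in program order
    pc = 0
    retries = 0         # total recoveries across the whole program
    while pc < len(program) or history:
        # Fetch and execute until the modeled pipeline is full.
        while pc < len(program) and len(history) < depth:
            history.append((pc, execute(program[pc])))
            pc += 1
        # The oldest in-flight instruction reaches validation.
        fetch_addr, result = history.popleft()
        if validate(program[fetch_addr], result):
            committed.append(result)          # state change is committed
        else:
            retries += 1
            if retries > max_retries:         # assumed guard, not in the text
                raise RuntimeError("error persists; fault may not be transient")
            history.clear()                   # purge subsequent instructions
            pc = fetch_addr                   # re-fetch from the stored address
    return committed, retries


calls = {"n": 0}


def execute(instr):
    calls["n"] += 1
    a, b = instr
    result = a + b
    if calls["n"] == 2:       # simulated transient fault on the second execution
        result ^= 1
    return result


def validate(instr, result):
    a, b = instr
    return result == a + b    # stands in for a checker's reference result


program = [(1, 2), (3, 4), (5, 6), (7, 8)]
committed, retries = run_pipeline(program, execute, validate)
print(committed, retries)     # [3, 7, 11, 15] 1
```

In this run the second instruction fails validation once, the two younger instructions behind it are purged, and all three are fetched and executed again, so no erroneous result is ever committed.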
Other features and advantages of the present invention will become apparent to one skilled in the art upon examination of the following drawings and detailed description. It is intended that all such features and advantages be included herein within the scope of the present invention, as defined by the appended claims.