1. Field of the Invention
The present invention relates to a data processing apparatus and method for providing fault tolerance when executing a sequence of data processing operations.
2. Description of the Prior Art
Many applications for modern data processing systems require mechanisms to be put in place to detect occurrences of faults. For example, many safety critical applications require data processing systems with in-built fault tolerance to ensure any errors in operation are quickly detected. Within a data processing system, both permanent and transient faults may occur. In particular, as feature sizes shrink, the reduced pitch and wire width can significantly increase the probability of occurrence of an undesired short or open circuit, causing a permanent fault in a system.
Similarly, transient faults, also called single event upsets (SEUs), may occur due to electrical noise or external radiation. Radiation can, directly or indirectly, induce localised ionisation events capable of upsetting internal data states. Although the upset causes a data error, the circuit itself is undamaged, so the system experiences only a transient fault. The resulting data corruption is called a soft error, and detection of soft errors is of significant concern in safety critical applications.
A data processing system will typically comprise processing circuitry for performing a sequence of data processing operations, and one or more storage structures used to store data manipulated by the data processing circuitry during the execution of those data processing operations. One known technique for providing fault tolerance against permanent or transient errors is to employ redundancy within the data processing system, as for example illustrated schematically in FIG. 1.
As shown, in addition to the processing circuitry 10, a redundant copy of the processing circuitry 20 is provided. Both the processing circuitry 10 and the redundant copy 20 execute the same code, and accordingly perform the same sequence of data processing operations. One way of operating such a data processing apparatus is in a lock-step architecture, as for example described in the article “Fault-Tolerant Platforms for Automotive Safety-Critical Applications” by M Baleani et al, Proceedings of the 2003 International Conference on Compilers, Architecture and Synthesis for Embedded Systems, Oct. 30 to Nov. 1, 2003, San Jose, Calif., Pages 170 to 177. In accordance with a lock-step architecture, both the processing circuitry 10 and the redundant copy 20 execute the same code and are strictly synchronised so as to execute the code at the same rate, with or without a fixed timing offset. The processing circuitry 10 (often referred to as the master) has access to the system memory and drives all system outputs, whilst the redundant copy 20 (also referred to as the checker) continuously executes the same instructions as the master, with the outputs produced by the checker being input to comparison logic that checks for consistency between the outputs from the master and the outputs from the checker. When these outputs do not match, this reveals the presence of a fault in either the processing circuitry 10 or the redundant copy 20, thereby alerting the system to the presence of a fault.
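By way of illustration only, the output comparison performed in a lock-step arrangement can be modelled in software. The following is a minimal sketch and not the circuitry of FIG. 1: the names `alu`, `make_faulty` and `run_lockstep` are hypothetical, the operation computed by `alu` is an arbitrary stand-in, and the injected single-bit fault is an assumption made purely to exercise the comparison.

```python
from typing import Callable, Iterable, Optional

Unit = Callable[[int], int]


def alu(x: int) -> int:
    """Hypothetical stand-in for one data processing operation (8-bit result)."""
    return (x * 3 + 1) & 0xFF


def make_faulty(unit: Unit, fault_step: int) -> Unit:
    """Wrap a unit so that bit 0 of its output is flipped on one chosen step."""
    step = 0

    def wrapped(x: int) -> int:
        nonlocal step
        result = unit(x)
        if step == fault_step:
            result ^= 0x01  # injected single-bit upset (assumption for the demo)
        step += 1
        return result

    return wrapped


def run_lockstep(master: Unit, checker: Unit,
                 inputs: Iterable[int]) -> Optional[int]:
    """Feed both units the same inputs; return the first step whose outputs differ."""
    for step, x in enumerate(inputs):
        if master(x) != checker(x):
            return step  # mismatch: a fault exists in one of the two units
    return None  # outputs agreed on every step


if __name__ == "__main__":
    print(run_lockstep(alu, alu, range(5)))                  # fault-free pair
    print(run_lockstep(alu, make_faulty(alu, 2), range(5)))  # upset on step 2
```

As in the arrangement described above, a mismatch indicates only that one of the two units is faulty; the comparison alone cannot identify which one.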
Such comparison circuitry does not detect bus and memory errors, which can in fact be a source of common-mode failure causing both the processing circuitry 10 and the redundant copy 20 to fail in the same way. Accordingly, as shown in FIG. 1, the bus 35 and storage structures 30 (such as the memory) can be protected against faults by deploying error detection (and correction) techniques such as error correcting codes (ECCs).
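The text above does not specify which error correcting code is used. As one possible example, the sketch below uses a Hamming(7,4) code, chosen here only because it is compact; the function names are hypothetical. Each 4-bit value is stored with three parity bits, and on read-back a non-zero syndrome gives the position of a single flipped bit, which is then corrected.

```python
from typing import List, Tuple


def hamming74_encode(nibble: int) -> List[int]:
    """Encode a 4-bit value into 7 bits (positions 1..7, parity at 1, 2 and 4)."""
    code = [0] * 8  # index 0 unused so that indices match bit positions
    code[3] = (nibble >> 0) & 1
    code[5] = (nibble >> 1) & 1
    code[6] = (nibble >> 2) & 1
    code[7] = (nibble >> 3) & 1
    code[1] = code[3] ^ code[5] ^ code[7]  # covers positions 1, 3, 5, 7
    code[2] = code[3] ^ code[6] ^ code[7]  # covers positions 2, 3, 6, 7
    code[4] = code[5] ^ code[6] ^ code[7]  # covers positions 4, 5, 6, 7
    return code[1:]


def hamming74_decode(bits: List[int]) -> Tuple[int, int]:
    """Return (corrected 4-bit value, syndrome); syndrome 0 means no error seen."""
    code = [0] + list(bits)
    syndrome = 0
    for pos in range(1, 8):
        if code[pos]:
            syndrome ^= pos  # XOR of the positions of all set bits
    if syndrome:
        code[syndrome] ^= 1  # a single-bit error sits at position `syndrome`
    nibble = code[3] | (code[5] << 1) | (code[6] << 2) | (code[7] << 3)
    return nibble, syndrome


if __name__ == "__main__":
    stored = hamming74_encode(0b1011)
    stored[4] ^= 1  # simulate a soft error at position 5
    print(hamming74_decode(stored))
```

This particular code corrects any single-bit error per 7-bit word but will miscorrect a double-bit error; practical memories commonly add an overall parity bit to detect that case.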
The benefits of the redundant circuitry approach illustrated in FIG. 1 are that it provides robust fault tolerance, is simple to build (in that the redundant copy 20 is merely a complete replica of the processing circuitry 10), and the fault detection has no speed impact on the operation of the processing circuitry 10. However, one disadvantage of such an approach is that it requires a relatively large area (due to the need for the redundant copy 20), and is costly in terms of power consumption, due to the operation of the redundant copy 20. Issues can also arise with regard to the timing requirements to keep the processing circuitry 10 and the redundant copy 20 in lock-step.
An alternative approach to fault tolerance is described in the articles “A Fault Tolerant Approach to Microprocessor Design” by C Weaver et al, Dependable Systems and Networks (DSN), July 2001, and “DIVA: A Reliable Substrate for Deep Submicron Microarchitecture Design” by T Austin, University of Michigan, appearing in MICRO 32: Proceedings of the 32nd annual ACM/IEEE international symposium on Microarchitecture, 1999, pages 196 to 207, Haifa, Israel. In particular, both of these papers describe a testing approach called dynamic verification, where a checking mechanism is inserted into the retirement stage of a complex microprocessor. In accordance with the described approach, a core processor employing a high degree of speculative execution executes a sequence of instructions, and when those instructions have been completed, their input operands and results are sent in program order to the checking mechanism, referred to therein as a checker processor. The checker processor follows the core processor, verifying the activities of the core processor by re-executing all program computations in its wake. However, the high-quality stream of predictions from the core processor serves to simplify the design of the checker processor and speed its processing. In particular, the checker processor can perform many of the operations in parallel, since by the time the checker processor re-executes all of the program computations performed by the core processor, all processing hazards have been eliminated and hence the checking process can execute without speculation.
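The re-execution step of dynamic verification can be sketched as follows. This is a simplified model under stated assumptions, not the checker design of the cited papers: the `Retired` record layout, the `OPS` table and the function `check_retired` are hypothetical, and only the recomputation of results from the forwarded operands is modelled. Because each retired record carries its own input operands, each check is independent of the others, which is what allows a checker to process records without speculation.

```python
import operator
from typing import Callable, Dict, Iterable, List, NamedTuple


class Retired(NamedTuple):
    """One completed instruction as forwarded by the core at retirement."""
    op: str
    a: int
    b: int
    result: int  # the value the core claims to have computed


# Hypothetical operation table for the checker (assumption for this sketch).
OPS: Dict[str, Callable[[int, int], int]] = {
    "add": operator.add,
    "sub": operator.sub,
    "and": operator.and_,
    "xor": operator.xor,
}


def check_retired(stream: Iterable[Retired]) -> List[int]:
    """Re-execute each record in program order; return indices that disagree."""
    mismatches = []
    for index, record in enumerate(stream):
        expected = OPS[record.op](record.a, record.b)
        if expected != record.result:
            mismatches.append(index)  # the core's result is not reproduced
    return mismatches


if __name__ == "__main__":
    stream = [
        Retired("add", 2, 3, 5),
        Retired("sub", 9, 4, 5),
        Retired("xor", 6, 3, 4),  # incorrect on purpose: 6 ^ 3 is 5
    ]
    print(check_retired(stream))
```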
When compared with the earlier-described approach using replicated processing circuitry executing in lock-step with the main processing circuitry, such an approach can result in a smaller area and reduced power consumption, due to the reduction in complexity of the checker processor. Further, since the design of the checker processor is entirely different to that of the core processor, there is the potential for detecting additional faults that might not be spotted by purely replicated processors. In addition, some of the timing complexities can be reduced due to the checker processor's operation following that of the core processor. However, designing such a core processor and associated checker processor is a complex task, due to the need to separately design the checker processor in addition to the core processor, and this complexity will preclude the use of such an approach in many applications.
Another known approach is the reduced-area, redundant CPU system (fault-robust (fR) CPU) produced by Yogitech, where the fault distribution and effects are analysed within a CPU, and then a checker CPU is produced which is customised for the particular application and which generates and compares the results required for high fault coverage. As with the earlier-described dynamic verification approach, the resulting system may be more efficient in area and power consumption terms than a pure replicated CPU approach, but significant work is required to analyse the fault distribution and effects of the CPU and to design the resultant checker CPU.
It would be desirable to develop a fault tolerant system which retained the simplicity of utilising the redundant copy of processing circuitry to provide fault tolerance, but which provided reduced area and power consumption when compared with known redundant copy techniques.