1. Field of the Invention
The present invention relates to a data processing apparatus and method for detecting an approaching error condition at a time before an actual error occurs within the data processing apparatus.
2. Description of the Prior Art
The scaling of the size of components within data processing systems has long been a source of dramatic performance gains. In particular, developments in CMOS technology have enabled ever smaller feature sizes to be realised, allowing smaller circuits to be built exhibiting increased performance. It is then also desirable to reduce the operating voltage of such circuits, so as to reduce power consumption and also decrease operating temperatures. However, the reduction in voltage levels has not been able to match the rate of feature size scaling because of limits in threshold voltage scaling, leading to increasing operating temperatures and current densities.
Further, as the size of the circuit elements is reduced, there has been an increase in variability in the components produced using the advanced CMOS technology now available, and as a result, on-chip variation has become a key factor in determining the performance and associated power consumption achievable within a data processing system.
As a result, it is common to employ margining methods during the timing analysis and sign-off of a data processing system design. These margining methods aim to account for process, voltage and temperature variations occurring both globally (i.e. affecting the elements in a chip in a correlated manner) as well as locally (i.e. affecting each portion of the chip differently). The margining techniques also aim to account for effects such as device mismatch, crosstalk, IR drop, ageing-related effects, as well as delays in the timing due to single event transients (also often referred to as single event upsets (SEUs)). The necessary margins are added either by analysing or optimising the design at tighter performance targets (i.e. a higher frequency of operation) or worse operating conditions (i.e. lower voltage and/or higher temperature conditions) than will actually occur in reality, so that when the apparatus is then used in the real environment, it can reliably operate at required performance levels and in required operating conditions. Alternatively, timing derating methods can be used to seek to account for the necessary margins, where a timing engine is used to derate various launch and capture paths within the design based on the on-chip variation. In particular, derating is generally performed by a tool, either at the cell or transistor level, that performs timing analysis. The timing path is scaled to account for on-chip (or across-chip) variation causing timing to vary due to process, temperature and voltage variations. Thus, the timing engine empirically budgets for larger delays through a path by assuming it to be longer than it actually computes. This “artificial” increase is called a timing-derate.
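Purely by way of illustration, the effect of a timing-derate can be modelled as follows. This is a toy sketch, not taken from any real timing tool; the stage delays, derate factor and clock period are invented values chosen only to show how a path that nominally meets timing can fail once the artificial derate is applied.

```python
# Illustrative sketch only: a toy model of timing-derating during static
# timing analysis. All delay values and the derate factor are hypothetical.

def derated_path_delay(stage_delays, derate=1.10):
    """Scale the computed delay of a launch/capture path by a derate factor,
    so the timing engine budgets for on-chip variation by treating the path
    as longer than it actually computes."""
    return sum(stage_delays) * derate

def path_meets_timing(stage_delays, clock_period, derate=1.10):
    """A path passes sign-off only if its artificially lengthened (derated)
    delay still fits within the clock period."""
    return derated_path_delay(stage_delays, derate) <= clock_period

# A path whose nominal delay (0.95 ns) fits a 1.0 ns clock period
# fails sign-off once derated to 1.045 ns.
delays = [0.30, 0.40, 0.25]  # ns per stage; hypothetical values
print(path_meets_timing(delays, clock_period=1.0))  # → False
```

This captures the pessimism the paragraph above describes: the design must be closed against the derated delay, so performance headroom that exists in the real silicon goes unused.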
Although such margining methods make the data processing system design robust against timing failures, they leave a significant amount of performance unexploited, unless one resorts to techniques such as speed binning during post-manufacture test/characterisation.
As process geometries shrink, the unacceptable performance and power impact of such pessimistic design margining has led to an increased interest in adaptive techniques. Adaptive techniques seek to eliminate a significant portion of safety margins by dynamically adjusting system parameters such as supply voltage, body bias, and operating frequency to account for variation in environmental conditions and silicon grade.
The traditional methods of adaptive design have used look-up tables or so-called “canary” circuits. In the look-up table based approach, the design is pre-characterised to obtain voltage and frequency pairs for which correct operation is guaranteed. This approach exploits periods of low CPU utilisation by dynamically scaling voltage and frequency, thereby obtaining energy savings. However, each operating point must be suitably margined to guarantee computational correctness in the worst-case combination of process, voltage and temperature (PVT) conditions.
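The look-up-table approach described above can be sketched as follows. This is an illustrative toy only; the (frequency, voltage) pairs are hypothetical pre-characterised operating points, each assumed to have been margined for the worst-case PVT combination, and the selection policy shown is the simplest possible one.

```python
# Illustrative sketch only: a toy look-up-table DVFS policy with
# hypothetical pre-characterised operating points.

# (frequency in MHz, supply voltage in volts), sorted by frequency;
# correct operation is guaranteed at each pair under worst-case PVT.
OPERATING_POINTS = [
    (200, 0.80),
    (400, 0.90),
    (600, 1.00),
    (800, 1.10),
]

def select_operating_point(required_mhz):
    """Pick the slowest pre-characterised (frequency, voltage) pair that
    still satisfies the current performance demand, so that periods of
    low CPU utilisation yield energy savings."""
    for freq, volt in OPERATING_POINTS:
        if freq >= required_mhz:
            return freq, volt
    return OPERATING_POINTS[-1]  # saturate at the fastest point

print(select_operating_point(350))  # → (400, 0.9)
```

The energy saving comes entirely from dropping to a lower table entry when demand is low; the pessimism remains, because every entry in the table still carries its worst-case PVT margin.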
The canary-circuit based approach eliminates a subset of these worst-case margins by using a delay-chain which mimics the critical path of the actual design. The propagation delay through this replica path is monitored and the voltage and frequency are scaled until the replica path just about fails to meet timing. The replica path tracks the critical path delay across inter-die process variations and global fluctuations in supply voltage and temperature, thereby eliminating margins due to global PVT variations. However, the replica path does not share the same ambient environment as the critical path because its on-chip location differs. Consequently, margins are added to the replica path in order to budget for delay mismatches due to on-chip variation and local fluctuations in temperature and supply voltage. Margins are also required to address fast-changing transient effects, such as coupling noise, which are difficult to respond to in time with this approach. Furthermore, mismatches in the scaling characteristics of the critical path and its replica require additional safety margins. These margins ensure that the processor still operates correctly at the point of failure of the replica path.
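The control loop of such a canary scheme can be sketched in simplified form. This is a hypothetical model, not any real design: the replica-path delay is represented by an invented voltage-to-delay function, and the added margin term stands in for the on-chip-variation budget the paragraph above describes.

```python
# Illustrative sketch only: the control loop of a canary-circuit scheme,
# with the replica path modelled by a hypothetical delay/voltage curve.

def replica_delay_ns(vdd):
    """Toy model: replica-path delay grows as supply voltage drops."""
    return 1.0 / (vdd - 0.5)  # invented curve, for illustration only

def scale_voltage(clock_period_ns, vdd=1.2, step=0.01, margin_ns=0.05):
    """Lower the supply until the replica path 'just about' fails to meet
    timing, then stop. The extra margin_ns budgets for delay mismatch
    between the replica and the real critical path (on-chip variation,
    local voltage/temperature fluctuations)."""
    while replica_delay_ns(vdd - step) + margin_ns <= clock_period_ns:
        vdd -= step  # replica still passes with margin: keep scaling down
    return round(vdd, 3)

print(scale_voltage(clock_period_ns=2.0))  # → 1.02
```

Note that the loop never observes the critical path itself, only the replica plus a fixed margin; that margin is exactly the residual pessimism that in-situ schemes such as Razor set out to remove.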
To eliminate worst-case safety margins, ARM Limited developed a novel voltage and frequency management technique for Dynamic Voltage and Frequency Scaled (DVFS) processors, based on in-situ error detection and correction, called Razor. The basic Razor technique is described in U.S. Pat. No. 7,278,080, the entire contents of which are hereby incorporated by reference. In accordance with this technique, a delay-error tolerant flip-flop is used on critical paths to scale the supply voltage to the point of first failure (PoFF) of a die for a given frequency. Thus, all margins due to global and local PVT variations are eliminated, resulting in significant energy savings. In addition, the supply voltage can be scaled even lower than the first failure point into the sub-critical region, deliberately tolerating a targeted error rate, thereby providing additional energy savings. Thus, in the context of Razor, a timing error is not a catastrophic system failure but a trade-off between the overhead of error correction and the additional energy savings due to sub-critical operation.
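By way of contrast with the canary loop, the Razor-style voltage-control policy can be sketched as follows. This is an illustrative toy, not the patented mechanism: the die's error behaviour is modelled by an invented function, whereas a real system would read hardware error counters from the delay-error tolerant flip-flops on the critical paths.

```python
# Illustrative sketch only: voltage control driven by in-situ error
# detection. The error model is hypothetical; a real implementation
# reads error counters from delay-error tolerant flip-flops.

def errors_per_million(vdd):
    """Toy error model: no timing errors above the point of first
    failure (here 0.90 V), then a rapidly growing error rate below it."""
    return 0 if vdd >= 0.90 else int((0.90 - vdd) * 1e4)

def scale_to_error_target(vdd=1.1, step=0.01, target_epm=0):
    """Lower the supply until the observed error rate exceeds the target.
    target_epm = 0 stops at the point of first failure (PoFF), eliminating
    global and local PVT margins; a positive target deliberately enters
    the sub-critical region, trading error-correction overhead for
    additional energy savings."""
    while errors_per_million(vdd - step) <= target_epm:
        vdd -= step
    return round(vdd, 2)

print(scale_to_error_target())                # → 0.9  (PoFF)
print(scale_to_error_target(target_epm=150))  # → 0.89 (sub-critical)
```

The key difference from the replica-path sketch is that no margin term appears: because errors are detected (and corrected) in situ on the real critical paths, the loop can drive the supply all the way to, or deliberately past, the actual point of first failure.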
Other papers that describe adaptive techniques are the following:
Razor II: In Situ Error Detection and Correction for PVT and SER Tolerance—IEEE Journal of Solid-State Circuits (JSSC), Vol 44, No. 1, January 2009;
Energy-Efficient and Metastability-Immune Resilient Circuits for Dynamic Variation Tolerance—IEEE JSSC, Vol 44, No. 1, January 2009;
A Simple Flip-Flop Circuit for Typical-Case Designs for DFM—ISQED 2007;
Reducing Pipeline Energy Demands with Local DVS and Dynamic Retiming—ISLPED 2004;
Fine Grain Redundant Logic Using Defect Prediction Flip-flops—ISSCC 2007;
A Power-efficient ARM ISA Processor using Timing-error Detection and Correction for Transient-error Tolerance and Adaptation to PVT Variation—ISSCC 2010; and
“Hardware Self-Tuning and Circuit Performance Monitoring”, by T Kehl, Department of Computer Science and Engineering, University of Washington, Seattle, published 1993.
The prior art listed above is primarily based on techniques which seek to detect performance failures in the functional element through the late arrival of timing signals, with the sensitised logic path then needing to be re-evaluated by replaying the operation or operations that failed.
Whilst techniques which detect performance failures, and then replay the operation or operations that failed, can significantly improve performance, they increase complexity by requiring the design to incorporate rollback and replay mechanisms in the event that errors are detected. Further, various data processing systems will have a requirement for correct operation at all times, with that requirement outweighing absolute performance. Such systems would find it acceptable to relinquish some of the performance available from a Razor-type system if it could be guaranteed that the system would always operate correctly, such that there would be no requirement to incorporate rollback or replay mechanisms. However, there is still a need to improve performance relative to the earlier-discussed margining techniques.