Enabling systems to run at multiple frequency and voltage levels is a challenging process and requires characterization of the processor to ensure its correct operation at the required operating points. This means ensuring correct processor operation with respect to a number of environmental and process-related variabilities, such as unexpected voltage drops in the power supply network, temperature fluctuations, gate-length and doping concentration variations, etc. A minimum possible supply voltage for a given maximum operating clock frequency, referred to as the critical supply voltage, must ensure correct operation of the design when accounting for these variabilities. These variabilities may be data dependent and are composed of local and global components. For instance, local process variations impact specific regions of the die in different and independent ways, while global process variation impacts the circuit performance of the entire die and creates variation from one die to the next. Similarly, temperature and supply drop have local and global components, while cross-coupling noise is a predominantly local effect.
With technology scaling, the local component of environmental and process variation is becoming more and more prominent, and the sensitivity of circuit performance to these variations is even higher at lower operating voltages. For example, variation in circuit speed increases significantly. Delay variations may increase with the distance between logic gates, since nearby gates are more strongly correlated (i.e. spatially correlated). Similarly, random variations in timing become more and more dominant.
Assuring correct operation of a design translates to assuring correct operation of its timing critical paths. A timing critical path is, for example, the longest path between (i) an input and a first sequential element, (ii) two sequential elements (between two clocked flip-flops) or (iii) a sequential element and an output, and is characterized by the smallest path slack, which is the difference between the clock period and the maximal propagation delay. The timing critical path defines the maximum operating clock frequency. For example, if the frequency is too high (i.e. the logic between the two sequential elements is too slow for the given frequency), the data signal will arrive too late at the input of the next (also referred to as receiving) sequential element to be captured properly, resulting in erroneous operation. This problem is called a setup timing violation.
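The slack computation described above can be sketched as follows. This is a minimal illustration, assuming ideal clocking (no clock skew and no flip-flop setup time); the path names and delay values are invented for the example.

```python
# Illustrative setup-slack check for a set of timing paths.
# Assumes ideal clocking: no skew, no setup time at the receiving flip-flop.

def path_slack(clock_period_ns: float, max_path_delay_ns: float) -> float:
    """Slack = clock period minus worst-case propagation delay.

    Negative slack means the data arrives too late at the receiving
    flip-flop, i.e. a setup timing violation.
    """
    return clock_period_ns - max_path_delay_ns

def critical_path(paths: dict[str, float], clock_period_ns: float) -> tuple[str, float]:
    """Return the path with the smallest slack; it limits the maximum clock frequency."""
    name = min(paths, key=lambda p: path_slack(clock_period_ns, paths[p]))
    return name, path_slack(clock_period_ns, paths[name])

# Hypothetical worst-case delays in ns for three path types (i)-(iii).
paths = {"in->ff1": 1.8, "ff1->ff2": 2.4, "ff2->out": 2.1}
name, slack = critical_path(paths, clock_period_ns=2.5)
# "ff1->ff2" has the smallest slack (about 0.1 ns), so it fails first
# if the clock period is reduced or the supply voltage is lowered.
```
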
Any design comprising a plurality of circuits or circuit blocks (or chips) has timing critical paths. Even when designed to meet the timing requirements easily, these paths will fail first, for example, when the design is in a (very) slow process corner or when dynamic voltage and frequency scaling (DVFS) is applied. As designs in modern technologies (such as 40 nm and below) suffer from high process variability, which is worsened when operating at reduced supply voltages, for marginal voltage-frequency combinations some chips will work while others (with slower devices) will not. What is more, due to within-die variations, the first failing path will differ from die to die.
To guarantee correct operation of a digital design, a timing closure analysis is performed which adds a certain margin to the minimal supply voltage and/or the maximal frequency to account for variations in the timing behaviour of the fabricated circuits and for effects not covered in the timing closure analysis. These can be due to intra- or inter-die process-voltage-temperature (PVT) variations, e.g. supply voltage drops, which may also vary over time, temperature fluctuations and ageing. Thus, traditional DVS techniques, for example, as disclosed by M. Nakai et al., “Dynamic Voltage and Frequency Management for a Low-Power Embedded Microprocessor,” IEEE J. Solid-State Circuits, vol. 40, no. 1, January 2005, use canary circuits to mimic the critical path delay of the actual design. However, the canary circuits require significant voltage safety margins (adding up to 50% of the total energy budget) to guarantee computational correctness at the worst-case combination of intra-die process variations and local fluctuations in voltage and temperature, leading to a loss in energy efficiency. They also have difficulty responding to rapidly changing conditions. Moreover, these approaches cannot compensate for mismatches in PVT tracking between the actual critical path and the modelled paths, within-die variations between the locations of the monitor path and the critical path, random variations between the monitor path and the critical path, the limited response time of the monitor circuit to fast-changing conditions, or ageing (at different rates), and therefore require additional safety margins on top of the critical voltage.
Razor-based DVS techniques have been proposed, for example by D. Ernst et al., “Razor: A Low-Power Pipeline Based on Circuit-Level Timing Speculation”, MICRO-36, December 2003, pp. 7-18, based on dynamic detection and correction of circuit timing errors. This approach uses an error detection and correction mechanism to eliminate the safety margins due to intra-die and local PVT variations while tolerating a limited number of errors. These techniques allow the supply voltage to be reduced automatically to the point of first failure (PoFF). However, this technique does not offer a good trade-off between the overhead due to re-computation and the energy saved. Thus, only a small number of failures can be tolerated. Further, since the error correction hardware is part of the circuitry (sequential elements), its area and power footprint are increased.
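The Razor-style detection principle, a main flip-flop whose captured value is compared against a shadow latch that samples the same data slightly later, can be sketched behaviourally as follows. This is a simplified timing model for illustration only; the function name and parameters are assumptions, not taken from the cited paper.

```python
# Behavioural sketch of Razor-style in-situ error detection.
# A main flip-flop samples at the clock edge; a shadow latch samples
# the same data a fixed delay later. A mismatch flags a timing error
# that can be repaired by re-computation (replay).

def razor_sample(data_arrival_ns: float, clock_period_ns: float,
                 shadow_delay_ns: float) -> tuple[bool, bool]:
    """Return (error_detected, recoverable) for one captured value.

    The main flip-flop captures a wrong value if the data arrives
    after the clock edge; the shadow latch still captures the correct
    value as long as the data settles within the shadow window, so a
    mismatch between the two indicates a detectable, correctable error.
    """
    main_ok = data_arrival_ns <= clock_period_ns
    shadow_ok = data_arrival_ns <= clock_period_ns + shadow_delay_ns
    error_detected = (not main_ok) and shadow_ok
    return error_detected, shadow_ok

# Data settles 0.2 ns after the 2.5 ns clock edge: the main flip-flop
# is wrong, the shadow latch is right, so the error is detected and
# can be corrected by re-computation.
err, recoverable = razor_sample(data_arrival_ns=2.7, clock_period_ns=2.5,
                                shadow_delay_ns=0.5)
```

The energy trade-off mentioned above follows directly from this model: every detected error costs a replay, so operation below the PoFF only pays off while the error rate stays very low.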
These shortcomings are partly solved by the newer Razor approach disclosed by D. Blaauw et al., “Razor II: In Situ Error Detection and Correction for PVT and SER Tolerance,” ISSCC Dig. Tech. Papers, pp. 400-1, February 2008, which detects failures of the actual circuit and triggers re-computation. The main advantage of Razor II is a reduced re-computation overhead, achieved by reusing hardware already available in common processors (e.g. recovery from wrong branch prediction). This allows lowering the supply voltage and/or frequency to the PoFF (or even up to a failure rate of 0.1% of all computed results), which eliminates the safety margins. However, this approach has proved problematic in observing all potentially functionally failing flip-flops.
In more recent work, patent application US 2009/0031268 A1 proposes an in-situ canary circuit, a combination of classical critical path monitors and the in-situ Razor approach. Here, the sequential elements of the most timing critical paths are duplicated (canary circuits). The duplicates have an increased delay on the data input, thereby making them more timing critical. Similarly to the Razor approach, the in-situ canary circuits observe the actual timing critical path and thus track it exactly across PVT. However, the tracking of random variation is limited to the duplicated storage elements. Some margin must still be preserved for extreme cycle-to-cycle timing variations due to changes in the actual data being processed.
In another example, Martin Wirnshofer et al., “A variation-aware adaptive voltage scaling technique based on in-situ delay monitoring”, 2011 IEEE 14th International Symposium on Design and Diagnostics of Electronic Circuits & Systems (DDECS), 13 Apr. 2011, discloses an adaptive voltage scaling (AVS) scheme in combination with in-situ delay monitoring. The AVS scheme reduces the supply voltage as long as no critical timing events are observed during an observation interval. Assume the voltage is scaled based on a scenario A. In the next observation interval, the system runs in a second scenario B in which the critical timing path is longer than that in scenario A. In this case, the voltage level will be insufficient to support error-free operation in scenario B. The system remains at the voltage level of scenario A, generating multiple errors until a signalling threshold is reached (a critical timing event occurs) that leads to a correction of the supply voltage. Since the voltage correction is based only on this first signalling event, the approach is prone to errors, and for signalling rates on the order of seconds it would be unsuitable for dynamic tracking.
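The interval-based control loop described above can be sketched as follows. This is an illustrative model only; the function name, voltage bounds, step size and threshold are assumptions chosen for the example, not values from the cited work.

```python
# Illustrative sketch of an interval-based AVS control loop:
# lower the supply while no critical timing events are observed in an
# observation interval, and back off once the signalling threshold is hit.

def avs_step(vdd_mv: int, events_in_interval: int,
             threshold: int = 1, step_mv: int = 10,
             vdd_min_mv: int = 600, vdd_max_mv: int = 1100) -> int:
    """Apply one observation interval of the AVS controller."""
    if events_in_interval >= threshold:
        vdd_mv += step_mv   # critical timing events observed: raise the supply
    else:
        vdd_mv -= step_mv   # no events: keep scaling the supply down
    return max(vdd_min_mv, min(vdd_max_mv, vdd_mv))

# Scenario A: three quiet intervals let the voltage creep down from 1000 mV.
vdd = 1000
for _ in range(3):
    vdd = avs_step(vdd, events_in_interval=0)
# Scenario B: a longer critical path now produces errors, but the
# controller only reacts once the threshold has been signalled, i.e.
# after errors have already occurred at the too-low scenario-A voltage.
vdd = avs_step(vdd, events_in_interval=2)
```

The sketch makes the criticism explicit: between the scenario change and the signalling event, the loop keeps running at the insufficient scenario-A voltage, which is why slow signalling rates make this scheme unsuitable for dynamic tracking.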