Technical Field
This disclosure relates to asynchronous circuits and to their design.
Description of Related Art
Traditional synchronous designs may incorporate timing margin to ensure correct operation under worst-case delays caused by process, voltage, and temperature (PVT) variations as well as data-dependency, K. Bowman, J. Tschanz, N. S. Kim, J. Lee, C. Wilkerson, S. Lu, T. Karnik, and V. De, “Energy-Efficient and Metastability-Iimmune Resilient Circuits for Dynamic Variation Tolerance,” IEEE JSCC, vol. 44, no. 1, pp. 49-63, January 2009. Both synchronous and asynchronous designs have been proposed to address this problem.
--Asynchronous Solutions
Different asynchronous templates have been proposed to address increased variation in delay (e.g., A. Yakovlev, P. Vivet, and M. Renaudin, “Advances in Asynchronous Logic: From Principles to GALS & NoC, Recent Industry Applications, and Commercial CAD Tools,” in DATE, March 2013, pp. 1715-1724). Quasi-delay-insensitive (QDI) templates may use completion signal logic, which may make them robust to delay variations at the cost of increased area and high switching activity due to a return to zero paradigm, P. Beerel, R. Ozdag, and M. Ferreti, A Designer's Guide to Asynchronous VLSI. plus 0.5 em minus 0.4 emCambridge University Press, 2010. Bundled-data templates (e.g., micropipelines, I. E. Sutherland, “Micropipelines,” Commun. ACM, vol. 32, no. 6, pp. 720-738, June 1989) may use delay lines matched to single-rail combinational logic, providing a low area, low switching activity asynchronous solution (e.g., J. Cortadella, A. Kondratyev, L. Lavagno, and C. Sotiriou, “Desynchronization: Synthesis of asynchronous circuits from synchronous specifications,” IEEE Trans. on CAD, vol. 25, no. 10, pp. 1904-1921, October 2006). However, the delay lines may need to be implemented with sufficiently large margins in the presence of on-chip variations, reducing the advantages of this approach. Researchers have proposed different solutions to mitigate these margins, such as duplicating the bundled-data delay lines. I. J. Chang, S. P. Park, and K. Roy, “Exploring Asynchronous Design Techniques for Process-tolerant and Energy-Efficient Subthreshold Operation,” IEEE JSSC, vol. 45, no. 2, pp. 401-410, February 2010, constraining the design to regular structures such as PLAs, N. Jayakuma, R. Garg, B. Gamache, and S. Khatri, “A PLA Based Asynchronous Micropipelining Approach for Subthreshold Circuit Design,” in DAC, 2006, pp. 419-424, and using soft latches, J. Liu, S. Nowick, and M. Seok, “Soft Mousetrap: A Bundled-Data Asynchronous Pipeline Scheme Tolerant to Random Variations at Ultra-Low Supply Voltages,” in ASYNC, May 2013, pp. 1-7.
--Razor I, II, and Lite
As low-power designs become more prominent, dynamic voltage scaling has gained popularity to reduce energy consumption in synchronous circuits. However, increased margins due to variability in gate delays at low voltages can be a major concern with this approach. Razor-type architectures aim to alleviate the performance impact due to these increased margins by adding error detection and correction circuits to the design, D. Ernst, N. S. Kim, S. Das, S. Pant, R. Rao, T. Pham, C. Ziesler, D. Blaauw, T. Austin, K. Flautner, and T. Mudge, “Razor: A Low-Power Pipeline Based on Circuit-Level Timing Speculation,” in Microarchitecture, 2003. MICRO-36. Proceedings. 36th Annual IEEE/ACM International Symposium on, December 2003, pp. 7-18. The original Razor design utilizes a “Razor flip-flop”, which consists of a main flop connected to an early clock and a second latch connected to a late clock. Subsequently, the input data is double-sample by the two different clocks and the values of both the main flop and secondary latch are compared to determine if an error occurred. When an error is detected, the “good” value that was sampled later is re-latched into the main flop, which is then passed back into the datapath. At the system-level, a pipeline controller may stall or flush instructions in previous stages to prevent data contamination. This operation may require tight integration into the original design to ensure that instructions can be reliably stopped, flushed, and replayed without impacting overall data integrity. In this design, the performance penalty is theoretically limited to one cycle; however, in practice the implementation of the Razor correction circuits in high-speed designs can be a bottleneck, leading to poor performance overall.
RazorII was proposed to solve some of the shortcomings of the original Razor design, S. Das, C. Tokunaga, S. Pant, W.-H. Ma, S. Kalaiselvan, K. Lai, D. Bull, and D. Blaauw, “RazorII: In situ Error Detection and Correction for PVT and SER Tolerance,” IEEE JSCC, vol. 44, no. 1, pp. 32-48, January 2009. In particular, it utilizes even more tightly integrated architectural-level changes for error correction, forgoes the flop and latch configuration in favor of a single latch plus a transition detector, and moves the possible point of metastability from the datapath to the control path. The RazorII flop's primary storage mechanism is a latch, which removes the possibility of metastability occurring on the rising edge of clock. It also “corrects” its output without re-latching as the latch remains transparent for the entire high phase of the clock. During this time, the transition detector monitors the input data and will generate a flag signal when a transition occurs, indicating a timing error. This error signal can subsequently become metastable, as the input data and falling edge of clock can arrive simultaneously. The designers use a standard two-flop synchronizer in an attempt to resolve metastability before it enters the control circuit; however, this may not be a reliable method, as it only accounts for cases when metastability resolves fairly quickly (i.e. within a single cycle). Additionally, it enforces a one-cycle delay on error detection, which may further complicate the correction algorithm and circuitry. Unlike the original Razor, multiple pipeline stages may need to be flushed and the instruction may need to be replayed multiple times, occasionally at half the original system clock rate, until the error is resolved, potentially limiting the potential benefits of the RazorII system. Hold times can also be problematic, as the combinational logic delay between stages may need to be at least as long as the high phase of the clock to ensure new data does not race through the latch-based design.
More recently, Razor Lite has attempted to address the overhead and hold time issues of RazorII by integrating the transition detection more directly into a typical flop-flop design and reducing the timing detection window by reducing the duty cycle of the clock, S. Kim, I. Kwon, D. Fick, M. Kim, Y.-P. Chen, and D. Sylvester, “Razor-lite: A side-channel error-detection register for timing-margin recovery in 45 nm soi cmos,” in Solid-State Circuits Conference Digest of Technical Papers (ISSCC), 2013 IEEE International, February 2013, pp. 264-265. However, it may still require tight architectural integration for the replay and correction mechanism, may suffer from metastability, and may incur high performance penalties when an error occurs.
--Timber
TIMBER is similar to Razor-II in that it primarily uses latches to avoid metastability in the datapath, M. Choudhury, V. Chandra, K. Mohanram, and R. Aitken, “Timber: Time Borrowing and Error Relaying for Online Timing Error Resilience,” in DATE, March 2010, pp. 1554-1559. However, the time-borrowing nature of latches is exploited to allow error correction across multiple stages. For example, an error occurring in stage 1 may be resolved as it propagates through non-critical paths in stage 2, thereby preventing an error from being flagged in stage 2. In the case when an error may extend across multiple stages, a global error detection circuit may temporarily slow the clock to until the error is resolved. However, this design may still requires architectural changes to adjust the clock frequency, which in many designs may not be scaled on a cycle-by-cycle bases as proposed. Additionally, the authors may be incorrectly assuming that using a latch-based datapath prevents metastability in the control path as well as the datapath. They may not filter or attempt to resolve metastability issues in their global control circuit, which can lead to low mean-time-between-failures (MTBF).
--Bubble Razor
Bubble Razor (BR) inherits the features of previous Razor techniques enabling real-time error detection and correction, M. Fojtik, D. Fick, Y. Kim, N. Pinckney, D. Harris, D. Blaauw, and D. Sylvester, “Bubble Razor: Eliminating Timing Margins in an ARM cortex-M3 Processor in 45 nm CMOS Using Architecturally Independent Error Detection and Correction,” IEEE JSCC, vol. 48, no. 1, pp. 66-81, January 2013; M. Fojtik, D. Fick, Y. Kim, N. Pinckney, D. Harris, D. Blaauw, and D. Sylvester, “Bubble Razor: An Architecture-Independent Approach to Timing-error Detection and Correction,” in Solid-State Circuits Conference Digest of Technical Papers (ISSCC), 2012 IEEE International, February 2012, pp. 488-490. Unlike other Razor architecture, it is based on a two-phase latch-based design, in which each traditional flip-flop is replaced with two latches that undergo retiming to have approximately equal amount of logic between each latch. It uses a bubble propagation algorithm that makes the approach applicable to any architecture and enables the automatic application of this technique to legacy flip-flop based RTL designs, significantly reducing barriers to adoption.
Bubble Razor flags a timing violation when the data arriving at a latch varies after the latch opens using an error detecting latch (EDL). Upon detecting a timing violation, the circuit may automatically recover by stalling the subsequent latch, giving it an additional clock cycle to process the data. Half of the additional clock cycle is used to compensate for the unexpectedly large delay from the previous latch and the other half accounts for the delay from the current latch to the subsequent one. Thus, timing violations may be corrected, as long as the real delay of each half clock-cycle step never exceeds one clock cycle of time. However, to ensure correct operation, stalling the subsequent latch may not be sufficient. Upstream stages may need to be stalled to ensure valid data is not overrun and downstream stages must be stalled to ensure corrupt data is not accidentally interpreted as valid.
The latch-based scheme in BR enables an automatic local stall propagation algorithm without modifying the original RTL design. Consider the 2-stage ring in FIG. 1(a) that has 4 latches with associated clock gating logic that implements the stall propagation algorithm. A timing violation may cause an error signal to be sent to its Right Neighbor (RN) to tell it to stall. Then, the stalling may spread both forward and backward directions around the ring in a wave-like pattern. For example, in FIG. 1, the timing violation occurs in latch 2 and this may trigger a stall in latch 3. The clock gating logic for latch 3 then spreads the stall forward to stage 4 and backward to latch 2. Clock gating logic that receives stalls from both directions terminates the spreading of stalls. This is called stall annihilation. For example, in FIG. 1(b), the stall is terminated by the clock gating logic of latch 1 because it receives stalls from both of its neighbors, i.e., latches 2 and 4.
Unlike other Razor schemes, one significant weakness of Bubble Razor may be that it does not consider the impact of metastability in the error detecting logic. As the shadow latch closes at a time when errors are expected to happen at some frequency, metastability at the output of the shadow latch may occur. The metastable state may propagate through the error detection logic (XOR followed by a dynamic OR gate). If this state persists for longer than half a clock cycle, it may be latched into the control logic resulting in a system failure. This oversight can significantly reduce the mean time before failure for many applications.