There is a continuing need for digital circuits and systems which are high-speed, robust (i.e. error-free under all possible operating conditions regardless of the fabrication process used and variations thereof), and have low power dissipation. In recent years, this need has become stronger due to the increasing demand for portable electronic devices to have longer battery lives, increased functionality/intelligence within a given power budget, and operational robustness/accuracy. Examples of such portable electronic devices include cellular phones, notebooks, audio players, smart cards, network sensors, bio-medical devices, security and military devices, etc.
The electromagnetic interference (EMI) of electronic devices is also an important design issue. Virtually all electronic devices have to meet certain electromagnetic compatibility (EMC) standards before they can be marketed. Furthermore, some security and military applications, for example cryptography applications, require ultra-low EMI, as EMI emissions are a common source of information used by hackers to decipher security data present in these applications.
Therefore, digital circuits and systems that simultaneously offer operational robustness, high speed, low power dissipation and low EMI are highly desirable in the manufacture of electronic devices for today's applications. However, digital circuits and systems operating at high speeds switch quickly and hence their power dissipation and EMI tend to be higher. To date, design techniques attempting to overcome this trade-off have been developed, but the performance of these techniques remains unsatisfactory. Such design techniques can be broadly categorized into synchronous-logic-based techniques and asynchronous-logic-based techniques as described below.
Synchronous-Logic-Based Techniques
Since Moore's law was conceptualized in 1965, several techniques aiming to achieve digital circuits and systems with high speeds and low power dissipation have been developed based on the synchronous-logic design methodology, in which a global clock signal (or its variants) is used to synchronize digital operations. Details of the synchronous-logic design methodology can be found in J. M. Rabaey et al. [5].
In particular, one of the key design issues in synchronous-logic design methodology relates to achieving robust operations under the synchronous operational modality where a pre-defined clock timing closure needs to be strictly abided by. More specifically, each digital operation has to be computed and ready within a clock period. To achieve a digital circuit or system which abides by the pre-defined clock timing closure, several clock-relevant timing assumptions under various possible process and operating conditions (generally termed as Process-Voltage-Temperature (PVT) variations) have to be made. The digital circuit or system can only be robust if these timing assumptions hold.
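The requirement that each operation be ready within a clock period can be sketched as a simple setup-slack check. This is a minimal illustration using a common textbook form of the setup constraint, not a formula taken from this document; the function name, parameter names and all numeric values are hypothetical.

```python
def setup_slack_ns(clock_period_ns, clk_to_q_ns, logic_delay_ns, setup_ns, skew_ns):
    """Return the setup slack in ns; a positive value means the timing assumption holds."""
    required = clk_to_q_ns + logic_delay_ns + setup_ns + skew_ns
    return clock_period_ns - required


# Hypothetical numbers, chosen only for illustration.
nominal = setup_slack_ns(1.0, 0.10, 0.70, 0.05, 0.05)

# Same path, but with the logic delay 40% slower to mimic a PVT corner.
slow_corner = setup_slack_ns(1.0, 0.10, 0.70 * 1.4, 0.05, 0.05)

print(f"nominal slack:     {nominal:+.3f} ns")
print(f"slow-corner slack: {slow_corner:+.3f} ns")
```

In this sketch the path meets timing at the nominal point but fails once the logic delay grows, which is the situation the timing assumptions are meant to rule out.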
Besides using design methods aiming to reduce switched capacitances and switching activities at different levels (spanning from the system-level down to the layout- or device-layer), current techniques based on the synchronous-logic design methodology also use transistors with smaller feature sizes (achieved with advanced deep submicron or nano-scaled silicon fabrication processes) as this allows the scaling down of the supply voltages. However, it is well-known that PVT variations in digital circuits and systems tend to increase as the feature sizes of transistors in the circuits and systems are scaled downwards. This in turn results in larger electrical variations in the digital circuits and systems, affecting the validity of the timing assumptions.
Table I shows the possible effects of smaller transistor feature sizes on electrical variations in digital circuits. More specifically, Table I is obtained from the International Technology Roadmap for Semiconductors in year 2011 (ITRS-2011) and tabulates possible electrical variations in digital circuits if these circuits are fabricated using current and possible future fabrication processes. The electrical variations in Table I are expressed in terms of the variations in the process parameters (% Process Parameter Uncertainty), variations in the threshold voltage including all sources of such variations (% Vt variability; all sources), variations in the circuit performance e.g. the circuit delay (% Circuit performance variability), variations in the total power consumption (% Circuit total power variability) and variations in the power leakage (% Circuit leakage power variability). As can be seen from Table I, the electrical variations in the digital circuits are expected to increase as the feature sizes of the transistors in the circuits decrease (from 40 nm to 6.3 nm).
TABLE I
Year                               | 2011  | 2012  | 2013  | 2014  | 2015  | . . . | 2026
Fabrication Process                | 40 nm | 32 nm | 28 nm | 24 nm | 21 nm | . . . | 6.3 nm
% Process Parameter Uncertainty    | 11%   | 12%   | 14%   | 15%   | 18%   | . . . | 38%
% Vt variability; all sources      | 42%   | 42%   | 42%   | 47%   | 47%   | . . . | 79%
% Circuit performance variability  | 42%   | 42%   | 42%   | 45%   | 45%   | . . . | 60%
% Circuit total power variability  | 51%   | 51%   | 51%   | 55%   | 55%   | . . . | 81%
% Circuit leakage power variability| 126%  | 126%  | 126%  | 129%  | 129%  | . . . | 148%
The possible effects of smaller transistor feature sizes on electrical variations in digital circuits are further illustrated in FIGS. 1(a) and (b). In particular, FIG. 1(a) illustrates the possible soft error rates of two digital circuit types (the inverter and the clocked latch) at nominal supply voltage if these circuit types are fabricated using current and possible future fabrication process technologies. FIG. 1(b) illustrates the possible soft error rates of the clocked latch at different supply voltages VDD if the clocked latch is fabricated using the 16 nm, 22 nm and 32 nm process technologies. More specifically, FIG. 1(b) shows how the soft error rates of each clocked latch fabricated using a different technology are expected to change as the supply voltage VDD is varied within ±10%. The soft error rates shown in FIGS. 1(a)-(b) are also obtained from the ITRS-2011.
To a certain extent, the inverter can be seen as a representative of combinational logic as it is present in virtually all digital circuits and systems, whereas the clocked latch can be seen as a representative of sequential logic as it is one of the critical building blocks for synchronous-logic circuits and systems. From FIG. 1(a), it can be seen that as the feature sizes of the transistors decrease, the error rates for both the clocked latch and the inverter are expected to increase. This can also be seen from FIG. 1(b) which shows the clocked latch fabricated with 16 nm CMOS technology having the highest predicted soft error rates for all supply voltages. FIG. 1(b) also shows that regardless of the fabrication process technology, the error rates of the clocked latch are expected to increase as the supply voltage VDD decreases.
Furthermore, FIG. 1(a) allows a comparison between the error rates of the clocked latch and those of the inverter. The inverter serves as a good circuit type for comparison of error rates, as it is a simple digital circuit and hence its error rate can be used as a lower bound for the error rates of digital circuits. From FIG. 1(a), it can be seen that the clocked latch has significantly more operational errors than the inverter. This is probably due to the clock synchronization issues which are present in the clocked latch but not in the inverter. In particular, for the 12 nm process technology which may possibly be available in future, the error rate of the clocked latch can reach above 10%. This can potentially cause difficulties in designing the digital circuit.
Robust operations can only be guaranteed if the PVT variations issues are fully addressed. However, it is difficult to ensure this and thus, “pessimistic” design practices with large safety timing margins are usually adopted for synchronous-logic circuits and systems. Such design practices tend to slow down the operations of the synchronous-logic circuits and systems.
Furthermore, although under a pre-defined clock timing closure (clock skew, setup-time, hold-time, critical-path timing etc.), a synchronous-logic circuit or system could theoretically be clocked to its maximum speed, such a circuit or system is impractical. This is because the clock infrastructure of a synchronous-logic circuit or system is often “power-hungry” i.e. consumes a large amount of power and this amount of power consumed by the clock infrastructure tends to increase as the clock frequency increases. This in turn results in high power dissipation, causing reliability or packaging issues. Furthermore, a synchronous-logic circuit or system clocked at a high speed tends to emit high EMI as a large amount of current is drawn virtually simultaneously during every clock edge. Therefore, the potential of synchronous-logic circuits and systems in achieving high-speed digital operations is limited, as reflected in how clock frequencies of microprocessors have “stalled” at 1 GHz to 3 GHz for several years.
To date, design issues relating to PVT variations, speed, power dissipation and EMI of synchronous-logic digital circuits and systems are only in part addressed. A brief summary of techniques that have been developed to address these issues is provided below.
In particular, example techniques that have been used to alleviate the impact of PVT variations on the robustness of digital circuits and systems include highly controlled but expensive fabrication processes, closed-loop monitoring circuitry and adaptive biasing etc. In general, these techniques attempt to reduce the PVT variations and timing variations of the digital circuits and systems by means of better fabrication technologies and/or intensive statistical timing analyses. An overview of these techniques can be found in references [1] and [10]-[13].
To improve speed, current digital circuits and systems often adopt nano-scaled fabrication methods, together with techniques such as aggressive timing control, parallelism and pipelining, and dynamic logic etc. The premise of these techniques is to reasonably predict the computation times required by the digital operations, and to reduce the delays of these operations as much as possible. A good overview of these techniques can be found in references [5], [8], [9] and [12].
The use of nano-scaled fabrication methods also helps to reduce power dissipation. On top of these methods, current digital circuits and systems also often adopt techniques such as dynamic voltage and frequency scaling, clock gating, power gating, multi-threshold control, parallelism and pipelining etc. to further reduce the power dissipation. The premise of these techniques is to reduce operating supply voltages, switching activities, switching frequencies, parasitic capacitance and leakage currents. A good overview of these techniques can be found in references [5] and [14]-[16].
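The reason these quantities are targeted can be seen from the usual first-order model of CMOS switching power, P = a·C·VDD²·f. The sketch below uses that textbook approximation, which is an assumption here rather than a formula from this document; it ignores leakage and short-circuit power, and the function name and numeric values are hypothetical.

```python
def dynamic_power_w(activity, capacitance_f, vdd_v, freq_hz):
    """First-order switching power: activity * C * VDD^2 * f (leakage ignored)."""
    return activity * capacitance_f * vdd_v ** 2 * freq_hz


# Hypothetical operating point: 10% activity, 1 nF switched capacitance, 1 GHz.
base = dynamic_power_w(0.1, 1e-9, 1.0, 1e9)

# Scaling only the supply voltage from 1.0 V to 0.8 V.
scaled = dynamic_power_w(0.1, 1e-9, 0.8, 1e9)

ratio = scaled / base
print(f"power after voltage scaling: {ratio:.2f} of baseline")
```

Because the supply voltage enters quadratically in this model, lowering it tends to pay off more than an equal relative reduction in activity, capacitance or frequency.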
To reduce EMI, techniques such as using careful layout implementations, using clock synthesis, shielding, increasing wire spacing to reduce transmission line effect etc. are often adopted. A good overview of these techniques can be found in references [5] and [20].
Note that although the above-mentioned techniques are largely intended for synchronous-logic circuits and systems, some of the techniques may also be used for hybrid synchronous/asynchronous-logic circuits and systems.
Despite the development of the above techniques, digital circuits and systems based on synchronous-logic design methodology (and those based on hybrid synchronous/asynchronous-logic design methodology) are still unsatisfactory. Due to the large timing variations in circuits and systems fabricated by nano-scaled fabrication processes, it remains challenging to realize synchronous-logic circuits and systems that fully satisfy the timing assumptions. In fact, robust high-speed operations in synchronous-logic circuits and systems would almost never be guaranteed unless the PVT variations issues have been fully addressed. Furthermore, due to their complex clock infrastructure, synchronous-logic circuits and systems still tend to have high power dissipation and high EMI. To alleviate the effects of the PVT variations and the complex clock infrastructure, the speeds of synchronous-logic circuits and systems often have to be compromised.
Asynchronous-Logic-Based Techniques
The asynchronous-logic approach is in some ways advantageous over the synchronous-logic approach as it allows for more design simplicity and operational robustness. This is largely because asynchronous-logic circuits and systems are self-timed i.e. there is no need for a global clock signal for data synchronization. Instead, the asynchronous-logic approach achieves data synchronization by using a set of handshake protocols. Using the asynchronous-logic approach also helps in achieving lower EMI. This is because while synchronous-logic digital operations are synchronized at the same time which can potentially lead to high current spikes (and hence, higher EMI), asynchronous-logic digital operations are distributed across time, resulting in a smaller rate of change in current (and hence lower EMI).
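The EMI argument above can be illustrated with a toy model in which every operation draws one unit of current for one time slot. This is only a sketch under that simplifying assumption; the helper names are hypothetical and the slot-to-slot current step is used as a crude stand-in for the rate of change in current.

```python
from collections import Counter


def current_profile(start_slots, horizon):
    """Units of current drawn in each time slot (one unit per operation)."""
    counts = Counter(start_slots)
    return [counts[t] for t in range(horizon)]


def max_current_step(profile):
    """Largest slot-to-slot change in current, with zero current before and after."""
    padded = [0] + list(profile) + [0]
    return max(abs(b - a) for a, b in zip(padded, padded[1:]))


# Eight operations triggered together on one clock edge.
clocked = current_profile([0] * 8, horizon=8)

# The same eight operations distributed across time.
spread = current_profile(range(8), horizon=8)

print("clocked:", clocked, "max step:", max_current_step(clocked))
print("spread: ", spread, "max step:", max_current_step(spread))
```

Both profiles deliver the same total charge in this model; only the sharpness of the current change differs.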
Details of asynchronous-logic circuits and design methodology can be found in J. Sparso et al. [6]. In particular, FIG. 2 shows the categorization of design techniques for implementing digital circuits with these techniques being classified into synchronous-logic-based and asynchronous-logic-based techniques at the highest level, and with the asynchronous-logic-based techniques being further classified according to the class of asynchronous-logic approach they belong to. In general, there are three classes of asynchronous-logic approaches comprising (1) the delay-insensitive approach in the first class, (2) the quasi-delay-insensitive (QDI) and speed-independent approaches in the second class, and (3) the matched-delay approach in the third class. These approaches are elaborated below.
The delay-insensitive approach requires the digital circuits to adhere to a strict delay property. Although the resulting delay-insensitive circuits can operate perfectly even in the presence of gate and/or wire delays, it is difficult to realize such circuits. As a result, delay-insensitive circuits generally comprise only C-Muller circuits. Hence, the delay-insensitive approach is impractical.
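For readers unfamiliar with the C-Muller circuit (often called a Muller C-element), its behaviour can be sketched as follows. This is a behavioural model of the standard two-input element, assuming ideal zero-delay switching; the class name is hypothetical and the model is not a circuit realization from this document.

```python
class CElement:
    """Behavioural model of a two-input Muller C-element."""

    def __init__(self, initial=0):
        self.out = initial

    def update(self, a, b):
        """Output follows the inputs when they agree; otherwise it holds its value."""
        if a == b:
            self.out = a
        return self.out


c = CElement()
for a, b in [(1, 0), (1, 1), (0, 1), (0, 0)]:
    print(f"a={a} b={b} -> out={c.update(a, b)}")
```

The hold behaviour is what lets the element wait for all of its inputs, regardless of how long each one takes to arrive.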
The matched-delay approach is in some sense similar to the synchronous-logic approach in that timing assumptions are required and “pessimistic” design practices with large safety timing margins have to be adopted to ensure robust operations. In particular, the matched-delay approach works by placing bounds on wire and/or gate delays so as to match the delay of delay lines to that of associated combinational circuits. However, it is often difficult to achieve a good match between the aforementioned delays due to PVT variations in the digital circuits and systems. Hence, it is difficult to achieve operational robustness in matched-delay circuits without adopting the “pessimistic” design practices.
The speed-independent and QDI approaches are grouped together under one class as they have similar self-detection mechanisms. Theoretically, both speed-independent circuits and QDI circuits can achieve operational robustness even in the presence of gate delays in the circuits. However, the speed-independent approach works based on the assumption that all wire delays are negligible. With current nano-scaled fabrication processes, this is an unrealistic assumption. On the other hand, QDI circuits work by innately detecting computational delays that arise due to different workloads and operating conditions. This helps in accommodating the PVT variations, thereby achieving design simplicity and increasing operational robustness. The only timing assumption in the QDI approach is the “isochronic forks” assumption, that is, branched wires from a wire node are assumed to have the same wire delays. Such a timing assumption can be fulfilled in practice. Therefore, as compared to the other asynchronous-logic approaches, the QDI approach is probably the most suitable approach for today's applications to innately address PVT variations.
Operation of a QDI Circuit
The following provides a brief overview of the operation of a QDI circuit.
A QDI circuit usually uses dual-rail data encoding in which two wires (or rails) are used to encode a data signal. Table II shows this dual-rail data encoding.
TABLE II
                   | D.T (first rail) | D.F (second rail)
Valid ‘0’          | 0                | 1
Valid ‘1’          | 1                | 0
Null (‘0’ reset)   | 0                | 0
Null (‘1’ reset)   | 1                | 1
In particular, the first and second rails respectively represent dual-rail data D.T and D.F. When both rails are in the same logic states (either both D.T and D.F are at logic ‘0’ for the ‘0’ reset encoding or both D.T and D.F are at logic ‘1’ for the ‘1’ reset encoding), the data signal the rails encode is considered “null” or in other words, “empty”. Conversely, when the rails are in opposite logic states (i.e. D.T is at logic ‘1’ while D.F is at logic ‘0’, or D.T is at logic ‘0’ while D.F is at logic ‘1’), the data signal is considered “valid”. In particular, D.T at logic ‘1’ and D.F at logic ‘0’ encodes a valid ‘1’ signal, whereas D.T at logic ‘0’ and D.F at logic ‘1’ encodes a valid ‘0’ signal.
Note that in this document, the dual-rail data D.T, D.F are considered “empty” when they are at logic states indicating that the data signal is “empty” (i.e. when D.T=‘0’, D.F=‘0’ for the ‘0’ reset encoding or when D.T=‘1’, D.F=‘1’ for the ‘1’ reset encoding). When any one of the dual-rail data D.T, D.F is asserted indicating either a valid ‘0’ signal or a valid ‘1’ signal (i.e. when D.T is at logic ‘1’ and D.F is at logic ‘0’, or when D.T is at logic ‘0’ and D.F is at logic ‘1’), the dual-rail data D.T, D.F are considered “valid”.
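The dual-rail encoding of Table II can be sketched as a small lookup. The function and variable names below are hypothetical and chosen only for illustration; the mapping itself follows Table II.

```python
# Table II as a lookup: (D.T, D.F) -> (status, value).
DUAL_RAIL_TABLE = {
    (0, 1): ("valid", 0),
    (1, 0): ("valid", 1),
    (0, 0): ("null", None),  # '0' reset encoding
    (1, 1): ("null", None),  # '1' reset encoding
}


def decode(d_t, d_f):
    """Classify a rail pair according to Table II."""
    return DUAL_RAIL_TABLE[(d_t, d_f)]


def encode(bit, reset_level=0):
    """Return (D.T, D.F) for a valid bit, or the null pair when bit is None."""
    if bit is None:
        return (reset_level, reset_level)
    if bit not in (0, 1):
        raise ValueError("bit must be 0, 1 or None")
    return (1, 0) if bit == 1 else (0, 1)


def is_valid(d_t, d_f):
    """A rail pair is valid exactly when the two rails are in opposite states."""
    return d_t != d_f


print(decode(*encode(1)))
print(decode(*encode(None, reset_level=1)))
```

Note that validity can be detected from the rails alone, without any clock, since a valid pair is simply one whose rails differ.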
In general, a QDI circuit is configured to receive dual-rail input signals encoding a logic input and provide dual-rail output signals encoding a logic output. The QDI circuit is also configured to operate either in an initialization mode or in an active mode, and in the active mode, is further configured to alternate between a reset state (which the circuit enters after performing a reset operation) and an evaluate state (in which the circuit performs an evaluation operation). Basically, in the initialization mode, a QDI circuit is in a pre-defined condition having the same output signaling as when it is in the reset state in the active mode. The QDI circuit enters the initialization mode only once after a global reset of the system (i.e. after the entire system, including the QDI circuit and other logic gates, is initialized). In the active mode, the QDI circuit is switched from the reset state to the evaluate state upon detection of a valid logic input, and is switched from the evaluate state to the reset state upon detection of an empty logic input. Usually, the alternating of the QDI circuit is not just based on the logic input but is further based on one or more handshake signals. These handshake signals may in turn be based on the logic input and/or output of the QDI circuit, or that of one or more adjoining QDI circuits. Thus, dual rails encoding each data signal in a QDI circuit can be said to not only encode the state of the data signal but also carry timing information to control the alternating of the QDI circuit between the two states. With this, the commencement and completion of operations in QDI circuits can be easily detected.
A more specific description of how a QDI circuit operates is provided below. The QDI circuit may first be initialized by a global reset to the initialization mode. In the initialization mode, the logic input is empty. The QDI circuit remains in the initialization mode until the global reset is released, and thereafter, the QDI circuit enters the active mode. In the active mode, the QDI circuit performs two operations—an evaluation operation in the evaluate state and a reset operation to return to the reset state. Initially (upon the release of the global reset), the QDI circuit is in the reset state. Upon receiving a valid logic input (and when the handshake signal(s) indicate that the QDI circuit is ready for the evaluation operation), the QDI circuit enters the evaluate state and performs the evaluation operation on the valid logic input to produce a valid logic output. When the logic input becomes empty again (and when the handshake signal(s) indicate that the QDI circuit is ready for the reset operation), the reset operation is performed for the QDI circuit to return to the reset state.
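The sequence described above can be sketched as a small state machine. This is a behavioural illustration only, assuming the ‘0’ reset encoding, a simple buffer function (output equals input) and a single boolean standing in for the handshake signal(s); the class and method names are hypothetical.

```python
NULL = (0, 0)  # empty dual-rail pair under the '0' reset encoding (assumed here)


class QdiBufferModel:
    """Behavioural sketch of a QDI buffer: init -> reset <-> evaluate."""

    def __init__(self):
        self.state = "init"  # initialization mode, held by the global reset
        self.output = NULL

    def release_global_reset(self):
        """Leave the initialization mode and enter the active mode in the reset state."""
        if self.state == "init":
            self.state = "reset"

    def step(self, d_t, d_f, ready=True):
        """Apply a dual-rail input; 'ready' stands in for the handshake signal(s)."""
        valid = d_t != d_f
        if self.state == "reset" and valid and ready:
            self.state = "evaluate"
            self.output = (d_t, d_f)  # evaluation: the buffer copies its input
        elif self.state == "evaluate" and not valid and ready:
            self.state = "reset"
            self.output = NULL  # reset operation: the output becomes empty
        return self.output


cell = QdiBufferModel()
cell.release_global_reset()
print(cell.step(1, 0), cell.state)               # valid '1' arrives
print(cell.step(0, 0, ready=False), cell.state)  # input empty, handshake not ready
print(cell.step(0, 0), cell.state)               # input empty, handshake ready
```

The model holds its output while the handshake is not ready, mirroring how the circuit only resets once both the empty input and the handshake condition are present.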
Pipeline Structures in which QDI Circuits can be Adopted
As shown in FIG. 2, QDI approaches can be further classified based on the pipeline structures they are applicable to. A pipeline structure generally comprises a Datapath and a Controller, whereby the Datapath allows the flow of data through the pipeline to perform operations and the Controller controls this flow of data.
In general, there are two asynchronous-logic pipeline structures in which QDI circuits can be adopted—the Data-Control Decomposition pipeline structure and the Integrated-Latch pipeline structure. These structures differ from each other in that in the Data-Control Decomposition pipeline structure, the Controller and Datapath are separated whereas in the Integrated-Latch pipeline structure, the Controller and Datapath are integrated. This is elaborated below with reference to FIGS. 3 and 4.
In particular, FIG. 3 shows a block diagram of the Data-Control Decomposition pipeline structure in which the Controller (QDI controller circuit comprising the asynchronous-logic controllers including latches, latch controller and input completion detection circuits (ICD)) is separated from the Datapath (QDI circuits). The logic input is indicated as Input and is in the dual rail format. Upon detecting that the logic input is valid, the circuit of FIG. 3 generates a logic output shown as Output in the dual rail format, and a signal Lack which indicates that the signal is valid. The signal Lack is passed to the cell of the previous pipeline to act as Rack for that cell. The circuit continues to hold the logic output, Output. When a handshake signal Rack is received, it indicates that Output has been consumed by the succeeding pipeline and the circuit can stop holding the logic output, Output. The circuit of FIG. 3 allows the Controller and the Datapath to be designed independently and in turn allows a simpler realization of the pipeline. However, pipelines based on this structure tend to be slow (or speed-inefficient) as the grouping of many QDI circuits together results in a long critical delay path.
Examples of QDI approaches applicable to the Data-Control Decomposition pipeline structure include the Delay-Insensitive Minterm Synthesis (DIMS) approach, NULL Convention Logic (NCL) approach, Pre-charged Static Logic (PSCL) approach and those using a combination of these aforementioned approaches. More details on the Data-Control Decomposition pipeline structure and the QDI realizations for this pipeline structure can be found in references [2], [3], [6], [17] and [18].
In contrast, the Integrated-Latch pipeline structure integrates the Controller and the Datapath by incorporating an asynchronous-logic controller into each QDI circuit (logic cell) to form a micro-cell level pipeline circuit. The resulting QDI circuit may be referred to as an “Integrated-Latch QDI circuit”. FIG. 4 shows an example of such an “Integrated-Latch QDI circuit” with its generic interface signals. The terms Input, Output, Lack and Rack have the same meaning as in FIG. 3. As compared to a pipeline based on the Data-Control Decomposition pipeline structure, a pipeline based on the Integrated-Latch pipeline structure has a shorter critical delay path and therefore, operates faster. In fact, depending on the logic depth within the pipeline, the speed of a pipeline based on the Integrated-Latch pipeline structure can be 10×-100× higher (in terms of throughput rate) than that of a pipeline based on the Data-Control Decomposition pipeline structure. In an Integrated-Latch QDI pipeline, besides detecting the commencement and completion of operations in each QDI circuit, it is also necessary to address the “input completeness” issue and the “gate orphan” issue to preserve the quasi-delay-insensitivity attribute of the pipeline. More specifically, the “input completeness” issue refers to the need for all inputs to each QDI circuit to be acknowledged before a new pipeline operation is commenced, whereas the “gate orphan” issue refers to the need to avoid occurrences of “gate orphans” (a “gate orphan” occurs when an internal gate is enabled to switch its output but this switching is masked from the observable outputs of the entire circuit).
An example QDI approach applicable to the Integrated-Latch pipeline structure is the Pre-Charged Half Buffers (PCHB) approach. FIG. 5 shows a buffer cell implemented based on the PCHB approach. In particular, the buffer cell in FIG. 5 receives dual-rail input signals A.T, A.F, provides dual-rail output signals Q.T, Q.F and operates using the left- and right-channel handshake signals Lack, Rack. Furthermore, the buffer cell comprises an “Input detection” circuit 502 for addressing the “input completeness” issue as mentioned above. In particular, this “Input detection” circuit 502 comprises an OR gate configured to receive the input signals A.T and A.F. Furthermore, the buffer cell in FIG. 5 is designed such that no “gate orphan” is observed. Having addressed the “input completeness” and “gate orphan” issues, the buffer cell can thus achieve robust data synchronization (see references [6] and [7]). Further, the buffer cell in FIG. 5 has a forward latency of two transitions, i.e. a first transition to discharge either S.T or S.F to ‘0’, and a corresponding second transition to charge either Q.T or Q.F to ‘1’.
Although PCHB circuits (or cells) are more advantageous than DIMS, NCL and PSCL circuits (or cells) as they are designed to implement the Integrated-Latch pipeline structure, the PCHB cells tend to suffer from large circuit and power overheads. There are other approaches such as the PS0, LP2/1, Single-Track Asynchronous Pulse Logic (STAPL), Single-Track Full Buffer (STFB) and Sense-Amplifier Pass Transistor Logic (SAPTL) approaches that are also applicable to the Integrated-Latch pipeline structure. However, these approaches are not fully QDI as they require further timing assumptions on top of the “isochronic forks” assumption. This is because the circuit realizations of these approaches do not fully address the “input completeness” and/or “gate orphan” issues, hence the circuits require some further timing assumptions to achieve conditional error-free operations. Therefore, circuits based on these approaches are not as operationally robust as those based on fully QDI approaches. Further, similar to the PCHB circuit, the circuits for the PS0, LP2/1, STAPL, STFB and SAPTL approaches also have large circuit overheads. More details of the asynchronous-logic Integrated-Latch pipeline structure and the associated QDI realizations can be found in references [2], [4], [7] and [17]-[19].
In view of the above, it can be said that even though the asynchronous-logic approach is in some ways more advantageous than the synchronous-logic approach, the asynchronous-logic approach still suffers from many problems. For example, QDI digital circuits, such as the PCHB circuit, still suffer from high power dissipation (partly due to the dual-rail encoding) and large IC area requirements. Therefore, similar to current design techniques based on the synchronous-logic approach, current design techniques based on the asynchronous-logic approaches, including the QDI approach, are also unsatisfactory in simultaneously achieving operational robustness, high speed, low power dissipation and low EMI.