In synchronous electronic circuits, which rely on clocks and timing circuitry, all data is synchronized by a global circuit clock. Storage elements (e.g., latches or flip-flops) are inserted between combinational logic blocks and capture the data once per clock period, thereby synchronizing data and control signals among the different circuit elements. In asynchronous circuits, synchronization is instead achieved through handshaking protocols that coordinate the exchange of data between circuit elements. There are many styles of asynchronous design libraries and flows, and almost each one has a different handshaking mechanism associated with it.
In handshaking protocols implemented for asynchronous circuits, the handshake between two asynchronous units exchanging data often starts with the unit where the data originates sending a request to the receiver. Typically the request is sent when the data is ready and, depending on the protocol, it can be part of the data or a separate control signal. The receiver then has to acknowledge receipt of the data. The transmitting module then knows that the data has been consumed and can reset its value in order to be ready to process the next set of data. This request-acknowledgment exchange can be performed in several different ways, and handshaking protocols can be classified according to the nature of this exchange.
There are two distinct kinds of protocols commonly used for asynchronous circuits: the 2-phase and the 4-phase protocol. In the 4-phase protocol, the sender asserts its request (REQ) to inform the receiving element that it holds valid data on its output. The receiving element then receives the data when it is ready to consume it and raises the acknowledgment (ACK) signal when it has actually done so. The sender then resets its REQ signal, and after that the receiver lowers its ACK signal. The second pair of transitions can also be used to explicitly identify a data reset phase. The 2-phase protocol uses only two active transitions to complete the communication handshake, so all transitions of the REQ/ACK signals are used in the same way, whether rising or falling. During the first cycle the sender raises REQ and the receiver then raises ACK to finish the handshake. Instead of resetting the signals before the second communication, the sender lowers REQ to start the next transfer, and the receiver then lowers ACK to acknowledge the data. The request and acknowledgment signals can be individual signals or they can be implemented across the same wire. The latter is also known as single-track communication.
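The difference between the two protocols can be seen by listing the signal transitions of each transfer. The following is a minimal sketch in Python, not taken from any particular design flow; the function names and the event representation are assumptions made purely for illustration.

```python
def four_phase_transfer(req, ack):
    """Return the (signal, new_level) events of one 4-phase push transfer.

    Both wires must start low, and they are low again when the transfer ends.
    """
    assert (req, ack) == (0, 0), "a 4-phase transfer starts from the idle state"
    return [("REQ", 1), ("ACK", 1), ("REQ", 0), ("ACK", 0)]


def two_phase_transfer(req, ack):
    """Return the events of one 2-phase transfer plus the new wire levels.

    Every transition is an active event, so each wire is toggled exactly once.
    """
    assert req == ack, "a 2-phase channel is idle when REQ equals ACK"
    new_req = 1 - req  # sender toggles REQ to start the transfer
    new_ack = 1 - ack  # receiver toggles ACK to acknowledge it
    return [("REQ", new_req), ("ACK", new_ack)], new_req, new_ack


if __name__ == "__main__":
    print("4-phase:", four_phase_transfer(0, 0))
    req = ack = 0
    for n in range(2):
        events, req, ack = two_phase_transfer(req, ack)
        print("2-phase transfer", n, events)
```

In this sketch one 4-phase transfer costs four transitions, while two consecutive 2-phase transfers together cost the same four transitions, the second transfer using the falling edges.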
The basic forms described above are for point-to-point communications between two adjacent units, and the communication cycle is always initiated by the sender. When the sender initiates the protocol, the channel is considered a push channel; push channels are common in pipelined circuits. In other, non-pipelined circuits, however, the receiver first signals that it is ready before the sender produces any data. This is known as a pull channel: the initial request is sent by the receiver, in the reverse direction of the data flow. For example, an adaptation of the 4-phase protocol described previously for push channels can be used for pull channel communications. The receiver asserts the REQ signal to indicate that it is ready to accept data. When the sender has computed the data and put it on the channel, it asserts its ACK signal. The receiver then lowers its REQ signal as soon as it has consumed the data. Finally, the sender lowers its ACK signal after it has reset the data, and the channel is then ready for the next transmission.
All the examples stated up to this point are examples of point-to-point communications. This means that the sender sends a signal to indicate the presence of data and releases the data when that signal gets acknowledged. Another quite interesting case is called enclosed communication. It is defined as the case where the REQ signal is asserted and then followed by an entire handshake from the receiver side (meaning the ACK is both asserted and de-asserted) before the REQ signal gets de-asserted. This type of behavior might not make a difference in a typical push pipelined channel; however, its usefulness becomes apparent when considering cases where sequential rather than concurrent actions are desired. Assume that the sender generates data and that multiple receivers are then going to perform sequential actions based on this data. The REQ signal can be asserted to validate the data on the sender side. The receivers can then take turns operating on the data while the REQ signal stays high, validating its presence. When the last of the receivers is done processing, the sender can lower the REQ signal and reset the data. Additionally, some or all of these processes may also operate on the data with some level of concurrency.
Asynchronous channels can also be classified by the way that the data is encoded on the channel. The encoding that is closest to typical synchronous designs is called bundled data. In bundled data, the data is presented as a bus of single-rail wires from the sender to the receiver. This has the benefit that only one wire per signal is required and that the signals can be generated by single-rail combinational blocks just like those used for synchronous design. However, there is no way to identify that the data is valid on the receiver end by just observing the data rails, so the designer has to make sure that all the data is valid before the REQ signal becomes visible to the receiver. For this reason the REQ path has to be delay matched with the slowest combinational path between sender and receiver, and this task is not trivial. Post-layout simulation is typically required to ensure the functionality of the circuit.
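The delay-matching constraint for bundled data can be stated as a simple inequality: the REQ delay must be at least as large as the slowest data path delay, plus whatever safety margin the designer chooses. The sketch below is only an illustration of that check; the function name, the `margin` parameter and the delay values are hypothetical and do not come from any real tool or design.

```python
def req_delay_is_matched(data_path_delays, req_delay, margin=0.0):
    """Check the bundled-data timing assumption.

    data_path_delays: delays of all combinational paths between sender
                      and receiver (any consistent time unit).
    req_delay:        delay of the matched REQ path.
    margin:           extra safety margin (hypothetical parameter).
    """
    slowest = max(data_path_delays)
    return req_delay >= slowest + margin


if __name__ == "__main__":
    delays = [1.2, 0.8, 1.5]  # made-up path delays
    print(req_delay_is_matched(delays, req_delay=1.6))  # REQ slower than data
    print(req_delay_is_matched(delays, req_delay=1.4))  # REQ would arrive early
```

In practice the delays fed to such a check would have to come from post-layout extraction, which is why the paragraph above notes that post-layout simulation is typically required.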
Another way to encode data on a channel is by making it dual-rail. If the dual-rail signals are reset between transitions, it is easy to verify the presence of the data from the data itself by checking that at least one of the two wires representing the data has been asserted. In this case an explicit REQ line is not necessary for the data, as a simple OR of the two signals verifies that the data is present. Dual-rail signals can also be grouped together in buses, as in bundled data. If there is no explicit REQ as there is with bundled-data rails, the individual OR results from each signal have to be combined to generate the global REQ signal for the bus. When a single bit is transferred, only one gate delay is added to the critical path, but in the latter case the impact of such a circuit on performance can be significant, since it can amount to several gate delays.
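The validity check described above can be modeled in a few lines. This is a behavioral sketch only: the helper names are assumptions, and the bus-level combination is modeled with a plain logical AND of the per-bit OR results, which captures only the "data present" direction and not the reset behavior of a real completion detector.

```python
RESET = (0, 0)  # both rails low between transfers


def encode_dual_rail(bit):
    """Map a logical bit to (true_rail, false_rail); exactly one rail is high."""
    return (1, 0) if bit else (0, 1)


def bit_valid(rails):
    """A dual-rail bit is present when either of its two rails is asserted."""
    true_rail, false_rail = rails
    return bool(true_rail or false_rail)


def bus_valid(bus):
    """Bus-level validity: the OR result of every bit must be high."""
    return all(bit_valid(rails) for rails in bus)


if __name__ == "__main__":
    complete = [encode_dual_rail(b) for b in (1, 0, 1)]
    partial = [encode_dual_rail(1), RESET, encode_dual_rail(1)]
    print(bus_valid(complete))  # every bit has arrived
    print(bus_valid(partial))   # the middle bit is still in reset
```

The per-bit check is a single gate, while the bus-level combination grows with the bus width, which is the source of the extra gate delays mentioned above.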
A more generalized form of dual-rail signaling is 1-of-N signaling. Here, every group of n wires can transmit log2(n) bits, and only one of the n wires is asserted at a time. This encoding has several benefits. Just as with dual-rail signaling, there is no need for an explicit REQ signal, since the presence of data can be extracted from the data itself (again assuming that the data is reset between transmissions). For wide data paths, the signals have to be broken up into smaller groups.
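As an illustration of 1-of-N signaling, the following sketch encodes a value onto n wires and recovers both the value and its validity from the wires alone. The function names are hypothetical, and the all-zero pattern is assumed to be the reset state between transmissions, as described above.

```python
import math


def encode_one_of_n(value, n):
    """Assert exactly one of n wires; the index of that wire is the value."""
    if not 0 <= value < n:
        raise ValueError("value must be in range(n)")
    wires = [0] * n
    wires[value] = 1
    return wires


def decode_one_of_n(wires):
    """Return the encoded value, or None while the wires are still in reset."""
    asserted = [i for i, w in enumerate(wires) if w]
    if not asserted:
        return None  # no data present yet
    if len(asserted) > 1:
        raise ValueError("illegal code word: more than one wire asserted")
    return asserted[0]


def bits_per_group(n):
    """Number of bits carried by one 1-of-n group."""
    return math.log2(n)


if __name__ == "__main__":
    wires = encode_one_of_n(2, 4)
    print(wires, decode_one_of_n(wires), bits_per_group(4))
```

With n = 2 this reduces to the dual-rail case, and a 1-of-4 group carries two bits on four wires while only one wire transitions per transfer.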
Another classifying characteristic of asynchronous communication channels is the type of timing assumptions that are required to hold for a particular protocol to operate correctly. In terms of the actual design process, the fewer timing assumptions that exist in a design the better, since timing assumptions usually have to be verified through simulations that have to be performed both pre- and post-layout. The first timing model is one in which all delays, both gate and wire, are allowed to assume any value, and the circuit is still guaranteed to function properly. This model is called delay insensitive (DI), and it is the most robust model for asynchronous circuits.
Another category of circuits is speed-independent (SI) circuits. In speed-independent circuits, gates can have arbitrary delays, but wire delays are considered negligible. This makes all forks isochronic, hence the quasi-delay-insensitive (QDI) protocol requirement stands by default. With process geometries constantly shrinking, though, wire delays become a more and more dominant part of a path delay, so this assumption has to be checked against the real delays, which need to be determined post-layout, and the functionality of the circuit has to be verified again through simulation.
Scalable Delay Insensitive (SDI) is an approach that partitions the design into smaller parts and thereby attempts to bridge the gap between DI and SI. Within each sub-module, the design is performed by bounding the ratio of delays between paths by a constant. It also defines a ratio relating the estimated and observed delay data that is likewise bounded from below and above. The same constant is used as a bound for both expressions. After each individual module is designed, the interconnections at the top level are designed based on DI assumptions.
Asynchronous Designs
PCFB and PCHB: The Pre-Charge Half Buffer (PCHB) and Pre-Charge Full Buffer (PCFB) are two examples of QDI templates. Both templates are similar, but PCFB uses an extra internal state variable so that it is able to store one token per stage, and that is why it is called a Full Buffer. A PCHB, on the other hand, is a half buffer, meaning that one token can exist in two adjacent pipeline stages. The templates are designed for fine-grain pipelining, which implies that each pipeline stage is one gate deep. The data is encoded using 1-of-N encoding, and thus there is no explicit request line associated with the data. Each gate has an input completion detection unit and an output completion detection unit. FIG. 1 includes two views depicting two prior art quasi-delay insensitive (“QDI”) asynchronous circuit templates: (a) a pre-charged half buffer (“PCHB”) 100A, and (b) a pre-charged full buffer (“PCFB”) 100B, respectively.
The function blocks can be designed using dynamic logic (e.g., Domino logic) in order to reduce the size of the circuit. Another interesting property is that the function block can actually evaluate even if not all inputs are present yet. If the function allows it, the function block can generate an output from a subset of the inputs, and data can propagate forward along the pipeline. However, the C-element will not send an acknowledgment to the left environment until all inputs have arrived and the output has been generated. That prevents premature acknowledgments from propagating backwards to units that have not even produced data yet. The output (right) completion detection unit (RCD) is used to detect that data has indeed been generated by the function block. In the PCHB, when the input (left) completion detection unit (LCD) and the RCD have detected valid data on both input and output, the function block gets disabled. When the next stage in the pipeline acknowledges the outputs of the current stage, the function block is pre-charged to be ready to receive the next set of data.
The LCD and RCD operate on 1-of-N encoded channels. For a dual-rail (1-of-2) channel, their operation is simply an OR of the two wires. The data is reset to zero during pre-charge; therefore, the presence of data is detected when one of the two wires produces a logic 1. If multiple channels exist, the results of the OR from each channel have to be combined through C-elements to produce the output of the LCD/RCD. Even though this is a simple operation, one has to remember that this is a fine-grain pipeline design style. For multi-input gates the control logic quickly becomes a large overhead, and as a result these templates are not area efficient. Also, even though the cells use dynamic logic for smaller size and better performance, there are several levels of control involved in the critical path. With PCHB being a half buffer, the cycle time involves multiple levels of logic as well as a completion detection unit and a C-element. Its cycle time varies depending on the functional block, but is generally between 14 and 18 transitions. The PCFB is a full-buffer version of PCHB. It has the same cycle time as PCHB, so its only benefit would be slack capacity. For this reason the PCFB is not as widely used as the PCHB design style. Even though this yields good overall performance, there are design styles available that have much smaller cycle times.
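The role of the C-element in combining per-channel completion signals can be illustrated with a small behavioral model. A C-element drives its output high only when all inputs are high, drives it low only when all inputs are low, and otherwise holds its previous value. The class below is a sketch under that standard definition; the class and method names are assumptions, and no gate delays are modeled.

```python
class CElement:
    """Behavioral model of a Muller C-element with any number of inputs."""

    def __init__(self, initial=0):
        self.out = initial

    def update(self, *inputs):
        """Apply new input levels and return the (possibly held) output."""
        if all(inputs):
            self.out = 1  # every channel reports valid data
        elif not any(inputs):
            self.out = 0  # every channel has been reset
        # otherwise the inputs disagree and the previous output is held
        return self.out


if __name__ == "__main__":
    done = CElement()
    # Each input stands for the OR result of one dual-rail channel.
    for channel_valid in [(1, 0), (1, 1), (0, 1), (0, 0)]:
        print(channel_valid, "->", done.update(*channel_valid))
```

Because the output changes only when all channels agree, the combined signal indicates both that every channel holds data and, later, that every channel has returned to the reset state.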
MOUSETRAP: MOUSETRAP is a recently proposed design style. It is a bundled-data protocol with 2-phase control and can be used for both very fine-grain and coarser pipeline design. It has a very small cycle time of 5 transitions for a FIFO design, and although the cycle time increases when merges, forks, and logic are added, it still has the potential for very high throughput implementations. FIG. 2 depicts a basic diagram of a prior art MOUSETRAP design template 200 for a FIFO pipeline.
MLD: Multi-Level Domino is another design style that also uses bundles of wires; here, however, the data is encoded using differential encoding. The data path is constructed out of domino-logic gates in order to be more area efficient as well as faster. This also allows the circuit to generate a request to the next stage based on the data itself. A completion detection unit exists for each output, and all the validity signals are then combined through an AND gate tree to generate the valid flag for the entire pipeline stage. The style is targeted more towards medium-grain pipelining, and several layers of logic and many data paths in parallel are typically used in a single pipeline stage. This yields a small overhead from the addition of the pipeline stage control units and hence an area-efficient design.
FIG. 3 depicts a prior art multi-level domino (“MLD”) pipeline asynchronous circuit template 300. Even though there are differences between the variants in terms of the handshaking mechanism of the controllers and the generation of control signals, abstractly the general form of these styles can be illustrated in FIG. 3.
The cycle of a pipeline stage starts with the dynamic logic gates receiving data from the previous stage and evaluating their outputs. When the data propagates to the last level of gates in the pipeline stage, the outputs for the stage are generated and the dual-rail signals are used to validate that all outputs are present. The valid signal is generated for the entire stage and is used as a request to the next stage. It can also be used internally in the stage for isolating the outputs and initiating an early pre-charge of the logic before the final level. When the next stage acknowledges the data, the stage resets its outputs to all zero so that the valid signal is forced low. The data path is connected normally, just as in the case of a synchronous netlist. Any forking or merging between stages is handled by the controller circuits. That can be accomplished by inserting C-elements for the requests of signals reaching a merge and for the acknowledgment signals departing a fork. The introduction of such elements might impact the cycle time of a stage, but since the data path is several levels of logic deep, this extra delay can be offset by reducing the number of logic levels in a particular stage.
STFB and SSTFB: Single-Track Full Buffer is a design style for fine-grain pipeline design that uses 1-of-N encoding for the data and also 2-phase single-track handshaking between gates that is embedded in the data. It has been shown to yield very high throughput designs. There are several features of this design style that contribute to its high performance capabilities. Firstly the gates use dynamic logic internally for higher performance and reduced area. Secondly the gates have extremely small forward latency of 2 transitions and a total cycle time of 6 transitions. That is accomplished by embedding the control signals as part of the data path and the use of 2-phase handshaking.
In STFB the sender receives data, evaluates its output, and then immediately tri-states its output. The receiver detects the presence of data and evaluates only when all the data has been received. This is done by properly designing the stacks of NMOS transistors so that all paths to ground use all inputs. When the receiver evaluates its outputs, it actively drives the wires low and then tri-states the inputs. This signals the sender that the data has been consumed and that it can evaluate the next set of data. The data is encoded in a 1-of-N fashion; therefore, for each communication only one wire in the set will transition. This wire is therefore used simultaneously for the data, request, and acknowledgment signaling between the two cells. A problem with this template is that the data wires are not actively driven at all times. There are times when both transmitter and receiver are in tri-state mode, so the data becomes more susceptible to noise and leakage. Staticizers can be used to help alleviate this problem.
Local and Global Cycle Time: In the absence of a global clock, asynchronous circuit performance is often characterized using different metrics. When characterizing an asynchronous pipeline stage (which could be as small as a single cell/gate for micro-pipelines), there are two important performance metrics. The first one is forward latency (“FL”), measured as the time between the arrival of a new token, when the pipeline stage is idle, and the production of valid outputs for the next stage. This metric depends only on the internal design of the pipeline stage. The second metric is called the local cycle time (“LCT”), and it is defined as the time between the arrival of a token and the time that the unit has reset itself back to the idle state and is ready to receive the next token. This number is generally affected by the next pipeline stages as well, since the handshaking on the right side of the stage defines the time at which the stage can reset its output and proceed to get ready to accept new data. Both metrics are calculated during the design phase in terms of transitions, meaning the number of signal transitions that have to take place for the pipeline stage to move from one state to the next. Even though this does not translate directly into actual time, it is a useful first tool for tradeoff studies, design style comparison, and performance estimation.
Once the local cycle time and forward latency are known, there are several methods to do a more thorough analysis, find the performance of the entire circuit, and potentially identify the bottlenecks in the system. This is generally a very labor-intensive process that cannot be performed without a tool designed for this purpose, but the basic ideas can be intuitively described using the defined metrics of forward latency and local cycle time. The performance of a circuit is defined as the global cycle time (“GCT”) of the circuit, and it is essentially the metric that defines how many transitions it takes the circuit to process a token on average. Ideally the global cycle time is equal to the maximum of the local cycle time and the algorithmic cycle time (“ACT”). The algorithmic cycle time is the maximum, over all cycles, of the sum of the forward latencies of all the pipeline stages in the cycle divided by the number of tokens (data) that are in the cycle at any time. This is the maximum performance target for a design, and the global cycle time cannot be improved beyond this point. However, the design might have a cycle time that is higher than this value, depending on the topology and the number of tokens in the design.
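The definitions above translate directly into a short calculation. The sketch below computes the algorithmic cycle time and the ideal global cycle time for a hand-written description of a circuit; the function names, the data layout, and all the numbers in the example are hypothetical and chosen only to show the arithmetic.

```python
def algorithmic_cycle_time(cycles):
    """ACT: maximum over all cycles of (sum of forward latencies / tokens).

    cycles: list of (forward_latencies, tokens) pairs, one per cycle in
            the circuit, with latencies measured in transitions.
    """
    return max(sum(latencies) / tokens for latencies, tokens in cycles)


def ideal_global_cycle_time(local_cycle_times, cycles):
    """Ideal GCT: the larger of the worst LCT and the ACT."""
    return max(max(local_cycle_times), algorithmic_cycle_time(cycles))


if __name__ == "__main__":
    # Two made-up cycles: four stages holding one token, and ten stages
    # holding two tokens, every stage with a forward latency of 2.
    cycles = [([2, 2, 2, 2], 1), ([2] * 10, 2)]
    local_cycle_times = [6, 6, 6]
    print(algorithmic_cycle_time(cycles))
    print(ideal_global_cycle_time(local_cycle_times, cycles))
```

In this example the second cycle dominates with 20 / 2 = 10 transitions per token, so the ideal global cycle time is 10 even though every stage has a local cycle time of 6. As the text notes, the achieved cycle time may still be higher than this ideal value.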
The reason that this might happen is that the performance is defined not only by how fast data can propagate down the pipeline, but also by how fast the pipeline resets to accept new tokens. The backward latency (“BL”) of a pipeline stage is defined as the difference between the local cycle time and the forward latency, and it can be perceived as the time it takes for a bubble (an empty position in the pipeline) to propagate backwards in the pipeline. Alternatively, the backward latency can also be defined as the time it takes a node to complete the handshaking with its neighboring cells and reset itself so that the next token can go through.
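As a worked instance of this definition, consider the STFB figures quoted earlier, a forward latency of 2 transitions and a cycle time of 6 transitions, and assume (the text does not state this explicitly) that the quoted cycle time is the local cycle time of a stage:

```latex
\mathrm{BL} = \mathrm{LCT} - \mathrm{FL}
\qquad\Longrightarrow\qquad
\mathrm{BL}_{\mathrm{STFB}} = 6 - 2 = 4 \ \text{transitions}
```

Under that assumption, a bubble would take twice as many transitions to move backwards through such a stage as a token takes to move forwards.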
The forward and backward latencies together define the performance of a local pipeline stage. However, the alignment of the data in the forward direction, as well as the alignment of the bubbles in the backward direction, is also important: it determines whether a requested global cycle time is achievable even when the ACT and all the LCTs are smaller than that requested global cycle time. This alignment between the handshakes of the various stages is called slack matching, as is described in further detail below.
Due to the fact that asynchronous circuits require a handshaking controller for every pipeline stage, which is used to interface to adjacent pipeline stages, the logic overhead of such circuits is large. Moreover, there is a lack of an automated set of tools that would allow a designer to generate a circuit quickly from a behavioral Hardware Description Language (HDL), just like the ASIC flow that has existed for years for synchronous circuits.