1. Field of the Invention
This invention relates to circuits and methods for asynchronous pipeline processing, and more particularly to pipelines providing high buffering and high throughput.
2. Background of the Related Art
There has been increasing demand for pipeline designs capable of multi-GigaHertz throughputs. Several novel synchronous pipelines have been developed for these high-speed applications. For example, in wave pipelining, multiple waves of data are propagated between two latches. However, this approach requires significant design effort, from the architectural level down to the layout level, for accurate balancing of path delays (including data-dependent delays), yet such systems remain highly vulnerable to process, temperature and voltage variations. Other aggressive synchronous approaches include clock-delayed domino, skew-tolerant domino, and self-resetting circuits. These approaches require complex timing constraints and lack elasticity. Moreover, high-speed global clock distribution for these circuits remains a major challenge.
Asynchronous design, which replaces global clocking with local handshaking, has the potential to make high speed design more feasible. Asynchronous pipelines avoid the issues related to the distribution of a high-speed clock, e.g., wasteful clock power and management of clock skew. Moreover, the absence of a global clock imparts a natural elasticity to the pipeline since the number of data items in the pipeline is allowed to vary over time. Finally, the inherent flexibility of asynchronous components allows the pipeline to interface with varied environments operating at different rates; thus, asynchronous pipeline styles are useful for the design of system-on-a-chip.
One prior art pipeline is Williams' PS0 dual-rail asynchronous pipeline (T. Williams, Self-Timed Rings and Their Application to Division, Ph.D. Thesis, Stanford University, June 1991; T. Williams et al., “A Zero-Overhead Self Timed 160ns 54b CMOS Divider,” IEEE JSSC, 26(11):1651-1661, November 1991). FIG. 1 illustrates Williams' PS0 pipeline 10. Each pipeline stage 12a, 12b, 12c is composed of a dual-rail function block 14a, 14b, 14c and a completion detector 16a, 16b, 16c. The completion detectors indicate validity or absence of data at the outputs of the associated function block.
Each function block 14a, 14b, 14c is implemented using dynamic logic. A precharge/evaluate control input, PC, of each stage is tied to the output of the next stage's completion detector. For example, the precharge/evaluate control input, PC, of stage 12a is tied to the completion detector 16b of stage 12b and is passed to function block 14a on line 18a. Since a precharge logic block can hold its data outputs even when its inputs are reset, it also provides the functionality of an implicit latch. Therefore, a PS0 stage has no explicit latch. FIG. 2(a) illustrates how a dual-rail AND gate, for example, would be implemented in dynamic logic; the dual-rail pair, f1 and f0, implements the AND of the dual-rail inputs a1a0 and b1b0.
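As an illustration only, the logical behavior of such a dual-rail AND gate can be sketched in Python. The sketch assumes the conventional dual-rail encoding, which the text does not spell out: the pair (rail1, rail0) is (1, 0) for logic 1, (0, 1) for logic 0, and (0, 0) for a spacer. The names `encode`, `SPACER` and `dual_rail_and` are hypothetical, and the model is purely combinational, so it does not capture the implicit latching of a real dynamic gate.

```python
from typing import Tuple

# Assumed encoding (not stated in the text): (rail1, rail0)
#   (1, 0) -> logic 1, (0, 1) -> logic 0, (0, 0) -> spacer / no data
DualRail = Tuple[int, int]
SPACER: DualRail = (0, 0)


def encode(bit: int) -> DualRail:
    """Encode a single Boolean value as a dual-rail pair."""
    return (1, 0) if bit else (0, 1)


def dual_rail_and(a: DualRail, b: DualRail, precharge: bool) -> DualRail:
    """Logical model of a dual-rail AND function block.

    While precharging, both output rails are reset (spacer).
    While evaluating, f1 rises when both inputs are 1, and
    f0 rises as soon as either input is 0.
    """
    if precharge:
        return SPACER
    a1, a0 = a
    b1, b0 = b
    f1 = a1 & b1  # true rail: both inputs are logic 1
    f0 = a0 | b0  # false rail: at least one input is logic 0
    return (f1, f0)


if __name__ == "__main__":
    for x in (0, 1):
        for y in (0, 1):
            print(x, y, dual_rail_and(encode(x), encode(y), precharge=False))
```

With spacer inputs and evaluation enabled, both output rails stay low in this model, which corresponds to a stage that has been enabled but has not yet received valid data.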
The completion detector 16a, 16b, 16c at each stage 12a, 12b, 12c, respectively, signals the completion of every computation and precharge. Validity, or non-validity, of data outputs is checked by OR'ing the two rails for each individual bit, and then using a C-element to combine all the results (see FIG. 2(a)). A C-element is a basic asynchronous state-holding element. More particularly, the output of an n-input C-element is high when all inputs are high, is low when all inputs are low, and otherwise holds its previous value. It is typically implemented by a CMOS gate with a series stack in both pull-up and pull-down, and an inverter on the output (with a weak feedback inverter attached to maintain state).
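The C-element behavior and the completion-detection scheme described above can be sketched behaviorally as follows. This is a minimal Python model, not a circuit description; the class `CElement` and the function `completion_detect` are hypothetical names, and each bit is assumed to be given as a (rail1, rail0) pair.

```python
from typing import List, Sequence, Tuple


class CElement:
    """Behavioral model of an n-input C-element (n >= 1)."""

    def __init__(self, initial: int = 0) -> None:
        self.out = initial

    def update(self, inputs: Sequence[int]) -> int:
        if not inputs:
            raise ValueError("a C-element needs at least one input")
        if all(inputs):
            self.out = 1      # all inputs high: output goes high
        elif not any(inputs):
            self.out = 0      # all inputs low: output goes low
        # otherwise: inputs disagree, hold the previous output
        return self.out


def completion_detect(c: CElement, bits: List[Tuple[int, int]]) -> int:
    """OR the two rails of every bit, then combine with a C-element.

    Returns 1 once every bit carries valid data, returns 0 once every
    bit has been reset, and holds its previous value in between.
    """
    per_bit_valid = [r1 | r0 for (r1, r0) in bits]
    return c.update(per_bit_valid)


if __name__ == "__main__":
    cd = CElement()
    print(completion_detect(cd, [(1, 0), (0, 0)]))  # one bit still missing
    print(completion_detect(cd, [(1, 0), (0, 1)]))  # all bits valid
    print(completion_detect(cd, [(0, 0), (0, 1)]))  # partially reset
    print(completion_detect(cd, [(0, 0), (0, 0)]))  # fully reset
```

The hold case is what lets a single detector output signal both "evaluation complete" and "precharge complete" without glitching while bits arrive or reset at different times.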
The sequencing of pipeline control for the Williams' PS0 dual-rail pipeline is as follows: Stage N is precharged when stage N+1 finishes evaluation. Stage N evaluates when stage N+1 finishes reset. Actual evaluation will commence only after valid data inputs have also been received from stage N−1. This protocol ensures that consecutive data tokens are always separated by reset tokens or spacers.
The complete cycle of events for a pipeline stage is derived by observing how a single data token flows through an initially empty pipeline. The sequence of events from one evaluation by stage 12a, to the next is: (i) Stage 12a evaluates, then (ii) stage 12b evaluates, then (iii) stage 12b's completion detector 16b detects completion of evaluation, and then (iv) stage 12a precharges. At the same time, after completing step (ii), (iii)′ stage 12c evaluates, then (iv)′ stage 12c's completion detector 16c detects completion of evaluation, and initiates the precharge of stage 12b, then (v) stage 12b precharges, and finally, (vi) stage 12b's completion detector 16b detects completion of precharge, thereby releasing the precharge of stage 12a and enabling stage 12a to evaluate once again. Thus, there are six events in the complete cycle for a stage, from one evaluation to the next.
The complete cycle for a pipeline stage, traced above, consists of 3 evaluations, 2 completion detections and 1 precharge. The analytical pipeline cycle time, $T_{PS0}$, therefore is:
$$T_{PS0} = 3 \cdot t_{Eval} + 2 \cdot t_{CD} + t_{Prech} \qquad (1)$$
where $t_{Eval}$ and $t_{Prech}$ are the evaluation and precharge times for each stage, and $t_{CD}$ is the delay through each completion detector.
The per-stage forward latency, $L$, is defined as the time it takes the first data token, in an initially empty pipeline, to travel from the output of one stage to the output of the next stage. For PS0, the forward latency is simply the evaluation delay of a stage:
$$L_{PS0} = t_{Eval} \qquad (2)$$
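The two expressions can be checked with a short Python sketch that walks the critical path of the cycle and accumulates the delays. The function names are hypothetical, the stages are labeled N, N+1 and N+2 in place of 12a, 12b and 12c, and the delay values in the usage example are made-up numbers in picoseconds, not measurements.

```python
from typing import List, Tuple


def ps0_cycle_time(t_eval: float, t_cd: float, t_prech: float) -> float:
    """Equation (1): 3 evaluations, 2 completion detections, 1 precharge."""
    return 3 * t_eval + 2 * t_cd + t_prech


def ps0_forward_latency(t_eval: float) -> float:
    """Equation (2): per-stage forward latency is one evaluation delay."""
    return t_eval


def ps0_critical_path(t_eval: float, t_cd: float,
                      t_prech: float) -> List[Tuple[str, float]]:
    """Events on the critical path, each with its cumulative time."""
    steps = [
        ("(i) stage N evaluates", t_eval),
        ("(ii) stage N+1 evaluates", t_eval),
        ("(iii)' stage N+2 evaluates", t_eval),
        ("(iv)' detector of stage N+2 detects evaluation", t_cd),
        ("(v) stage N+1 precharges", t_prech),
        ("(vi) detector of stage N+1 detects precharge", t_cd),
    ]
    trace = []
    elapsed = 0.0
    for label, delay in steps:
        elapsed += delay
        trace.append((label, elapsed))
    return trace


if __name__ == "__main__":
    # Hypothetical delays in picoseconds, chosen only for illustration.
    t_eval, t_cd, t_prech = 100, 80, 90
    for label, t in ps0_critical_path(t_eval, t_cd, t_prech):
        print(f"{t:6.0f} ps  {label}")
    print("cycle time:", ps0_cycle_time(t_eval, t_cd, t_prech), "ps")
    print("forward latency:", ps0_forward_latency(t_eval), "ps")
```

The last cumulative time in the trace equals the value returned by `ps0_cycle_time`, which is simply the statement that the six events listed above account for every term of Equation (1).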
A disadvantage of this type of latch-free asynchronous dynamic pipeline (e.g., PS0) is that alternating stages usually must contain “spacers,” or “reset tokens,” limiting the pipeline capacity to 50%. Another disadvantage of the Williams pipeline is that it requires a number of synchronization points between stages. Moreover, Williams' pipeline maintains data integrity by constraining the interaction of pipeline stages, i.e., the precharge and evaluation of a stage are synchronized with specific events in neighboring stages.
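The 50% capacity limit can be illustrated with a highly abstracted Python sketch of the PS0 control rules. It is not a timing-accurate model: each stage is reduced to a single flag (holds data or holds a spacer), the source is assumed to always have a token ready, and the receiver is assumed to be stalled so that the last stage is never precharged. The function name `fill_stalled_ps0` is hypothetical.

```python
from typing import List


def fill_stalled_ps0(num_stages: int, max_sweeps: int = 1000) -> List[bool]:
    """Fill a PS0-style pipeline whose output is never consumed.

    Returns one flag per stage: True = data token, False = spacer.
    Rules applied to stage i (abstracted from the protocol above):
      - precharge (become a spacer) when stage i+1 holds data;
      - otherwise evaluate (become data) when stage i-1 holds data;
      - otherwise hold the current outputs.
    """
    if num_stages < 1:
        raise ValueError("num_stages must be at least 1")
    has_data = [False] * num_stages
    for _ in range(max_sweeps):
        before = list(has_data)
        # Sweep right to left so a stage sees its successor's new state,
        # matching the order "stage i+1 evaluates, then stage i precharges".
        for i in range(num_stages - 1, -1, -1):
            # Assumption: stalled receiver never precharges the last stage.
            next_has_data = has_data[i + 1] if i + 1 < num_stages else False
            # Assumption: the source always offers a token to stage 0.
            prev_has_data = has_data[i - 1] if i > 0 else True
            if next_has_data:
                has_data[i] = False
            elif prev_has_data:
                has_data[i] = True
        if has_data == before:
            break  # no stage changed: the pipeline is full
    return has_data


if __name__ == "__main__":
    state = fill_stalled_ps0(6)
    print("".join("D" if d else "-" for d in state))
    print("tokens stored:", sum(state), "of", len(state), "stages")
```

Under these assumptions the full pipeline settles with data and spacers in alternating stages, so no two adjacent stages hold distinct data tokens at the same time.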
Three recent, competitive asynchronous pipelines provide improved performance but suffer from numerous disadvantages which have been removed by the digital signal processing pipeline apparatus in accordance with the invention.
A design by Renaudin provides high storage capacity (M. Renaudin et al., “New Asynchronous Pipeline Scheme: Application to the Design of a Self-Timed Ring Divider,” IEEE JSSC, 31(7): 1001-1013, July 1996). Renaudin's pipelines achieve 100% capacity without extra latches or “identity stages.” Their approach locally manipulates the internal structure of the dynamic gate in order to provide increased capacity.
However, there are two significant disadvantages of Renaudin's pipelines. First, in Renaudin's pipelines, extra latching is achieved by modifying the output inverter of each dynamic gate into a gated inverter, through the use of additional transistors. A second disadvantage of Renaudin's pipelines is a relatively low throughput. In particular, Renaudin's pipelines are based on a much more conservative form of PS0 pipelines, called PC0. Consequently, their throughput, while an improvement over PC0, is worse than even that of PS0.
The two FIFO designs by Molnar et al.—the asp* FIFO and the micropipelined FIFO—are among the most competitive pipelines presented in literature, with reported throughputs of 1.1 Giga and 1.7 Giga items/second in 0.6 μm CMOS (C. Molnar et al., “Two FIFO Ring Performance Experiments,” Proceedings of the IEEE, 87(2):297-307, February 1999).
Molnar's first FIFO, asp*, has significant drawbacks. When processing logic is added to the pipeline stages, the throughput of the asp* FIFO is expected to significantly degrade relative to the pipeline designs described herein. This performance loss occurs because the asp* FIFO requires explicit latches to separate logic blocks. The latches are essential to the design; they ensure that the protocol will not result in data overruns. As a result, in asp*, with combinational logic distinct from latches, the penalty of logic processing can be significant. In addition, the asp* FIFO has complex timing assumptions which have not been explicitly formalized; in fact, an early version was unstable due to timing issues.
Molnar's second design, the micropipelined FIFO, also has several shortcomings. First, the micropipeline is actually composed of two parallel “half-rate” FIFOs, each providing only half of the total throughput (0.85 Giga items/second); thus, the net throughput of 1.7 Giga items/second is achieved only at a significant cost in area. Second, the micropipelined FIFO uses very expensive transition latches. Another limitation of the micropipelined FIFO is that it cannot perform logic processing at all; i.e., it can only be used as a FIFO. The reason for this restriction is that it uses a complex latch structure in which parts of each latch are shared between adjacent stages. As a result, insertion of logic blocks between latches is not possible.
Among the fastest designs reported in literature are the IPCMOS pipelines, with throughputs of 3.3-4.5 GHz in a 0.18 μm CMOS process (S. Shuster et al., “Asynchronous Interlocked Pipelined CMOS Circuits Operating at 3.3-4.5 GHz,” Proceedings ISSCC, February 2000). IPCMOS has disadvantages at the circuit as well as at the protocol levels. First, IPCMOS uses large and complex control circuits which have significant delays. Second, IPCMOS makes use of extremely aggressive circuit techniques, which require a significant effort of design and verification. For example, one of the gates in their “strobe” circuit potentially may have a short circuit through its pull-up and pull-down stacks, depending on the relative arrival times of inputs to the two stacks from multiple data streams. Their approach relies on a ratioing of the stacks to ensure correct output. Third, in IPCMOS, pipeline stages are enabled for evaluation only after the arrival of valid data inputs. Hence, the forward latency of a stage is poor, because of the delay to precharge-release the stage.
It is an object of the invention to provide high throughput and high storage capacity through decoupling the controls of precharge and evaluation. It is another object to reduce the need for a “reset” spacer between adjacent data tokens to increase storage capacity.
It is an object of the invention to provide an asynchronous pipeline having protocols wherein no explicit latches are required.
It is an object of the invention to provide an asynchronous pipeline having simple one-sided timing constraints, which may be easily satisfied.
It is an object of the invention to provide an asynchronous pipeline having function blocks that may be enabled for evaluation before the arrival of data. Thus, data insertion in an empty pipeline can ripple through each stage in succession.
It is a further object to provide an asynchronous pipeline having high data integrity, wherein a stage may hold its outputs stable irrespective of any changes in its inputs.
It is yet another object of the invention to provide an asynchronous pipeline having reduced critical delays, smaller chip area, lower power consumption, and simple, small and fast control circuits to reduce overhead.
It is another object of the invention to provide an asynchronous pipeline capable of merging multiple input data streams.