1. Field of the Invention
This invention relates to asynchronous pipelines, and more particularly to latchless dynamic asynchronous digital pipelines providing high buffering and high throughput.
2. Background of the Related Art
There has been increasing demand for pipeline designs capable of multi-gigahertz throughputs. Several novel synchronous pipelines have been developed for these high-speed applications. For example, in wave pipelining, multiple waves of data are propagated between two latches. However, this approach requires significant design effort, from the architectural level down to the layout level, for accurate balancing of path delays (including data-dependent delays), yet such systems remain highly vulnerable to process, temperature and voltage variations. Other aggressive synchronous approaches include clock-delayed domino, skew-tolerant domino, and self-resetting circuits. These approaches require complex timing constraints and lack elasticity. Moreover, high-speed global clock distribution for these circuits remains a major challenge. (See, e.g., "Motorola and Theseus Logic to jointly develop clockless ICs," http://motorola.com/SPS/MCORE/pressxe2x80x9419oct99.htm1, October 1999, which is incorporated by reference in its entirety herein.)
Asynchronous design, which replaces global clocking with local handshaking, has the potential to make high-speed design more feasible. (See C. H. van Berkel et al., "Scanning the Technology: Applications of Asynchronous Circuits," Proceedings of the IEEE, 87(2):223-233, February 1999, which is incorporated by reference in its entirety herein.) Asynchronous pipelines avoid the issues related to the distribution of a high-speed clock, e.g., wasteful clock power and management of clock skew. Moreover, the absence of a global clock imparts a natural elasticity to the pipeline since the number of data items in the pipeline is allowed to vary over time. Finally, the inherent flexibility of asynchronous components allows the pipeline to interface with varied environments operating at different rates; thus, asynchronous pipeline styles are useful for the design of system-on-a-chip.
Asynchronous design has also demonstrated a potential for lower power consumption and lower electromagnetic noise emission. Recent successes include a fully asynchronous 80C51 microcontroller developed by Philips for use in its commercial pagers and cell phones (as described in Hans van Gageldonk et al., "An Asynchronous Low-Power 80C51 Microcontroller," Proc. Intl. Symp. Adv. Res. Async. Circ. Syst. (ASYNC), pp. 96-107, 1998, which is incorporated by reference in its entirety herein), and the AMULET3 asynchronous microprocessor developed at the University of Manchester for use in a commercial telecom product (as described in J. D. Garside et al., "AMULET3i - An Asynchronous System-On-Chip," Proc. Intl. Symp. Adv. Res. Async. Circ. Syst. (ASYNC), pp. 162-175, April 2000, which is incorporated by reference in its entirety herein).
One prior art pipeline is Williams' PS0 dual-rail asynchronous pipeline (as described in T. Williams, Self-Timed Rings and Their Application to Division, Ph.D. Thesis, Stanford University, June 1991; T. Williams et al., "A Zero-Overhead Self Timed 160ns 54b CMOS Divider," IEEE JSSC, 26(11):1651-1661, Nov. 1991; T. Williams, "Analyzing and Improving the Latency and Throughput Performance of Self-timed Pipelines and Rings," Proc. International Symposium on Circuits and Systems, May 1992; and T. Williams, "Performance of Iterative Computation in Self-Timed Rings," Journal of VLSI Signal Processing, 7(1/2):17-31, February 1994, each of which is incorporated by reference in its entirety herein). FIG. 1 illustrates Williams' PS0 pipeline 10. Each pipeline stage 12a, 12b, 12c comprises a dual-rail function block 14a, 14b, 14c and a completion detector 16a, 16b, 16c. The completion detectors 16a, 16b, 16c indicate validity or absence of data at the outputs of the associated function block 14a, 14b, 14c, respectively.
"Dual-rail" is a commonly used scheme to implement an asynchronous datapath. (See, e.g., M. Josephs et al., "Modeling and Design of Asynchronous Circuits," Proceedings of the IEEE, 87(2):234-242, February 1999; and C. Seitz, "System timing," in Introduction to VLSI Systems, Chapter 7, (Carver A. Mead et al., eds., 1980), which are incorporated by reference in their entirety herein.) In dual-rail design, two wires (or rails) are used to implement each bit. The wires indicate both the value of the bit and its validity. The encodings of 01 and 10 correspond to valid data values 0 and 1, respectively. The encoding 00 indicates the reset or spacer state with no valid data, and 11 is an unused (illegal) encoding. Encodings on the datapath typically alternate between valid values and the reset state. Since the datapath itself indicates the validity of each bit, dual-rail is effective in designing asynchronous datapaths which are highly robust in the presence of arbitrary delays. In the exemplary embodiment, stage 12a, 12b, 12c receives dual-rail input 13a, 13b, 13c and provides dual-rail output 15a, 15b, 15c, respectively. Dual-rail output 15a of stage 12a passes data to dual-rail input 13b of stage 12b.
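The dual-rail encoding described above can be illustrated with a minimal Python sketch. The function names and the (rail1, rail0) tuple representation are assumptions made for illustration only; they do not appear in the cited references.

```python
# Illustrative model of dual-rail encoding (names are hypothetical).
# Each bit is carried on two wires, written here as (rail1, rail0):
#   01 -> valid 0, 10 -> valid 1, 00 -> spacer/reset, 11 -> illegal.

SPACER = (0, 0)  # reset state: no valid data on the bit


def encode_bit(value):
    """Return the (rail1, rail0) pair for a valid data bit 0 or 1."""
    if value not in (0, 1):
        raise ValueError("a data bit must be 0 or 1")
    return (1, 0) if value == 1 else (0, 1)


def decode_bit(rails):
    """Return 0 or 1 for valid data, None for the spacer; reject 11."""
    if rails == (1, 1):
        raise ValueError("11 is an unused (illegal) encoding")
    if rails == SPACER:
        return None
    return 1 if rails == (1, 0) else 0


def is_valid(rails):
    """A bit is valid when exactly one of its two rails is high."""
    return rails in ((0, 1), (1, 0))
```

Because validity is carried by the rails themselves, a receiver in this model needs no separate timing signal to know when a bit has arrived.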
Each function block 14a, 14b, 14c is implemented using dynamic logic. A precharge/evaluate control input (PC) of each stage is tied to the output of the next stage's completion detector. For example, the precharge/evaluate control input (PC) of stage 12a is tied to the completion detector 16b of stage 12b and is passed to function block 14a on line 18a. (Similarly, the precharge/evaluate control input (PC) of stage 12b is tied to the completion detector 16c of stage 12c and is passed to function block 14b on line 18b.) Since a precharge logic block can hold its data outputs even when its inputs are reset, it also provides the functionality of an implicit latch. Therefore, a stage 12a, 12b, 12c has no explicit latch. FIG. 2 illustrates function block 14b. Although function blocks 14a and 14c are not illustrated, they are substantially identical to function block 14b, as is known in the art. FIG. 2 illustrates how a dual-rail AND gate, for example, would be implemented in dynamic logic; the dual-rail output 15b (f1 and f0) implements the AND of the dual-rail inputs 13b (a1a0 and b1b0).
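The behavior of such a dynamic dual-rail AND gate, including its implicit latching, may be sketched behaviorally as follows. This is a simplified model, not the transistor-level circuit of FIG. 2: it assumes the conventional dual-rail AND equations (f1 = a1 AND b1, f0 = a0 OR b0) and assumes that an asserted PC input (pc = 1) selects precharge. The class name is hypothetical.

```python
# Behavioral sketch of a dynamic dual-rail AND gate (assumed equations:
# f1 = a1 AND b1, f0 = a0 OR b0; pc = 1 is assumed to mean precharge).

class DualRailAnd:
    def __init__(self):
        # Outputs start in the reset (spacer) state 00.
        self.f1 = 0
        self.f0 = 0

    def step(self, pc, a, b):
        """Apply control input pc and dual-rail inputs a, b = (rail1, rail0)."""
        if pc:
            # Precharge: both output rails are reset to 0.
            self.f1 = 0
            self.f0 = 0
        else:
            a1, a0 = a
            b1, b0 = b
            # Evaluate: an output rail can only rise, never fall, so the
            # result is held even after the inputs return to the spacer.
            self.f1 |= a1 & b1
            self.f0 |= a0 | b0
        return (self.f1, self.f0)
```

In this model the output remains valid after the inputs are reset and is cleared only by precharge, which is the implicit-latch property noted above.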
The completion detector 16a, 16b, 16c at each stage 12a, 12b, 12c, respectively, signals the completion of every computation and precharge. An exemplary completion detector 16b is illustrated in FIGS. 3(a)-3(b). As illustrated in FIG. 3(a), a C-element 19b is used to combine all the results. (Further details of the C-element are described in I. E. Sutherland, "Micropipelines," Communications of the ACM, 32(6):720-738, June 1989, which is incorporated by reference in its entirety herein.) A C-element is a basic asynchronous stateholding element. More particularly, the output of an n-input C-element is high when all inputs are high, and is low when all inputs are low. If the inputs are not all high or all low, the C-element holds its previous value. It is typically implemented by a CMOS gate with an n-input series stack in both pull-up and pull-down, and an inverter on the output (with a weak feedback inverter attached to maintain state). As illustrated in FIG. 3(b), the validity, or non-validity, of the data outputs 15b is checked by OR'ing the two rails for each individual bit using OR elements 17b, and then using the C-element 19b to combine all the results to create the done signal 18a.
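The C-element behavior and the completion detection scheme described above may be modeled with the following Python sketch. The class names are hypothetical, and the model is purely behavioral; it ignores gate delays and the circuit-level implementation.

```python
# Behavioral sketch of an n-input C-element and a dual-rail completion
# detector (class names are illustrative assumptions).

class CElement:
    def __init__(self, initial=0):
        self.out = initial

    def update(self, inputs):
        """Output goes high when all inputs are high, low when all are low,
        and otherwise holds its previous value."""
        inputs = list(inputs)
        if all(inputs):
            self.out = 1
        elif not any(inputs):
            self.out = 0
        return self.out


class CompletionDetector:
    def __init__(self):
        self.c = CElement()

    def update(self, bits):
        """bits is a list of (rail1, rail0) pairs. Each pair is OR'ed to get
        a per-bit validity signal; the C-element combines the results."""
        return self.c.update([r1 | r0 for (r1, r0) in bits])
```

With this model, the done signal rises only once every bit is valid and falls only once every bit has returned to the spacer state.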
The sequencing of pipeline control for Williams' PS0 dual-rail pipeline is as follows: Stage N is precharged when stage N+1 finishes evaluation. Stage N evaluates when stage N+1 finishes precharge. Actual evaluation will commence only after valid data inputs have also been received from stage N-1. This protocol ensures that consecutive data tokens are always separated by reset tokens or spacers, where the data bits in a stage are reset to all 00 values.
The complete cycle of events for a pipeline stage is derived by observing how a single data token flows through an initially empty pipeline. The sequence of events from one evaluation by stage 12a to the next evaluation is: (i) stage 12a evaluates, then (ii) stage 12b evaluates, then (iii) stage 12b's completion detector 16b detects completion of evaluation, and then (iv) stage 12a precharges. At the same time, after completing step (ii), (iii)' stage 12c evaluates, then (iv)' stage 12c's completion detector 16c detects completion of evaluation, and initiates the precharge of stage 12b, then (v) stage 12b precharges, and finally, (vi) stage 12b's completion detector 16b detects completion of precharge, thereby releasing the precharge of stage 12a and enabling stage 12a to evaluate once again. Thus, there are six events in the complete cycle for a stage, from one evaluation to the next.
The complete cycle for a pipeline stage, traced above, consists of 3 evaluations, 2 completion detections and 1 precharge. The analytical pipeline cycle time, $T_{PS0}$, therefore is:
$$T_{PS0} = 3 \cdot t_{Eval} + 2 \cdot t_{CD} + t_{Prech} \qquad (1)$$
where $t_{Eval}$ and $t_{Prech}$ are the evaluation and precharge times for each stage, and $t_{CD}$ is the delay through each completion detector.
The per-stage forward latency, L, is defined as the time it takes the first data token, in an initially empty pipeline, to travel from the output of one stage to the output of the next stage. For PS0, the forward latency is simply the evaluation delay of a stage:
$$L_{PS0} = t_{Eval} \qquad (2)$$
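Equations (1) and (2) can be checked numerically by summing the delays of the six events on the critical cycle traced above. The following Python sketch does so; the function names are hypothetical and the delay values in the usage example are arbitrary illustrative numbers, not measurements from any cited design.

```python
# Sketch: derive the PS0 cycle time by summing the six critical-cycle events.
# Delay values used below are hypothetical and in arbitrary time units.

def ps0_cycle_events(t_eval, t_cd, t_prech):
    """Return the six events between two evaluations of stage 12a."""
    return [
        ("(i) stage 12a evaluates", t_eval),
        ("(ii) stage 12b evaluates", t_eval),
        ("(iii)' stage 12c evaluates", t_eval),
        ("(iv)' detector 16c detects evaluation", t_cd),
        ("(v) stage 12b precharges", t_prech),
        ("(vi) detector 16b detects precharge", t_cd),
    ]


def ps0_cycle_time(t_eval, t_cd, t_prech):
    """Equation (1): 3 evaluations + 2 completion detections + 1 precharge."""
    return sum(delay for _, delay in ps0_cycle_events(t_eval, t_cd, t_prech))


def ps0_forward_latency(t_eval):
    """Equation (2): per-stage forward latency equals the evaluation delay."""
    return t_eval


if __name__ == "__main__":
    t_eval, t_cd, t_prech = 1.0, 0.5, 1.0  # hypothetical delays
    print("cycle time:", ps0_cycle_time(t_eval, t_cd, t_prech))
    print("latency:", ps0_forward_latency(t_eval))
```

Steps (iii) and (iv) of the traced sequence do not appear in the sum because they proceed in parallel with steps (iii)' and (iv)' and are not on the critical cycle.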
A disadvantage of this type of latch-free asynchronous dynamic pipeline (e.g., PS0) is that alternating stages usually must contain "spacers," or "reset tokens," limiting the pipeline capacity to 50%. Another disadvantage of the Williams pipeline 10 (PS0) is that it requires a number of synchronization points between stages. Moreover, Williams' pipeline maintains data integrity by constraining the interaction of pipeline stages, i.e., the precharge and evaluation of a stage are synchronized with specific events in neighboring stages.
Another prior art pipeline design, called PA0, is described in T. E. Williams, "Self-Timed Rings and their Application to Division," Ph.D. thesis, Stanford University, June 1991, which is incorporated by reference in its entirety herein. The PA0 pipeline uses control inputs from two subsequent stages, instead of one. The structure of Williams' PA0 pipeline 20 is shown in FIG. 4. Each pipeline stage 21a, 21b, 21c has a function block 22a, 22b, 22c, a completion detector 24a, 24b, 24c, and an asymmetric C-element (hereinafter "aC" element) 26a, 26b, 26c. Each stage 21a, 21b, 21c receives a precharge control input 30a, 30b, 30c (PC) and an evaluate control input 28a, 28b, 28c (EVAL). The completion detector 24a, 24b, 24c produces an output which is the completion signal 32a, 32b, 32c. The aC element 26a, 26b, 26c produces an output 34a, 34b, 34c. The precharge control input 30a (PC) of stage 21a is the completion signal 32b from stage 21b. The evaluate control input 28a (EVAL) of stage 21a is the output 34b of aC element 26b, which is derived from the completion detector 24c of stage 21c.
The pipeline 20 (PA0) operates as follows. Stage N is driven into evaluation as soon as stage N+1 starts to precharge. For example, stage 21a begins to evaluate once stage 21b starts to precharge. Thus, the pipeline 20 (PA0) allows early evaluation. The "trigger signal" which causes the start of evaluation is EVAL=low. Stage N is precharged when N+1 is done evaluating (PC=high) and N+2 is done precharging (EVAL=high).
This stage's control is implemented by an aC element 26a, 26b, 26c, shown in FIG. 4, which adds a delay to the cycle time. More particularly, the aC element has two inverters in series in the critical path, e.g., 27a/29a, 27b/29b, 27c/29c. As described above, an early evaluation of stage N is enabled by the de-assertion of the trigger signal 28a, 28b, 28c (EVAL=low), which is an input to the control. In pipeline 20 (PA0), the aC element 26a, 26b, 26c holds this value, and evaluation persists until the desired precharge phase begins. The two inverters in the critical path add four inverter delays to the cycle time, because the critical path of pipeline 20 (PA0) for stage 21a goes through two of these aC elements, i.e., the aC element 26b of stage 21b and the aC element 26c of stage 21c, and therefore through inverters 27b/29b, 27c/29c.
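The PA0 stage control rule stated above (evaluate on EVAL=low, precharge on PC=high together with EVAL=high, and otherwise hold) may be sketched as a small state-holding model. This is an interpretation of the prose only; it is not a reproduction of the aC element circuit of FIG. 4, and the class name, the initial phase and the string labels are assumptions.

```python
# Behavioral sketch of the PA0 per-stage control rule (an interpretation of
# the text above; names and the initial phase are assumptions).

class Pa0StageControl:
    def __init__(self):
        # Assumed initial condition: the stage starts enabled for evaluation.
        self.phase = "evaluate"

    def update(self, pc, eval_):
        """pc and eval_ are the PC and EVAL control inputs (0 = low, 1 = high)."""
        if eval_ == 0:
            # Trigger signal EVAL=low starts (early) evaluation.
            self.phase = "evaluate"
        elif pc == 1:
            # PC=high and EVAL=high: the stage is precharged.
            self.phase = "precharge"
        # Remaining case (PC=low, EVAL=high): hold the previous phase,
        # modeling the state-holding behavior of the aC element.
        return self.phase
```

The hold case is what allows evaluation to persist after the trigger signal has been withdrawn, until the precharge condition is met.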
Three recent, competitive asynchronous pipelines provide improved performance but suffer from numerous disadvantages which have been addressed by the digital signal processing pipeline apparatus in accordance with the invention.
Several variants of Williams' dual-rail schemes have been proposed. For example, a design by Renaudin provides high storage capacity (as described in M. Renaudin et al., "New Asynchronous Pipeline Scheme: Application to the Design of a Self-Timed Ring Divider," IEEE JSSC, 31(7):1001-1013, July 1996, which is incorporated by reference in its entirety herein). Renaudin's pipelines achieve 100% capacity without extra latches or "identity stages." Their approach locally manipulates the internal structure of the dynamic gate in order to provide increased capacity.
However, there are two significant disadvantages of Renaudin's pipelines. First, in Renaudin's pipelines, extra latching is achieved by modifying the output inverter of each dynamic gate into a gated inverter, through the use of additional transistors. A second disadvantage of Renaudin's pipelines is a relatively low throughput. In particular, Renaudin's pipelines are based on a much more conservative form of PS0 pipelines, referred to as the PC0 pipeline. Consequently, their throughput, while an improvement over the PC0 pipeline, is worse than even that of pipeline 10 (PS0).
In addition to the dual-rail datapaths described above, single-rail designs are increasingly being used in asynchronous processing due to comparatively reduced area and power overhead. The classic single-rail asynchronous pipelines introduced by Sutherland are called "micropipelines." (As described in I. E. Sutherland, "Micropipelines," Communications of the ACM, 32(6):720-738, June 1989, which is incorporated by reference in its entirety herein.) This style uses elegant transition-signaling (2-phase) control, but has slow and complex capture-pass latches which limit performance. Several variants of micropipelines have been proposed using alternative latching or control structures.
The two single-rail FIFO designs by Molnar et al., the asp* FIFO and the micropipelined FIFO, are among the most competitive pipelines presented in the literature, with reported throughputs of 1.1 Giga and 1.7 Giga items/second in 0.6 μm CMOS (C. Molnar et al., "Two FIFO Ring Performance Experiments," Proceedings of the IEEE, 87(2):297-307, February 1999).
Molnar's first FIFO, asp*, has significant drawbacks. When processing logic is added to the pipeline stages, the throughput of the asp* FIFO is expected to significantly degrade relative to the pipeline designs described herein. This performance loss occurs because the asp* FIFO requires explicit latches to separate logic blocks. The latches are essential to the design; they ensure that the protocol will not result in data overruns. As a result, in the asp* FIFO, with combinational logic distinct from latches, the penalty of logic processing can be significant. In addition, the asp* FIFO has complex timing assumptions which have not been explicitly formalized; in fact, an early version was unstable due to timing issues.
Molnar's second design, the micropipelined FIFO, also has several shortcomings. First, the micropipeline is actually composed of two parallel "half-rate" FIFOs, each providing only half of the total throughput (0.85 Giga items/second); thus, the net throughput of 1.7 Giga items/second is achieved only at a significant cost in area. Second, the micropipelined FIFO uses very expensive transition latches. Finally, a significant limitation of the micropipelined FIFO is that it cannot perform logic processing at all; i.e., it can only be used as a FIFO. The reason for this restriction is that it uses a complex latch structure in which parts of each latch are shared between adjacent stages. As a result, insertion of logic blocks between latches is not possible.
Among the fastest designs reported in the literature are the IPCMOS pipelines, with throughputs of 3.3-4.5 GHz in a 0.18 μm CMOS process (S. Shuster et al., "Asynchronous Interlocked Pipelined CMOS Circuits Operating at 3.3-4.5 GHz," Proceedings ISSCC, February 2000). IPCMOS has disadvantages at the circuit as well as at the protocol levels. First, IPCMOS uses large and complex control circuits which have significant delays. Second, IPCMOS makes use of extremely aggressive circuit techniques, which require a significant effort of design and verification. For example, one of the gates in their "strobe" circuit potentially may have a short circuit through its pull-up and pull-down stacks, depending on the relative arrival times of inputs to the two stacks from multiple data streams. Their approach relies on a ratioing of the stacks to ensure correct output. Third, in IPCMOS, pipeline stages are enabled for evaluation only after the arrival of valid data inputs. Hence, the forward latency of a stage is poor, because of the delay to precharge-release the stage.
It is an object of the invention to provide a pipeline having protocols wherein no explicit latches are required.
It is an object of the invention to provide a pipeline having simple one-sided timing constraints, which may be easily satisfied.
It is an object of the invention to provide a pipeline having function blocks that may be enabled for evaluation before the arrival of data. Thus, data can simply ripple through each stage in succession.
It is an object of the invention to provide a pipeline in which a stage receives control signals from the next stage as well as from stages further down the pipeline.
It is an object of the invention to provide a pipeline in which a stage indicates to its previous stage that it is about to complete an action in parallel with the completion of the action.
It is yet another object of the invention to provide a pipeline having reduced critical delays, smaller chip area, lower power consumption, and simple, small and fast control circuits to reduce overhead.
These and other objects of the invention, which will become apparent with respect to the disclosure herein, are accomplished by a latchless dynamic asynchronous digital pipeline circuit for processing data in an environment comprising a first processing stage, a second processing stage and a third processing stage.
The first processing stage may be enabled to enter a first precharge phase and a first evaluate phase in response to a first precharge control signal and a second precharge control signal. The first precharge phase is enabled by the assertion of the first precharge control signal and the de-assertion of the second precharge control signal. The first evaluate phase is enabled by at least one of the de-assertion of the first precharge control signal and the assertion of the second precharge control signal. The first processing stage has a first data input for receiving the data for processing from the environment and a first data output for receiving the data processed by the first function block upon completion of the first evaluate phase.
The second processing stage is enabled to enter a second precharge phase and a second evaluate phase, and has a second data input for receiving the data for processing from the first data output and a second data output for receiving the data processed by the second function block upon completion of the second evaluate phase.
The second processing stage comprises a second completion generator which provides an indication of the presence of the data on the second data output by asserting the first precharge control signal when data is present thereon.
The third processing stage is enabled to enter a third precharge phase and a third evaluate phase, and has a third data input for receiving the data for processing from the second data output and a third data output for receiving the data processed by the third function block upon completion of the third evaluate phase.
The third processing stage comprises a third completion generator providing an indication of the presence of data on the third data output by asserting the second precharge control signal when data is present thereon.
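The phase selection stated above for the first processing stage can be expressed as a simple Boolean function of the two precharge control signals. The following Python sketch is illustrative only; the function name and the string labels are assumptions, and the sketch models only the enabling condition, not the circuit that implements it.

```python
# Sketch of the first stage's phase selection as described above.
# pc1: first precharge control signal (asserted by the second stage's
#      completion generator when data is present on the second data output).
# pc2: second precharge control signal (asserted by the third stage's
#      completion generator when data is present on the third data output).

def first_stage_phase(pc1, pc2):
    """Precharge requires pc1 asserted AND pc2 de-asserted; evaluation is
    enabled if pc1 is de-asserted OR pc2 is asserted."""
    if pc1 and not pc2:
        return "precharge"
    return "evaluate"


if __name__ == "__main__":
    # Print the full truth table for the two control signals.
    for pc1 in (False, True):
        for pc2 in (False, True):
            print(pc1, pc2, first_stage_phase(pc1, pc2))
```

The two conditions are logical complements of each other, so for every combination of the two control signals the stage is enabled for exactly one phase.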
Another latchless dynamic asynchronous digital pipeline circuit for processing data in an environment is provided, which provides an early indication of the completion of the evaluate phase or the precharge phase of a processing stage. The pipeline circuit comprises a first processing stage and a second processing stage. The first processing stage comprises a first function block enabled to enter a first precharge phase and a first evaluate phase in response to a first precharge control signal, and has a first data input for receiving the data for processing from the environment and a first data output for receiving the data processed by the first function block upon completion of the first evaluate phase.
The second processing stage comprises a second function block enabled to enter a second precharge phase and a second evaluate phase in response to a second precharge control signal, and has a second data input for receiving the data for processing from the first data output and a second data output for broadcasting the data processed by the second function block.
The second processing stage has a completion generator responsive to the second precharge control signal and to the data from the first data output, and configured to provide an indication to the first processing stage of the phase for which the second function block has been enabled in parallel with such enablement.
In accordance with the invention, the objects as described above have been met, and the need in the art for a digital pipeline circuit having high throughput and low latency has been satisfied. Further features of the invention, its nature and various advantages will be more apparent from the accompanying drawings and the following detailed description of illustrative embodiments.