1. Field of the Invention
The present invention relates to electronic pipelines and, more specifically, to pipelines in which data is transferred between internal stages asynchronously.
2. The Prior Art
Pipelining is a process by which an operation is separated into stages, where each stage is an independent component of the operation. Each stage of a pipeline operates on data passed to it from the previous stage and, when complete, passes the result to the next stage. Pipelining thus allows independent components of an operation to be performed concurrently, increasing throughput. The simplest example of a pipeline is the first-in/first-out buffer, or FIFO, which is used to transfer data between two independent devices in a way that maintains the independence of the devices. The FIFO stores the output from one device until needed by the next device, ensuring that the data is received in the same order in which it is generated. The first device stores its output in the FIFO at the rate it is generated, and the second device reads its input from the FIFO when needed, effectively coupling the two devices together with respect to data, but decoupling them with respect to timing. An important application for FIFOs is in the interface between a processor and a disk drive. The processor generates data to be written to the disk in infrequent, but high-speed, bursts, and the disk drive writes the data at a much slower, albeit steady, rate. Another important application for FIFOs is in data communication, where there is a need to buffer data between two systems that operate at different speeds.
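The decoupling described above can be illustrated with a minimal software sketch. It is only an analogy for the hardware FIFO, not a circuit model: the function name `simulate_fifo`, the tick-based timing, and the sample bursts are all assumptions made for illustration.

```python
from collections import deque

def simulate_fifo(bursts, drain_per_tick):
    """Feed bursts into a FIFO and drain it at a steady rate.

    bursts[t] is the list of items the producer writes at tick t.
    drain_per_tick must be at least 1, or the FIFO never empties.
    Returns the items in the order the consumer read them, plus the
    peak FIFO occupancy observed.
    """
    fifo = deque()
    received = []
    peak = 0
    tick = 0
    # Run until every burst has been written and the FIFO is empty.
    while tick < len(bursts) or fifo:
        if tick < len(bursts):
            fifo.extend(bursts[tick])          # producer writes at its own rate
        peak = max(peak, len(fifo))
        for _ in range(min(drain_per_tick, len(fifo))):
            received.append(fifo.popleft())    # consumer reads oldest item first
        tick += 1
    return received, peak

# Hypothetical workload: two bursts from the fast device, one item per tick drained.
bursts = [[1, 2, 3, 4], [], [], [5, 6], []]
received, peak = simulate_fifo(bursts, drain_per_tick=1)
print(received, peak)
```

The consumer sees the items in the order they were produced even though the two sides run at different rates; the peak occupancy indicates how deep the FIFO would need to be for this particular workload.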
A pipeline can be implemented to operate either synchronously or asynchronously. In synchronous operation, data movement from each stage of the pipeline to the next is controlled by a single global clock. In asynchronous operation, data movement between stages is controlled by a local handshake operation. When a stage has data ready for the next stage, it sends a request to that stage, and when that stage accepts the data, it returns an acknowledgement. There are a number of advantages that asynchronous pipelines have over synchronous pipelines. In a synchronous pipeline, it is assumed that the processing time of every stage is no greater than the time period of the clock. Consequently, the pipeline operates at the speed of the slowest stage, reflecting the worst-case time for the operation to complete. On the other hand, each stage of an asynchronous pipeline transfers data when ready, reflecting the average time for the operation to complete.
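The worst-case versus average-case distinction can be made concrete with a small sketch. It assumes a single stage whose processing time depends on the data and ignores handshake overhead; the delay values and function names are hypothetical.

```python
def synchronous_time(delays_ns):
    """Total time when every item occupies one clock period.

    The clock period must cover the slowest (worst-case) operation.
    """
    clock_period = max(delays_ns)
    return clock_period * len(delays_ns)

def asynchronous_time(delays_ns):
    """Total time when each item is passed on as soon as it is ready.

    Handshake overhead is ignored in this simplified model.
    """
    return sum(delays_ns)

# Hypothetical per-item processing times for one stage, in nanoseconds.
delays_ns = [2, 2, 5, 2, 3]
print(synchronous_time(delays_ns))   # 5 ns period * 5 items
print(asynchronous_time(delays_ns))  # sum of the individual delays
```

Under these assumptions the synchronous total is set by the 5 ns worst case, while the asynchronous total tracks the 2.8 ns average.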
In a synchronous pipeline, the fastest speed is obtained when the clock period is as close as possible to the processing time of the slowest stage. Careful planning and design are required to provide proper control to each stage. The physical layout of the pipeline must be designed to minimize the difference between the times at which the clock reaches the various stages (clock skew). The greater the difference, the slower the clock must be to accommodate it, resulting in slower operation. In an asynchronous pipeline, data transfer between stages is controlled locally, without dependence upon the other stages.
In a synchronous pipeline, the clock must always be active, so the stages are always active, consuming power. The clock can only be shut off with careful timing and after significant latencies. On the other hand, when there is no input data, an asynchronous pipeline ceases operation completely, consuming no power in typical designs. For high-speed designs that use a large number of pipelines, the power savings can be significant.
The design of an asynchronous pipeline is based on the serial connection of a number of pipeline stages, as shown in the four-stage pipeline 100 of FIG. 1. Each stage 102 includes a registered data path 104 and a stage controller 106. In the simple case of a FIFO, the registered data path 104 reduces to a data register 108. Each data register 108 has a number of identical cells 110, one for each bit in the width of the data, typically an integral power of 2, such as 8, 16, 32, or 64 bits. Unless specifically stated otherwise, all further references in this specification to a data register mean a reference to all of the cells of the data register. Data enters the first stage data register 108 through the entry port D.sub.IN and is sequentially transferred from one data register 108 to the next until it is available as output data at the output port D.sub.OUT.
Data moves into and out of a stage 102 under the control of the stage controller 106. The stage controller 106 generates a load signal L to the data register 108, causing the data register 108 to latch the data at its input D.sub.I. A short time later, represented by the propagation delay through the data register 108, the data becomes available at the register output D.sub.O. The data transfer is coordinated by a pair of handshake signals between stage controllers 106 of adjacent stages 102. The signal pair includes a request signal R and an acknowledge signal A. Together they form a closed-loop communication between two adjacent stages. When a stage 102 has data available at its output D.sub.O, it asserts the request signal R. When the succeeding stage has accepted the data, it asserts the acknowledge signal A.
There are two protocols used to transfer data between stages, four-cycle signaling and two-cycle signaling, shown in the timing diagrams of FIGS. 2a and 2b, respectively. In four-cycle signaling, a signal is always asserted by the same transition of the signal line, illustrated as a low-to-high transition in FIG. 2a. After the data is available from the sender, the sender changes R from low to high. The receiver acknowledges the data by changing A from low to high. Upon receiving the low-to-high acknowledgement, the sender resets R to low, and upon seeing R go low, the receiver resets A to low. The term four-cycle refers to the fact that there are four transitions of the two signal lines during a transfer.
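The four-cycle sequence described above can be written out as an ordered list of signal transitions. This is a minimal sketch with no timing information; the function name `four_cycle_transfers` and the `(signal, level)` tuple encoding are assumptions for illustration.

```python
def four_cycle_transfers(n):
    """Return the (signal, new_level) transitions for n four-cycle transfers."""
    events = []
    for _ in range(n):
        events += [
            ("R", 1),  # sender: data is valid, raise request
            ("A", 1),  # receiver: data accepted, raise acknowledge
            ("R", 0),  # sender: return request to low
            ("A", 0),  # receiver: return acknowledge to low
        ]
    return events

events = four_cycle_transfers(2)
print(len(events))  # four transitions per transfer
```

Both lines return to low after every transfer, so half of the transitions carry no data and serve only to reset the handshake.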
In two-cycle signaling, a signal is asserted by any transition of the signal line, called an event. After data is available from the sender, the sender changes the state of R, either from low to high or from high to low, causing an event on R. The receiver acknowledges receipt of data by changing A to its opposite state, causing an event on A. No other transitions of R and A are involved. Two-cycle signaling is potentially twice as fast as four-cycle signaling, at the expense of a more complicated circuit implementation. Thus, for maximum throughput of a pipeline, two-cycle signaling is preferred.
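For comparison, two-cycle signaling can be sketched the same way, with each event modeled as a toggle of the line. The function name `two_cycle_transfers` and the tuple encoding are illustrative assumptions, and timing is again omitted.

```python
def two_cycle_transfers(n):
    """Return the (signal, new_level) transitions for n two-cycle transfers."""
    r = a = 0
    events = []
    for _ in range(n):
        r ^= 1                    # sender: any transition on R is a request event
        events.append(("R", r))
        a ^= 1                    # receiver: any transition on A is an acknowledge event
        events.append(("A", a))
    return events

events = two_cycle_transfers(2)
print(events)  # two transfers complete in four transitions
```

Two transfers produce the same four transitions that a single four-cycle transfer requires, which is the source of the potential factor-of-two speed advantage.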
A block diagram of a pipeline stage 120 with a two-cycle stage controller 122 is shown in FIG. 3 and the associated timing diagram is shown in FIG. 4. The nomenclature used in the art is that the input request signal is R.sub.I, its associated acknowledge signal is A.sub.I, the output request signal is R.sub.O, and its associated acknowledge signal is A.sub.O. The stage controller has a latch 126 to generate R.sub.O, a latch 128 to generate A.sub.I, and a logic circuit 130 to generate L for the data register 124. A cycle begins with an event on R.sub.I. If there is no data in the data register 124, as indicated by an A.sub.O event in the previous cycle, the R.sub.I event causes L to be asserted, latching the data D.sub.I from the previous stage into the data register 124. The R.sub.I event also causes an event on R.sub.O, signaling to the next stage that the data D.sub.O at the output of the data register 124 is available. The R.sub.O event causes L to be deasserted, which, in turn, causes an event on A.sub.I, indicating to the previous stage that its data D.sub.I has been accepted, and terminating the cycle. Meanwhile, the next stage has acknowledged the R.sub.O event by an event on A.sub.O. The next cycle begins with an R.sub.I event, but this does not cause L to be asserted until the data D.sub.O has been read by the next stage, as indicated by an A.sub.O event in the previous cycle.
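The behavior of a chain of such two-cycle stages can be approximated by a small behavioral model. This sketch is not the circuit of FIG. 3: it abstracts the L pulse into a single "fire" step, treats a handshake as pending whenever the request and acknowledge levels differ, and uses hypothetical names (`run_pipeline`, `req`, `ack`, `bus`).

```python
def run_pipeline(items, n_stages=4):
    """Push items through a chain of two-cycle stages; return the output order."""
    # req[i]/ack[i] form the handshake pair entering stage i;
    # index n_stages is the pair between the last stage and the consumer.
    req = [0] * (n_stages + 1)
    ack = [0] * (n_stages + 1)
    bus = [None] * (n_stages + 1)   # bus[i]: data presented at boundary i
    pending = list(items)
    out = []
    progress = True
    while progress:
        progress = False
        # Producer: present new data once the previous request was acknowledged.
        if pending and req[0] == ack[0]:
            bus[0] = pending.pop(0)
            req[0] ^= 1                      # event on the first stage's R_I
            progress = True
        for i in range(n_stages):
            # Fire only if an input request is pending and the register is free,
            # that is, the previous output request has been acknowledged.
            if req[i] != ack[i] and req[i + 1] == ack[i + 1]:
                bus[i + 1] = bus[i]          # L pulse: latch D_I into the register
                req[i + 1] ^= 1              # event on R_O
                ack[i] ^= 1                  # event on A_I
                progress = True
        # Consumer: read the output and acknowledge it.
        if req[n_stages] != ack[n_stages]:
            out.append(bus[n_stages])
            ack[n_stages] ^= 1               # event on the last stage's A_O
            progress = True
    return out

print(run_pipeline(["a", "b", "c"], n_stages=4))
```

Because a stage overwrites its register only after the downstream acknowledge has arrived, the model delivers items in the order they were supplied, regardless of the order in which the stages are evaluated.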
The throughput of a single stage of a two-cycle asynchronous FIFO is 1/T.sub.CYCLE, where T.sub.CYCLE is the maximum of the delays between consecutive events on R.sub.I, A.sub.I, R.sub.O, and A.sub.O, that is,

T.sub.CYCLE = max(T.sub.RI, T.sub.AI, T.sub.RO, T.sub.AO)

where T.sub.RI, T.sub.AI, T.sub.RO, and T.sub.AO represent the greater of the positive transition time and the negative transition time of each signal. Thus, to increase the throughput of the pipeline, T.sub.CYCLE must be decreased.
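The expression for T.sub.CYCLE can be evaluated directly. The sketch below assumes each signal is described by a pair of positive and negative transition times in nanoseconds; the function name `cycle_time` and the sample values are hypothetical and are not taken from any measured circuit.

```python
def cycle_time(transitions_ns):
    """Return T_CYCLE in ns.

    transitions_ns maps a signal name to its (positive, negative)
    transition times; each signal contributes the greater of the two.
    """
    per_signal = {name: max(pos, neg) for name, (pos, neg) in transitions_ns.items()}
    return max(per_signal.values())

# Hypothetical transition times, in nanoseconds.
example = {
    "RI": (2.5, 2.8),
    "AI": (2.9, 3.0),
    "RO": (2.7, 2.6),
    "AO": (2.4, 2.2),
}
t_cycle = cycle_time(example)
throughput_mhz = 1000.0 / t_cycle   # 1/T_CYCLE, with ns converted to MHz
print(t_cycle, round(throughput_mhz, 1))
```

In this example the negative transition on A.sub.I is the slowest of the eight values, so it alone sets the cycle time and therefore the throughput.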
In most state-of-the-art pipelines 140, a single stage of which is shown in FIG. 5, the width of the data register 142 can be large: 64, 128, or more bits. For each one of these bits, the data register 142 includes a data register cell 144 to hold the data. Each of these cells 144 must be driven by the load signal L in order to latch in the data. Because of the large number of loads on L, a buffer 146 is needed to provide adequate rise and fall times for the buffered load signal L.sub.B.
The timing diagram for the FIFO of FIG. 5 is shown in FIG. 6. To guarantee that the data D.sub.IN has been successfully loaded into the data register 142, the buffered load signal L.sub.B, rather than the unbuffered load signal L.sub.U, is monitored, and A.sub.I and R.sub.O are issued only after L.sub.B has indicated that data D.sub.IN has been safely latched into the data register 142. This introduces an extra delay T.sub.B to T.sub.CYCLE, which adversely affects the throughput of the system.
A typical FIFO includes a 64-bit-wide register and uses 0.5 micrometer CMOS technology. In this system, T.sub.CYCLE, when excluding T.sub.B, is approximately 3 nanoseconds (ns), which results in a throughput of approximately 333 MHz. The buffer needed to drive a 64-bit register has a delay T.sub.B of approximately 3 ns. This buffer delay essentially doubles T.sub.CYCLE, from 3 ns to 6 ns, resulting in a 50% reduction in throughput, from 333 MHz to 167 MHz. This situation will not improve with future submicrometer technologies, both because wire delays will become a more significant factor in the total delay T.sub.CYCLE and because data paths will increase in width, requiring buffers with greater drive capability and correspondingly greater buffer delays. The time needed to load data into a data register will remain at approximately 50% of total cycle time.
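The arithmetic in the preceding paragraph can be checked with a few lines. The figures are the approximate values quoted above; the function and variable names are illustrative only.

```python
def throughput_mhz(t_cycle_ns):
    """Throughput 1/T_CYCLE, with T_CYCLE in ns and the result in MHz."""
    return 1000.0 / t_cycle_ns

t_cycle_ns = 3.0    # T_CYCLE excluding the buffer delay
t_buffer_ns = 3.0   # T_B for a buffer driving a 64-bit register

without_buffer = throughput_mhz(t_cycle_ns)
with_buffer = throughput_mhz(t_cycle_ns + t_buffer_ns)
reduction = 1.0 - with_buffer / without_buffer

print(round(without_buffer), round(with_buffer), round(reduction * 100))
```

The rounded results are 333 MHz, 167 MHz, and a 50% reduction, matching the figures in the text.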