The present invention relates to information processing, and more specifically to architecture and operation of asynchronous circuits and processors.
Many information processing devices operate based on a control clock signal to synchronize operations of different processing components and therefore are usually referred to as “synchronous” processing devices. In general, different processing components may operate at different speeds due to various factors, including the nature of their functions, the characteristics of the components, and the properties of the signals they process. Synchronizing these different processing components requires the control clock signal to be slow enough to accommodate the slowest processing component. Thus, some processing components may complete their respective operations earlier than slower components and have to wait until all processing components complete their operations. Although the speed of a synchronous processor can be improved to a certain extent by increasing the clock speed, synchronous processing is not an efficient way of utilizing available resources.
An alternative approach, pioneered by Alain Martin of the California Institute of Technology, eliminates synchronization of different processing components according to a clock signal. Different processing components simply operate as fast as permitted by their structures and operating environments. There is no relationship between a clock speed and the operation speed. This obviates many technical obstacles in a synchronous processor and can be used to construct an “asynchronous” processor with a much simplified architecture and a fast processing speed that are difficult to achieve with synchronous processors.
U.S. Pat. No. 5,752,070 to Martin and Burns, which is incorporated herein by reference in its entirety, discloses such an asynchronous processor. This asynchronous processor goes against the conventional wisdom of using a clock to synchronize various components and operations of the processor and operates without a synchronizing clock. The instructions can be executed as fast as the processing circuits allow, and the processing speed is essentially limited only by delays caused by gates and interconnections.
Such an asynchronous processor can be optimized for high-speed processing by special pipelining techniques based on unique properties of the asynchronous architecture. Asynchronous pipelining allows multiple instructions to be executed at the same time. This has the effect of executing instructions in a different order than originally intended. An asynchronous processor compensates for this out-of-order execution by maintaining the integrity of the output data without a synchronizing clock signal.
A synchronous processor relies on the control clock signal to indicate when an operation of a component is completed and when the next operation of another component may start. An asynchronous processor eliminates such clock-based synchronization. Instead, a pipelined processing component in an asynchronous processor generates a completion signal to inform the previous processing component of the completion of an operation.
For example, assume P1 and P2 are two adjacent processing components in an asynchronous pipeline. The component P1 receives and processes data X to produce an output Y. The component P2 processes the output Y to produce a result Z. At least two communication channels are formed between P1 and P2: a data channel that sends Y from P1 to P2, and a request/acknowledgment channel by which P2 acknowledges receipt of Y to P1 and requests the next Y from P1. The messages communicated to P1 via the request/acknowledgment channel are produced by P2 according to a completion signal internal to P2.
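The exchange between P1 and P2 described above can be illustrated with a small software sketch. This is only an analogy, not a circuit description: the function `run_pipeline`, the callables `f` and `g`, and the end-of-stream marker `_DONE` are all hypothetical names introduced here, and the sketch assumes that P2 sends its acknowledgment once it has finished with Y.

```python
import queue
import threading

_DONE = object()  # hypothetical end-of-stream marker, not part of the source


def run_pipeline(xs, f, g):
    """Model P1 -> P2 with a data channel and a request/acknowledgment channel."""
    data = queue.Queue(maxsize=1)  # data channel: carries Y from P1 to P2
    ack = queue.Queue(maxsize=1)   # request/acknowledgment channel: P2 to P1
    results = []

    def p1():
        for x in xs:
            data.put(f(x))  # P1 produces Y from X and sends it
            ack.get()       # P1 waits for P2 to acknowledge Y and request more
        data.put(_DONE)

    def p2():
        while True:
            y = data.get()
            if y is _DONE:
                break
            results.append(g(y))  # P2 produces Z from Y
            ack.put(True)         # internal completion drives the acknowledgment

    threads = [threading.Thread(target=p1), threading.Thread(target=p2)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

Neither thread consults a shared clock; each proceeds as soon as the other side of the channel is ready, which is the property the handshake is meant to provide.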
Generation of this completion signal can introduce an extra delay that degrades the performance of the asynchronous processor. Such extra delay is particularly problematic when operations on a datum are decomposed into two or more concurrent elementary operations on different portions of the datum. Each elementary operation requires a completion signal. The completion signals for all elementary operations are combined into one global completion signal that indicates completion of operations on that datum. Hence, a completion circuit (“completion tree”) is needed to collect all elementary completion signals to generate that global completion signal. The complexity of such a completion tree increases with the number of the elementary completion signals.
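As a rough illustration of how a completion tree grows, the following sketch combines elementary completion signals level by level and counts the levels. The function name `completion_tree` and the `fan_in` parameter are assumptions made for this example, and a plain logical AND stands in for whatever combining element an actual circuit would use.

```python
def completion_tree(signals, fan_in=2):
    """Combine elementary completion signals into one global signal.

    Returns (global_done, levels), where levels is the number of
    combining levels between the inputs and the global signal.
    """
    level = list(signals)
    if not level:
        raise ValueError("at least one elementary completion signal is required")
    levels = 0
    while len(level) > 1:
        # Each group of up to fan_in signals feeds one combining element.
        level = [all(level[i:i + fan_in]) for i in range(0, len(level), fan_in)]
        levels += 1
    return level[0], levels
```

Under these assumptions, 32 elementary signals combined two at a time pass through 5 levels before the global signal is available, which suggests why the delay of the tree matters for wide datapaths.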
When not properly implemented, the extra delays of a completion tree can significantly offset the advantages of an asynchronous processor. Therefore, it is desirable to reduce or minimize the delays in a completion tree.
The present disclosure provides a pipelined completion tree for asynchronous processors. A high throughput and a low latency can be achieved by decomposing any pipeline unit into an array of simple pipeline blocks. Each block operates only on a small portion of the datapath. Global synchronization between stages, when needed, is implemented by copy trees and slack matching.
More specifically, one way to reduce the delay in the completion tree uses asynchronous pipelining to decompose a long critical cycle in a datapath into two or more short cycles. One or more decoupling buffers may be disposed in the datapath between two pipelined stages. Another way to reduce the delay in the completion tree is to reduce the delay caused by distribution of a signal to all N bits in an N-bit datapath. Such delay can be significant when N is large. The N-bit datapath can also be partitioned into m small datapaths of n bits (N=m×n) that are parallel to one another. These m small datapaths can transmit data simultaneously. Accordingly, each N-bit processing stage can also be replaced by m small processing blocks of n bits.
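The partitioning of an N-bit datapath into m blocks of n bits can be illustrated on an integer word. The helper names `partition` and `merge`, and the choice to place the least significant block first, are assumptions made for this sketch.

```python
def partition(value, N, n):
    """Split an N-bit value into m = N // n blocks of n bits, low block first."""
    if N % n != 0:
        raise ValueError("N must be a multiple of n so that N = m * n")
    m = N // n
    mask = (1 << n) - 1
    return [(value >> (i * n)) & mask for i in range(m)]


def merge(blocks, n):
    """Reassemble the full word from its n-bit blocks, low block first."""
    value = 0
    for i, block in enumerate(blocks):
        value |= block << (i * n)
    return value
```

For example, a 32-bit word split with n = 8 yields m = 4 blocks, each of which could be handled by its own small processing block before the results are recombined.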
One embodiment of the asynchronous circuit uses the above two techniques to form a pipelined completion tree in each stage to process data without a clock signal. This circuit comprises a first processing stage receiving an input data and producing a first output data, and a second processing stage, connected to communicate with said first processing stage without prior knowledge of delays associated with said first and second processing stages and to receive said first output data to produce an output. Each processing stage includes:
a first register and a second register connected in parallel relative to each other to respectively receive a first portion and a second portion of a received data,
a first logic circuit connected to said first register to produce a first completion signal indicating whether all bits of said first portion of said received data are received by said first register,
a second logic circuit connected to said second register to produce a second completion signal indicating whether all bits of said second portion of said received data are received by said second register,
a third logic circuit connected to receive said first and second completion signals and configured to produce a third completion signal to indicate whether all bits of said first and second portions of said received data are received by said first and second registers,
a first buffer circuit connected between said first logic circuit and the third logic circuit to pipeline said first and third logic circuits, and
a second buffer circuit connected between said second logic circuit and the third logic circuit to pipeline said second and third logic circuits.
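The elements of a processing stage listed above can be sketched as a rough behavioral model. The class `Stage`, its method names, and the discrete `step()` call are assumptions made for illustration; `step()` only models the order in which signals propagate through the buffers and does not stand for a clock.

```python
class Stage:
    """Behavioral sketch of one processing stage (hypothetical model)."""

    def __init__(self, bits_per_register):
        # None marks a bit position that has not yet received data.
        self.reg1 = [None] * bits_per_register  # first register
        self.reg2 = [None] * bits_per_register  # second register
        self.buf1 = False  # first buffer: holds the first completion signal
        self.buf2 = False  # second buffer: holds the second completion signal

    def receive(self, register, index, bit):
        """Deliver one bit to the first (1) or second (2) register."""
        target = self.reg1 if register == 1 else self.reg2
        target[index] = bit

    @staticmethod
    def _complete(reg):
        # First/second logic circuit: have all bits of this portion arrived?
        return all(bit is not None for bit in reg)

    def step(self):
        """Advance the model by one propagation step.

        The third logic circuit reads the values already held in the
        buffers, then the buffers capture the current local signals.
        """
        third = self.buf1 and self.buf2
        self.buf1 = self._complete(self.reg1)
        self.buf2 = self._complete(self.reg2)
        return third
```

In this model the third completion signal becomes true one step after both registers are full, because the buffers decouple the local completion logic from the combining logic.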
These and other aspects and advantages will become more apparent in light of the accompanying drawings, the detailed description, and the appended claims.