Many information processing devices operate based on a control clock signal to synchronize operations of different processing components and therefore are usually referred to as "synchronous" processing devices. In general, different processing components may operate at different speeds due to various factors including the nature of different functions and different characteristics of the components or properties of the signals processed by the components. Synchronization of these different processing components requires the speed of the control clock signal to accommodate the slowest processing component. Thus, some processing components may complete their respective operations earlier than slower components and have to wait until all processing components complete their operations. Although the speed of a synchronous processor can be improved to a certain extent by increasing the clock speed, synchronous processing is not an efficient way of utilizing available resources.
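The cost of clocking to the slowest component can be illustrated with a minimal sketch. The function name `synchronous_idle_time` and the delay values are hypothetical and chosen only for illustration; the sketch assumes each component has a single fixed delay per operation.

```python
def synchronous_idle_time(stage_delays):
    """Return the clock period and the per-component idle time per cycle.

    Assumption: the clock period equals the delay of the slowest component,
    so every faster component waits for the remainder of the cycle.
    """
    if not stage_delays:
        raise ValueError("at least one component delay is required")
    period = max(stage_delays)
    idle = [period - d for d in stage_delays]
    return period, idle


# Hypothetical delays (arbitrary time units) for three components.
period, idle = synchronous_idle_time([2, 5, 3])
print(period, idle)  # prints: 5 [3, 0, 2]
```

In this illustration the first and third components sit idle for 3 and 2 time units of every 5-unit cycle, which is the inefficiency described above.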
An alternative approach, pioneered by Alain Martin of the California Institute of Technology, eliminates synchronization of different processing components according to a clock signal. Different processing components simply operate as fast as permitted by their structures and operating environments. The operation speed is not tied to any clock speed. This obviates many technical obstacles in a synchronous processor and can be used to construct an "asynchronous" processor with a much simplified architecture and a fast processing speed that are difficult to achieve with synchronous processors.
U.S. Pat. No. 5,752,070 to Martin and Burns, which is incorporated herein by reference in its entirety, discloses such an asynchronous processor. This asynchronous processor goes against the conventional wisdom of using a clock to synchronize various components and operations of the processor and operates without a synchronizing clock. The instructions can be executed as fast as the processing circuits allow, and the processing speed is essentially limited only by delays caused by gates and interconnections.
Such an asynchronous processor can be optimized for high-speed processing by special pipelining techniques based on unique properties of the asynchronous architecture. Asynchronous pipelining allows multiple instructions to be executed at the same time. This has the effect of executing instructions in a different order than originally intended. An asynchronous processor compensates for this out-of-order execution by maintaining the integrity of the output data without a synchronizing clock signal.
A synchronous processor relies on the control clock signal to indicate when an operation of a component is completed and when the next operation of another component may start. An asynchronous processor has no such control clock. Instead, a pipelined processing component in an asynchronous processor generates a completion signal to inform the previous processing component of the completion of an operation.
For example, assume P1 and P2 are two adjacent processing components in an asynchronous pipeline. The component P1 receives and processes data X to produce an output Y. The component P2 processes the output Y to produce a result Z. At least two communication channels are formed between P1 and P2: a data channel that sends Y from P1 to P2, and a request/acknowledgment channel by which P2 acknowledges receipt of Y to P1 and requests the next Y from P1. The messages communicated to P1 via the request/acknowledgment channel are produced by P2 according to a completion signal internal to P2.
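The exchange between P1 and P2 can be sketched in software as two threads joined by two single-slot queues. This is only a behavioral analogy under stated assumptions, not a circuit description: the names `run_pipeline`, `f1`, and `f2` are hypothetical, and the sketch assumes P2 sends its acknowledgment only after its own operation on Y has completed.

```python
import queue
import threading


def run_pipeline(inputs, f1, f2):
    """Model P1 -> P2 with a data channel and a request/acknowledgment channel."""
    data = queue.Queue(maxsize=1)  # data channel: carries Y from P1 to P2
    ack = queue.Queue(maxsize=1)   # request/acknowledgment channel: P2 to P1
    done = object()                # sentinel marking the end of the input stream
    results = []

    def p1():
        for x in inputs:
            data.put(f1(x))  # produce Y and send it on the data channel
            ack.get()        # wait until P2 acknowledges Y and requests the next one
        data.put(done)

    def p2():
        while True:
            y = data.get()
            if y is done:
                break
            results.append(f2(y))  # produce Z; this models P2's internal completion
            ack.put(True)          # completion triggers the acknowledgment/request

    threads = [threading.Thread(target=p1), threading.Thread(target=p2)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results


print(run_pipeline([1, 2, 3], lambda x: x + 1, lambda y: y * 2))  # prints: [4, 6, 8]
```

No clock appears anywhere in the sketch: P1 advances only when P2 reports completion, so any delay in producing that report directly delays P1.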
Generation of this completion signal can introduce an extra delay that degrades the performance of the asynchronous processor. Such extra delay is particularly problematic when operations on a datum are decomposed into two or more concurrent elementary operations on different portions of the datum. Each elementary operation requires a completion signal. The completion signals for all elementary operations are combined into one global completion signal that indicates completion of operations on that datum. Hence, a completion circuit ("completion tree") is needed to collect all elementary completion signals to generate that global completion signal. The complexity of such a completion tree increases with the number of elementary completion signals.
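How the delay of a completion tree grows with the number of elementary completion signals can be estimated with a simple timing model. The sketch below is an assumption-laden illustration: the function `completion_tree`, the `fan_in` and `gate_delay` parameters, and the uniform per-gate delay are all hypothetical. Each gate is assumed to fire one gate delay after its last input arrives, and a lone leftover signal at any level passes to the next level without delay.

```python
def completion_tree(done_times, fan_in=2, gate_delay=1.0):
    """Return (time of the global completion signal, number of tree levels).

    done_times: arrival times of the elementary completion signals.
    """
    if not done_times:
        raise ValueError("at least one elementary completion signal is required")
    if fan_in < 2:
        raise ValueError("fan_in must be at least 2")
    level = list(done_times)
    depth = 0
    while len(level) > 1:
        next_level = []
        for i in range(0, len(level), fan_in):
            group = level[i:i + fan_in]
            if len(group) == 1:
                next_level.append(group[0])  # leftover signal passes through
            else:
                next_level.append(max(group) + gate_delay)  # gate waits for last input
        level = next_level
        depth += 1
    return level[0], depth


print(completion_tree([0.0] * 8))             # prints: (3.0, 3)
print(completion_tree([0.0] * 32))            # prints: (5.0, 5)
print(completion_tree([0.0, 0.0, 0.0, 5.0]))  # prints: (7.0, 2)
```

Under this model, going from 8 to 32 elementary signals with two-input gates adds two more gate delays to every datum, and the global signal can never arrive earlier than the slowest elementary signal plus the depth of its path through the tree.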
When not properly implemented, the extra delays of a completion tree can significantly offset the advantages of an asynchronous processor. Therefore, it is desirable to reduce or minimize the delays in a completion tree.