Asynchronous circuits allow flexibility in processing operations. Unlike many clocked circuits, which must be timed for worst-case behavior, asynchronous circuits exhibit average-case behavior and can therefore be optimized for data-dependent operations.
One important type of operation in processors is the prefix operation, in which the result of one operation is needed to perform another operation. A prefix operation computes results $y_1, y_2, \ldots, y_N$ from input parameters $x_1, x_2, \ldots, x_N$ supplied on input channels, where $y_k = x_1 \times x_2 \times \cdots \times x_k$ ($1 \le k \le N$) and $\times$ is an associative operator. This is described in "Parallel Prefix Computation", by Ladner and Fischer, Journal of the Association for Computing Machinery, vol. 27(4), pp. 831-838 (1980), which is incorporated herein by reference.
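As an illustration only, the definition above can be sketched in Python. The function name `prefix` and the sample inputs are assumptions made for this sketch and are not part of the described circuits; the standard-library `itertools.accumulate` is used to apply the operator from left to right.

```python
from itertools import accumulate
import operator

def prefix(xs, op):
    """Return [y_1, ..., y_N] with y_k = x_1 op x_2 op ... op x_k."""
    return list(accumulate(xs, op))

# Integer addition as the operator.
print(prefix([3, 1, 4, 1, 5], operator.add))   # [3, 4, 8, 9, 14]
# String concatenation: associative but not commutative, so order matters.
print(prefix(["a", "b", "c"], operator.add))   # ['a', 'ab', 'abc']
```

Any associative operator may be substituted for `operator.add` in this sketch.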
Prefix operations are used in various applications. For example, the computation of an arbitrary Mealy machine can be made parallel by the prefix operation. See F. Thomson Leighton, "Introduction to Parallel Algorithms and Architectures: Arrays, Trees, Hypercubes", Morgan-Kaufmann, 1992. Many different computations can be performed with a prefix computation, including solving linear recurrence relations and computing line-of-sight targets from the ground given their positions. In particular, an efficient adder for binary addition can be constructed with the prefix operation as described herein.
Communicating hardware processes (CHP) can be used to represent the prefix operations in an asynchronous circuit. It is assumed that inputs $x_1, x_2, \ldots, x_N$ arrive on input channels $X_1, X_2, \ldots, X_N$, respectively, and that the outputs $y_1, y_2, \ldots, y_N$ are to be produced on output channels $Y_1, Y_2, \ldots, Y_N$, respectively. Therefore, the prefix operation can be restated in terms of reading the values $x_i$ ($i = 1, 2, \ldots, N$) from the input channels $X_i$, computing the $y_i$ values, and sending these values on the appropriate output channels.
The above asynchronous prefix operation can be written as a CHP program as follows:
$$*[\,X_1?x_1,\ X_2?x_2,\ \ldots,\ X_N?x_N;\ Y_1!x_1,\ Y_2!(x_1 \times x_2),\ \ldots,\ Y_N!(x_1 \times x_2 \times \cdots \times x_N)\,] \qquad (1)$$
where the notation X?x represents receiving a value on channel X and storing the received value in variable x, and the notation Y!y represents sending the value of variable y over channel Y. A comma "," and a semicolon ";" in Equation (1) indicate parallel and sequential composition, respectively. The number of $\times$ operations is on the order of $N^2$, i.e., $O(N^2)$, which corresponds to $O(N^2)$ circuit elements. The prefix operation as written in Equation (1) is therefore inefficient, as will become more apparent in the following description.
For convenience of description, the operation $\times$ is assumed to have an identity $e$ in the following discussion. This does not affect the generality of the description, since one can easily augment $\times$ to include an identity and use the augmented operator instead of $\times$ to perform the desired computation, or use a special case of the construction when it is known that one of the inputs is $e$, as is known in the art.
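The quadratic cost can be checked with a small Python sketch. It assumes a literal reading of Equation (1) in which every output is computed independently from scratch; the name `naive_prefix` and the operation counter are hypothetical and serve only this illustration.

```python
import operator

def naive_prefix(xs, op):
    """Compute every y_k independently and count the operator applications."""
    ys, count = [], 0
    for k in range(1, len(xs) + 1):
        acc = xs[0]
        for x in xs[1:k]:          # y_k needs k - 1 applications of op
            acc = op(acc, x)
            count += 1
        ys.append(acc)
    return ys, count

ys, count = naive_prefix(list(range(1, 9)), operator.add)
print(ys)      # [1, 3, 6, 10, 15, 21, 28, 36]
print(count)   # 28, i.e. N(N-1)/2 for N = 8
```

Under this assumption the total is N(N-1)/2 operator applications, which grows as the square of N.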
One prior-art solution to the above prefix operation is a "tree" process. This is illustrated in FIGS. 1a and 1b. It is assumed that a circuit can be constructed to perform $a_1 \times a_2 \times \cdots \times a_m$ for $m = \lfloor N/2 \rfloor$ and $m = \lceil N/2 \rceil$. A plurality of such circuits can be used to perform $x_1 \times x_2 \times \cdots \times x_N$ by adding a single process that reads in the output of two stages and performs a single $\times$ operation, since the operation is associative. This process is shown as "UP" process 102 in FIG. 1a, in which two inputs on channels L and R are read and a desired output is generated on channel U:
$$\mathrm{UP}(L, R, U) \equiv *[\,L?x,\ R?y;\ U!(x \times y)\,] \qquad (2)$$
The value of $x_1 \times x_2 \times \cdots \times x_N$ can be computed using a tree of these UP processes.
For any input $x_k$, the prefix $x_1 \times x_2 \times \cdots \times x_{k-1}$ is needed to compute the output $y_k$. The tree in FIG. 1a shows that an input to UP at a particular node is the prefix of the inputs at the leaves of the left and right subtree of the node. The prefix required by the first (leftmost) node of the right subtree can be computed if the prefix required by the first node of the left subtree is known.
If the prefix can be obtained on another input channel V, the UP process can then be augmented to send the appropriate subtree prefixes back down the tree:
$$\mathrm{UP}(L, R, U, V, Ld, Rd) \equiv *[\,L?x,\ R?y;\ U!(x \times y);\ V?p;\ Ld!p,\ Rd!(p \times x)\,] \qquad (3)$$
Channels Ld and Rd provide exactly the inputs needed by the V channel of the children of a particular UP process. Therefore, this collection of processes can perform the prefix operation by providing an input to the root of the prefix computation tree, reading the inputs and producing the final outputs.
The V channel at the root of the tree requires a null prefix, which is the identity $e$. The output channel U of the root is not used by any other process. The root process can be simplified as:
$$\mathrm{ROOT}(L, R, Ld, Rd) \equiv *[\,L?x,\ R?y;\ Ld!e,\ Rd!x\,] \qquad (4)$$
where $e$ is the identity of $\times$. The leaves of the prefix computation tree read the inputs and the corresponding prefix from the tree, and produce the appropriate output. A LEAF process can be written as:
$$\mathrm{LEAF}(X, U, V, Y) \equiv *[\,X?x;\ U!x;\ V?y;\ Y!(y \times x)\,] \qquad (5)$$
FIG. 1b shows a complete solution for the prefix operation for N=4. Since each node in the tree has a constant number of $\times$ computations and there are $O(N)$ bounded-fanin nodes in the tree, the number of $\times$-computation circuits in the tree is $O(N)$. Since the depth of the tree is $O(\log N)$, the latency and cycle time of this tree process are $O(\log N)$.
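The up-going and down-going phases described above can be mimicked by a sequential Python sketch. It is not a model of the asynchronous circuit: the function `tree_prefix`, the tuple layout of a node, and the way the inputs are split between subtrees are assumptions of this illustration. The root is treated like any other internal node and is simply handed the identity as its prefix, so the operation count is slightly higher than in the ROOT process of Equation (4).

```python
import operator

def tree_prefix(xs, op, e):
    """Two-phase tree prefix: products travel up, prefixes travel down.

    Returns (outputs, number of operator applications). Requires len(xs) >= 1.
    """
    ops = 0
    ys = [None] * len(xs)

    def up(lo, hi):
        # Build a node (product, left, right, index) for the inputs xs[lo:hi].
        nonlocal ops
        if hi - lo == 1:
            return (xs[lo], None, None, lo)      # leaf sends its input up: U!x
        mid = (lo + hi + 1) // 2
        left, right = up(lo, mid), up(mid, hi)
        ops += 1
        return (op(left[0], right[0]), left, right, lo)   # U!(x op y)

    def down(node, p):
        # p is the prefix of everything to the left of this subtree (channel V).
        nonlocal ops
        prod, left, right, lo = node
        ops += 1
        if left is None:
            ys[lo] = op(p, prod)                 # leaf output: Y!(y op x)
            return
        down(left, p)                            # Ld!p
        down(right, op(p, left[0]))              # Rd!(p op x)

    down(up(0, len(xs)), e)
    return ys, ops

ys, ops = tree_prefix(list("abcd"), operator.add, "")
print(ys)    # ['a', 'ab', 'abc', 'abcd']
print(ops)   # 10
```

In this sketch the number of operator applications is 3N-2, which is linear in N, in contrast to the quadratic count obtained when each output is computed independently.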
The above-described tree process can be split into two parts that execute in parallel by sharing the variable x. For example, a local channel C can be added to copy the value of x in the following UP process:
$$\mathrm{UP}(L, R, U, V, Ld, Rd) \equiv *[\,L?x,\ R?y;\ U!(x \times y),\ C!x\,] \parallel *[\,C?c,\ V?p;\ Ld!p,\ Rd!(p \times c)\,] \equiv \mathrm{UP2}(L, R, U, C) \parallel \mathrm{UP2}(V, C, Rd, Ld) \qquad (6)$$
where $\parallel$ represents parallel composition and UP2 is defined as:
$$\mathrm{UP2}(A, B, C, D) \equiv *[\,A?x,\ B?y;\ C!(x \times y),\ D!x\,] \qquad (7)$$
Similarly, the LEAF process can be written as:
$$\mathrm{LEAF}(X, U, V, Y) \equiv *[\,X?x;\ U!x,\ C!x\,] \parallel *[\,C?c,\ V?y;\ Y!(y \times c)\,] \qquad (8)$$
Using a technique described by Martin, a prefix operation with the tree process can be implemented by using handshaking protocols to replace communications on channels and an encoding technique for making the circuits less sensitive to the delay of arrival of a prefix. See, Martin, "Asynchronous data paths and the design of an asynchronous adder", in Formal Methods in Systems Design, vol. 1(1), pp. 119-139 (1992); and Martin, "Compiling Communicating Processes into Delay-insensitive VLSI circuits", in Distributed Computing, Vol. 1(4), 1986.
Based on Martin's techniques, each input of a circuit can be encoded using a delay-insensitive (unordered) code so that the circuit can function correctly even if the inputs do not arrive at the same time. In such a code, the value of the input changes from a neutral value to a valid value without any intermediate values that are valid. Different valid values are used to encode different inputs. Functions u() and n() are used to denote the validity and neutrality of the code. $C\Uparrow$ is the concurrent assignment of some bits of C such that the result is an appropriate valid value without any intermediate value being valid, and $C\Downarrow$ is the concurrent assignment of some bits of C such that the result is a neutral value without any intermediate value being neutral. The exact nature of these operations depends on the encoding scheme and the operation $\times$.
According to the handshaking protocols for prefix operations, a prefix computation is initiated by the environment by setting the inputs to some valid value. The environment then waits for the outputs to become valid, after which the inputs are reset to a neutral value. The subsequent input is supplied after the outputs are reset to a neutral value. The processes of UP2, LEAF, and ROOT can be expressed with the handshaking protocol for a quasi-delay-insensitive asynchronous circuit: ##EQU1##
One limitation of the above handshaking protocol is that the tree only performs one prefix computation at a time since the circuit needs to wait for the output to become valid before resetting the input. A pipelined operation having simultaneous operations on multiple inputs is difficult to implement.
One prior-art system for pipelining operations is to introduce an additional acknowledge signal for each input and output. With this signal, the environment is able to reset the inputs after receiving an acknowledgment from the circuit and can send the next input after the acknowledgment signal has been reset. Thus, the circuit does not need to wait for the output to become valid. In this scheme, UP2, LEAF, and ROOT can be rewritten with the handshaking protocol as: ##EQU2## wherein represents negation and the signals labeled with "a" are the acknowledgment signals for the channels. Of course, other implementations of a pipelined operation are also possible.
Buffering on the channel connecting the two halves of UP (and LEAF) is needed for the prefix operation tree to allow $O(\log N)$ pipelined operations to be performed simultaneously. The buffering is proportional to the depth of the node in the tree. FIG. 2a "unfolds" the prefix operation tree of FIG. 1b to illustrate the up-going phase and the down-going phase. The vertical arrows indicate the internal channels C. For a node $d$ steps away from the root, $(2d - 1)$ stages of buffering on the internal channel C are needed to allow $(2 \log N + 1)$ simultaneous prefix operations. FIG. 2b shows the tree with the appropriate buffers.
The throughput of the pipelined prefix computation with buffers, which is the number of operations per second, depends on the time to perform the $\times$ operation rather than on the number of inputs. The latency of the computation block is the time for the output to be produced when the pipeline is empty, and it is proportional to the number of stages. Therefore, the latency of the prefix operation tree is $(2 \log N + 1)$ both with and without the buffers.
Another prior-art technique for performing the prefix operation is illustrated by a circuit in FIG. 3. This circuit performs the prefix operation in a bit-serial manner. For N different input channels, N processing stages are used, with one stage for each input channel. All processing stages are linearly connected with one another. The stage for $x_k$ receives $y_{k-1}$ on channel L from the previous stage for $x_{k-1}$ and the kth input parameter $x_k$ on channel $X_k$. The stage for $x_k$ operates to produce output $y_k$ on both channel $Y_k$ and a channel R that connects to the next stage. This prefix operation can be expressed as a communicating hardware process by:
$$\mathrm{SERIAL}(X, Y, L, R) \equiv *[\,X?x,\ L?p;\ Y!(p \times x),\ R!(p \times x)\,] \qquad (15)$$
The above linear operation has a latency of $O(N)$, which is worse than the $O(\log N)$ latency of the previously described prefix operation tree method.
The latency of the serial prefix computation may be improved by employing a special property of the operation $\times$ so that the outputs on channels Y and R can be produced before the input on the channel L is received. This reduces the latency to $O(\log N)$ on average, the same as the latency of the prefix operation tree.
For example, suppose the operator $\times$ has the property $x \times a = b$ for all values of $x$. If the input on the channel X is $a$, then the outputs on Y and R are equal to $b$, which can be produced without waiting for the input on channel L. Thus, the SERIAL process can be written as follows: ##EQU3## Where represents waiting until one of the guards separated thereby becomes true and then executing the statements indicated by the arrow associated with the true guard. The time for this process to produce the output is data-dependent. In the best case, the time from receiving the inputs to producing the output is constant, much better than the prefix computation tree. In the worst case, the time is $O(N)$, much slower than the $O(\log N)$ latency of the prefix computation tree. However, on average, the latency of the operations is $O(\log N)$, as is known in the art.
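The data dependence of the serial scheme can be illustrated with a Python sketch. The operator used here is an assumption chosen for the illustration and is not taken from the description above: a carry-style operator over the symbols 'K' (kill), 'G' (generate) and 'P' (propagate), in which 'K' and 'G' absorb whatever arrives from the left, so a stage holding one of them does not need its L input. The names `carry_op` and `serial_early`, and the unit-delay model, are likewise hypothetical.

```python
def carry_op(p, x):
    """Hypothetical operator: 'K' and 'G' ignore the left operand, 'P' passes it on."""
    return p if x == "P" else x

def serial_early(xs, first="K"):
    """Simulate a chain of SERIAL stages with early output.

    Returns (outputs, delays), where delays[k] is the number of stages whose
    results stage k had to wait for, counting itself (unit delay per stage).
    """
    ys, delays = [], []
    p, d = first, 0                 # value and delay of the leftmost L input
    for x in xs:
        if x in ("K", "G"):
            y, d = x, 1             # output known from X alone; no wait on L
        else:
            y, d = carry_op(p, x), d + 1   # must wait for the previous stage
        ys.append(y)
        delays.append(d)
        p = y
    return ys, delays

ys, delays = serial_early(list("PPGPKPPP"))
print(ys)            # ['K', 'K', 'G', 'G', 'K', 'K', 'K', 'K']
print(max(delays))   # 4
```

In this model the delay of an output equals the length of the run of 'P' inputs ending at that stage plus one, so an input consisting only of 'P' symbols gives the worst case of N, while an input with no 'P' symbols gives a constant delay.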