1. Field of the Invention
This invention relates generally to the configuration of running-sum adder networks and, more particularly, to a methodology for systematically selecting, subject to given constraints, elements composing the adder networks to thereby physically realize the adder networks.
2. Description of the Background Art
A running-sum adder physically realizes, for a given sequence of numbers $x_0, x_1, \ldots, x_{M-1}$, the summations
$$\sum_{i=0}^{j} x_i$$
for each j, with $0 \le j \le M-1$. More specifically, the jth running sum of the given sequence is $x_0 + x_1 + \cdots + x_j$. If an initial value y is present, the jth running sum is then defined as $y + x_0 + x_1 + \cdots + x_j$.
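As a point of reference, the definition above can be sketched sequentially in Python. This is only an illustration of the arithmetic, not of any hardware arrangement; the function name `running_sums` and the parameter `initial` are hypothetical names chosen for the example.

```python
from itertools import accumulate


def running_sums(xs, initial=None):
    """Return the running sums x0, x0+x1, ..., x0+...+x_{M-1}.

    If an initial value y is supplied, every running sum is offset by y.
    (Function and parameter names are hypothetical, for illustration only.)
    """
    sums = list(accumulate(xs))
    if initial is not None:
        sums = [initial + s for s in sums]
    return sums
```

For instance, `running_sums([3, 1, 4, 1])` yields one value per position of the input, and passing `initial=10` shifts each of those values by 10.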
The problem of determining the running-sum is usually encountered when designing network switches/routers, computers or computer networks. A running-sum adder is a device which was devised to compute the running-sum. A running-sum adder can be found in concentrators in which the adder is used as a counter to count the number of active packets prefixed to a switching network. It can also be found in a copy network to help replicate packets, such as disclosed in U.S. Pat. No. 4,813,038 issued to Lee.
The running-sum adder may readily be implemented with finite memory. However, in computers or switching networks with distributed architecture, it is desirable to have a distributed solution as well. Now the problem becomes one of finding an efficient distributed solution; the meaning of the term "distributed" is discussed below once certain terminology is introduced.
With reference to FIG. 1, generic M×M distributed running-sum adder network 100 (or, synonymously, "adder network") is a network with an array of M external inputs (101) (or simply called inputs when referring to the whole adder network), a number of interconnected primitive elements of unidirectional transmission (usually including addition elements, fan-out elements, and delay elements, as described later), and an array of M external outputs (103) (or outputs when referring to the whole adder network). To facilitate the description, both the M inputs and M outputs of the adder network have been labeled with 0, 1, 2, . . . , and M−1 from top to bottom. The task of the adder network is that, when each input i is given a signal $x_i$ (105), then adder network 100 performs the corresponding calculation in a distributed fashion such that all outputs generate the results simultaneously, where each output j outputs a signal $y_j$ representing the sum of input signals $x_0, x_1, \ldots, x_j$ (107). The calculation is distributed in the sense that each primitive element performs its calculation or operation locally (that is, only based on its local input(s) and generates the output signal(s) at its local output(s)) without having to know the information at any other parts of the network.
Conventionally, there are several techniques to implement a running-sum adder network. The techniques differ mainly in the internal connectivity of the primitive elements in the network. When comparing the different implementations, typically two measures are considered, namely: the size, which is the number of adder elements in the network (in some other contexts, the number of fan-out elements is also counted; however, in this immediate context, the cost of a fan-out element is much less than that of an adder element, so fan-out elements need not be accounted for); and the depth, which is the number of stages of the network. More specifically, the number of stages of a network is the maximum number of adder elements or delay elements on any path for a signal to traverse from an input to an output of the adder network. In practice, the depth of the adder network corresponds to the computational time in a parallel computation environment, whereas the size represents the amount of hardware required.
One simple implementation of the running-sum adder network, referred to as the serial implementation, has the minimum size at the cost of depth. Its size is M−1, but its depth is also M−1. With reference to FIG. 2, there is shown a serial implementation of an 8×8 running-sum adder network with only seven 2×1 adder elements (202) and 1×2 fan-out elements (204). In network 200, the running-sum computation is not synchronous. The input of the lower 2×1 adder element (206) receives the computed running-sum from the output of the upper 2×1 adder element (208). The arrangement of network 200 is such that the depth of the implementation increases linearly with the dimension of the adder network, which would incur significant delay for large running-sum adder networks.
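The serial arrangement can be modeled in a few lines of Python to make the size and depth figures concrete. This is a behavioral sketch only: the function name `serial_running_sum` and its return convention are assumptions made for illustration, and fan-out elements are not counted, in keeping with the size measure used above.

```python
def serial_running_sum(xs):
    """Model a serial chain of 2x1 adders (hypothetical helper name).

    Each adder adds the next input to the running sum fanned out from the
    adder above it. Returns (outputs, adder_count, depth).
    """
    if not xs:
        return [], 0, 0
    outputs = [xs[0]]  # output 0 passes input 0 through without an adder
    adders = 0
    for x in xs[1:]:
        outputs.append(outputs[-1] + x)  # one 2x1 adder per remaining line
        adders += 1
    depth = adders  # the last output depends on every adder in the chain
    return outputs, adders, depth
```

With eight inputs the model uses seven adders and has depth seven, matching the M−1 figures stated for the 8×8 example.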
On the other hand, it is not difficult to find a network with depth exactly equal to $\lceil \log_2 M \rceil$. Such a design, referred to as the parallel design, has the minimum possible depth when the fan-out elements are not counted. However, this recursive construction yields a network of size $\Omega(M \lceil \log_2 M \rceil)$ (where $\Omega$ means "upper-bounded by the order of"). With reference to FIG. 3, there is shown sample network 300 of size 8×8. In this implementation, the signal at output j at the output of stage 1 is $x_j^{[1]} = x_j + x_{j-1}$; at the output of stage 2, it is $x_j^{[2]} = x_j^{[1]} + x_{j-2}^{[1]} = x_j + x_{j-1} + x_{j-2} + x_{j-3}$; thus the signal at output j of the adder network is $x_j^{[\log_2 M]} = x_j + x_{j-1} + \cdots + x_0$.
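The stage-by-stage recurrence above can be modeled as follows. The sketch assumes that a line j with no partner line (j smaller than the stage offset) simply passes its value through without an adder; the text does not state how those lines are handled, so the adder count returned here reflects that assumption and may differ from the element count of FIG. 3. The function name `parallel_running_sum` is hypothetical.

```python
def parallel_running_sum(xs):
    """Model the parallel design stage by stage (hypothetical helper name).

    At stage s, line j adds the value held on line j - 2**(s-1), if that
    line exists. Assumption: lines without a partner pass through unchanged
    and use no adder. Returns (outputs, adder_count, depth).
    """
    values = list(xs)
    m = len(values)
    adders = 0
    depth = 0
    offset = 1  # equals 2**(s-1) at stage s
    while offset < m:
        values = [
            values[j] + values[j - offset] if j >= offset else values[j]
            for j in range(m)
        ]
        adders += m - offset  # one adder for every line j >= offset
        depth += 1
        offset *= 2
    return values, adders, depth
```

Under the stated assumption, the depth of the model equals $\lceil \log_2 M \rceil$, while the adder count grows faster than linearly in M, which is the trade-off the paragraph describes.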
Another example of a running-sum adder network is disclosed in U.S. Pat. No. 4,813,038 ('038), briefly alluded to earlier. The network of '038 is also a parallel design of size $\Omega(M \lceil \log_2 M \rceil)$. More specifically, the longest chain of signal progression in this M×M network has a length $M + \log_2 M$. This is a lower bound on the signal propagation delay through the network, regardless of the implementation. Moreover, if the implementation is one that maintains signal synchronization across each stage of elements, the bit time would have to be long enough to cover the signal propagation time through N/2 adders. This implies a bit rate proportional to 1/N times the bit rate allowed by an individual element. Accordingly, this severely limits the speed of the network.
Clearly, the serial design is simpler than the parallel one and it does provide significant hardware savings. However, the former incurs considerable computational time because of its serial operations.
The art is devoid of teachings and suggestions whereby the network is balanced in the sense that the network exhibits properties of both small size and low latency by recursive interconnection of smaller adder networks. Thus, a need exists in the art for a systematic procedure to configure a running-sum adder network given the size and depth requirements. As part of this procedure, it is necessary to obtain a tractable, effective solution that gives the desired balanced result.
These shortcomings and other limitations and deficiencies are obviated in accordance with the present invention by a method, and concomitant circuitry, to systematically and efficiently physically realize a running-sum adder network based upon a meritorious balance between size and depth.
In accordance with the broad aspect of the present invention, a method is set forth for physically implementing a running-sum adder network. The method includes the steps of (1) systematically selecting, based upon a prescribed mathematical algorithm, elements to physically realize the network, the algorithm being determined to satisfy the requirements of size and depth, and (2) interconnecting the selected elements to realize the network.
A preferred method for realizing a $2^{k+1} \times 2^{k+1}$ running-sum adder network having 2k+1 stages includes: (a) logically stacking two $2^k \times 2^k$ adder networks vertically, resulting in a first logical network having $2^{k+1}$ lines and 2k−1 stages; (b) logically horizontally splitting the first logical network into two halves, k−1 stages on the left and k stages on the right, and then inserting two stages in-between the two halves, resulting in a second logical network having $2^{k+1}$ lines and 2k+1 stages; and (c) physically inserting k+1 adder elements at $(x, 2^k)$ and k+1 fan-out elements at $(x, 2^k - 2^{x-1})$, where x=1, 2, . . . , k+1, respectively, and then connecting an output of each inserted fan-out element with an input of each inserted adder element in the same stage.
In accordance with preferred circuitry, a $2^{k+1} \times 2^{k+1}$ running-sum adder network having 2k+1 stages includes:
(a) two $2^k \times 2^k$ adder networks stacked vertically, resulting in a first sub-network having $2^{k+1}$ lines and 2k−1 stages,
(b) two stages in-between two parts of the first sub-network, the parts obtained by splitting the first sub-network into k−1 stages on the left and k stages on the right, resulting in a second sub-network having $2^{k+1}$ lines and 2k+1 stages, and
(c) k+1 adder elements at $(x, 2^k)$ of the second sub-network and k+1 fan-out elements at $(x, 2^k - 2^{x-1})$ of the second sub-network, where x=1, 2, . . . , k+1, respectively, with an output from each of the k+1 fan-out elements being connected to an input of the corresponding one of the k+1 adder elements in the same stage.
As a primary feature, the subject matter of the present invention engenders a balance between the two extremes (one with small size but high latency and the other with low latency but large size). On one hand, according to the design methodology, the size of an M×M adder network is only about 2M, compared with $\Omega(M \lceil \log_2 M \rceil)$ for the parallel design. As the size increases linearly with M, the savings are great when M is large. On the other hand, its depth is on the order of $\log_2 M$ and, therefore, is much shorter than that of the serial implementation. Accordingly, it is especially cost-effective for those applications that require a large running-sum adder with small latency. Moreover, its design is also simple. In addition, it can be constructed recursively from the smaller adder networks.
In general, a $2^{k+1} \times 2^{k+1}$ running-sum adder is composed of two $2^k \times 2^k$ adder networks, plus k+1 pairs of 2×1 adder elements and 1×2 fan-out elements. The size of such a multi-stage network is just $2^{k+2} - k - 3$ and the depth is 2k+1. Moreover, recursive construction improves the structural modularity of the adder network.
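The stated size and depth follow from the recursive composition: each doubling uses two copies of the smaller network, adds k+1 adder elements, and adds two stages. The sketch below iterates that recurrence so it can be compared with the closed forms. It assumes, as a base case not spelled out in the text, that a 2×2 network consists of a single adder in a single stage; the function name `size_and_depth` is hypothetical.

```python
def size_and_depth(n):
    """Size and depth of a 2**n x 2**n recursively built network, n >= 1.

    Assumed base case (not stated in the text): a 2x2 network is one adder
    in one stage. Each step from 2**k to 2**(k+1) lines doubles the adder
    count, adds k+1 adders, and adds two stages.
    """
    if n < 1:
        raise ValueError("n must be at least 1")
    size, depth = 1, 1  # assumed 2x2 base case
    for k in range(1, n):
        size = 2 * size + (k + 1)
        depth = depth + 2
    return size, depth
```

Under this base case, the recurrence reproduces the closed forms $2^{k+2} - k - 3$ and 2k+1 for a $2^{k+1} \times 2^{k+1}$ network; for example, an 8×8 network (k=2) comes out at 11 adders in 5 stages.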