1. Field of the Invention
The present invention relates to circuitry for performing computational operations. More specifically, the present invention relates to designs for parallel prefix networks that perform prefix computations, wherein the designs make tradeoffs among the number of logic levels, fanout, and the number of horizontal wiring tracks between logic levels.
2. Related Art
In order to keep pace with continually increasing microprocessor clock speeds, computational circuitry within the microprocessor core must perform computational operations at increasingly faster rates. Parallel prefix networks are widely used to speed up such computational operations, for example in performing a high-speed addition operation.
A parallel prefix circuit computes N outputs $\{Y_N, \ldots, Y_1\}$ from N inputs $\{X_N, \ldots, X_1\}$ using an arbitrary associative two-input operator $\circ$ as follows:

$$
\begin{aligned}
Y_1 &= X_1 \\
Y_2 &= X_2 \circ X_1 \\
Y_3 &= X_3 \circ X_2 \circ X_1 \\
&\;\;\vdots \\
Y_N &= X_N \circ X_{N-1} \circ \cdots \circ X_2 \circ X_1
\end{aligned}
\tag{1}
$$
Common prefix computations include addition, incrementation, priority encoding, etc. Most prefix computations precompute intermediate variables $\{Z_{N:N}, \ldots, Z_{1:1}\}$ from the inputs. The prefix network combines these intermediate variables to form the prefixes $\{Z_{N:1}, \ldots, Z_{1:1}\}$. The outputs are postcomputed from the inputs and prefixes.
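As a minimal sketch of the definition in equation (1), the following Python function evaluates the prefixes one after another. The function name `serial_prefix` and the convention that `xs[0]` holds X1 are assumptions made for illustration only; they are not taken from the source.

```python
from typing import Callable, List, TypeVar

T = TypeVar("T")


def serial_prefix(xs: List[T], op: Callable[[T, T], T]) -> List[T]:
    """Return [Y1, ..., YN], where Yi = Xi o X(i-1) o ... o X1.

    xs[0] is assumed to hold X1. The operator must be associative,
    but it does not have to be commutative.
    """
    ys: List[T] = []
    for x in xs:
        # The newest input is the left operand: Y_i = X_i o Y_(i-1).
        ys.append(x if not ys else op(x, ys[-1]))
    return ys
```

With integer addition as the operator, the result is the list of running sums; with string concatenation, the operand order of equation (1) becomes visible because each output lists the newest input first.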
For example, adders take inputs $\{A_N, \ldots, A_1\}$, $\{B_N, \ldots, B_1\}$, and $C_{in}$ and produce a sum output $\{S_N, \ldots, S_1\}$ using intermediate generate (G) and propagate (P) prefix signals. The addition logic performs the following calculations, and an example of a corresponding 4-bit addition circuit is illustrated in FIG. 1.

Precomputation:
$$
G_{i:i} = A_i \cdot B_i, \qquad G_{0:0} = C_{in}, \qquad P_{i:i} = A_i \oplus B_i, \qquad P_{0:0} = 0 \tag{1}
$$

Prefix:
$$
G_{i:j} = G_{i:k} + P_{i:k} \cdot G_{k-1:j}, \qquad P_{i:j} = P_{i:k} \cdot P_{k-1:j} \tag{2}
$$

Postcomputation:
$$
S_i = P_i \oplus G_{i-1:0} \tag{3}
$$
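The three phases of the adder can be sketched in Python as follows. This is an illustrative model rather than the circuit of FIG. 1: the names `combine` and `prefix_add` are hypothetical, and the prefix phase is evaluated serially for brevity, whereas a parallel prefix network would apply the same operator in a tree.

```python
from typing import List, Tuple

GP = Tuple[int, int]  # (generate, propagate) pair for a span of bits


def combine(upper: GP, lower: GP) -> GP:
    """Prefix operator: (G, P)[i:j] from (G, P)[i:k] and (G, P)[k-1:j]."""
    g_u, p_u = upper
    g_l, p_l = lower
    return (g_u | (p_u & g_l), p_u & p_l)


def prefix_add(a: int, b: int, cin: int, n: int) -> Tuple[int, int]:
    """Add two n-bit values; returns (n-bit sum, carry out)."""
    # Precomputation: position 0 carries Cin, positions 1..n carry the operand bits.
    gp: List[GP] = [(cin & 1, 0)]
    for i in range(n):
        a_i, b_i = (a >> i) & 1, (b >> i) & 1
        gp.append((a_i & b_i, a_i ^ b_i))
    # Prefix: build (G, P)[i:0] serially (assumption: any valid network gives the same values).
    prefixes: List[GP] = [gp[0]]
    for i in range(1, n + 1):
        prefixes.append(combine(gp[i], prefixes[i - 1]))
    # Postcomputation: S_i = P_i XOR G[i-1:0].
    total = 0
    for i in range(1, n + 1):
        total |= (gp[i][1] ^ prefixes[i - 1][0]) << (i - 1)
    # The carry out is the group generate over all positions, G[n:0].
    return total, prefixes[n][0]
```

Calling `prefix_add(a, b, cin, 4)` models a 4-bit addition; the carry out is taken from the final group generate signal.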
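A priority encoder may be structured similarly with different logic functions in each portion:

$$
\begin{aligned}
X_{i:i} &= \overline{A}_i && \text{(bitwise precomputation)} \\
X_{i:j} &= X_{i:k} \cdot X_{k-1:j} && \text{(group logic)} \\
Y_i &= A_i \cdot X_{i-1:1} && \text{(output logic)}
\end{aligned}
\tag{5}
$$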
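The priority encoder equations can be modeled in the same way. In the sketch below, the function name `priority_encode`, the convention that `a[0]` holds A1, and the treatment of the empty group X0:1 as logic 1 are assumptions introduced for illustration.

```python
from typing import List


def priority_encode(a: List[int]) -> List[int]:
    """Return the one-hot outputs [Y1, ..., YN] for input bits [A1, ..., AN].

    Yi is 1 only when Ai is 1 and every lower-numbered input is 0.
    """
    x_prefix = 1  # X[0:1]: empty AND, assumed to be 1
    y: List[int] = []
    for bit in a:
        # Output logic: Y_i = A_i AND X[i-1:1].
        y.append(bit & x_prefix)
        # Group logic, evaluated serially: X[i:1] = NOT(A_i) AND X[i-1:1].
        x_prefix &= 1 - bit
    return y
```

Only the lowest-numbered asserted input produces a 1 at the output; if no input is asserted, all outputs are 0.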
There are many ways to perform the prefix computation. For example, serial-prefix structures such as ripple carry adders are compact but have O(N) latency. Single-level carry lookahead structures reduce this latency by a constant factor. More significantly, parallel prefix circuits use a tree network to reduce this latency to O(log N) and are widely used in fast adders, priority encoders, and other circuits that perform prefix computations. (Priority encoders are described in C. Huang, J. Wang, and Y. Huang, “Design of high-performance CMOS priority encoders and incrementer/decrementers using multilevel lookahead and multilevel folding techniques,” IEEE J. Solid-State Circuits, vol. 37, no. 1, pp. 63–76, January 2002.)
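The reduction from O(N) to O(log N) levels can be illustrated with a short sketch. The schedule below combines each position with the position `d` places lower and doubles `d` at every level; this recursive-doubling arrangement is chosen here only because it is compact to write, and it is one of many possible networks. The name `parallel_prefix` and the returned level count are assumptions for illustration.

```python
from typing import Callable, List, Tuple, TypeVar

T = TypeVar("T")


def parallel_prefix(xs: List[T], op: Callable[[T, T], T]) -> Tuple[List[T], int]:
    """Return ([Y1, ..., YN], number of logic levels used).

    xs[0] is assumed to hold X1. Every cell within a level depends only on
    values from the previous level, so a level models one stage of logic.
    """
    ys = list(xs)
    levels = 0
    d = 1
    while d < len(ys):
        # Position i currently spans d inputs; merge it with the span just below it.
        ys = [op(ys[i], ys[i - d]) if i >= d else ys[i] for i in range(len(ys))]
        d *= 2
        levels += 1
    return ys, levels
```

For 16 inputs the loop runs four times, compared with the 15 sequential combinations of a serial structure.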
Many parallel prefix networks have been described in the literature, especially in the context of addition. The classic networks include Sklansky (see J. Sklansky, “Conditional-sum addition logic,” IRE Trans. Electronic Computers, vol. EC-9, pp. 226–231, June 1960.), Brent-Kung (see R. Brent and H. Kung, “A regular layout for parallel adders,” IEEE Trans. Computers, vol. C-31, no. 3, pp. 260–264, March 1982.) and Kogge-Stone (see P. Kogge and H. Stone, “A parallel algorithm for the efficient solution of a general class of recurrence relations,” IEEE Trans. Computers, vol. C-22, no. 8, pp. 786–793, August 1973.).
An ideal prefix network has $\log_2 N$ stages of logic, a fanout never exceeding 2 at each stage, and no more than one horizontal track of wire at each stage. The classic architectures of Brent-Kung, Sklansky, and Kogge-Stone deviate from this ideal with $2\log_2 N$ stages, fanout of $N/2+1$, and $N/2$ horizontal tracks, respectively. The Ladner-Fischer family of networks offers tradeoffs in fanout and stages between Sklansky and Brent-Kung (see R. Ladner and M. Fischer, “Parallel prefix computation,” J. ACM, vol. 27, no. 4, pp. 831–838, October 1980.). Similarly, the Han-Carlson family of networks trades off stages and wiring between Brent-Kung and Kogge-Stone (see T. Han and D. Carlson, “Fast area-efficient VLSI adders,” Proc. 8th Symp. Comp. Arith., pp. 49–56, September 1987.). Finally, the Knowles family trades off fanout and wiring between Ladner-Fischer and Kogge-Stone (see S. Knowles, “A family of adders,” Proc. 15th IEEE Symp. Comp. Arith., pp. 277–281, June 2001.). The Kowalczuk-Tudor-Mlynek prefix network has also been proposed, but this network is serialized in the middle and hence is not as fast for wide adders (see J. Kowalczuk, S. Tudor, and D. Mlynek, “A new architecture for an automatic generation of fast pipeline adders,” Proc. European Solid-State Circuits Conf., pp. 101–104, 1991.).
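As a worked example of these figures, the sketch below tabulates the stated costs for a given N. It assumes that N is a power of two and that each classic network is ideal in the two attributes the text does not single out, which is an interpretation of "deviate from this ideal ... respectively" rather than a statement from the source. The function name `classic_costs` is hypothetical.

```python
from typing import Dict


def classic_costs(n: int) -> Dict[str, Dict[str, int]]:
    """Tabulate stages, fanout, and tracks for an n-input network (n a power of two)."""
    if n < 2 or n & (n - 1):
        raise ValueError("n must be a power of two and at least 2")
    log2_n = n.bit_length() - 1  # exact log2 for a power of two
    ideal = {"stages": log2_n, "fanout": 2, "tracks": 1}
    return {
        "ideal": dict(ideal),
        # Each classic network is assumed ideal except in the one listed attribute.
        "Brent-Kung": {**ideal, "stages": 2 * log2_n},
        "Sklansky": {**ideal, "fanout": n // 2 + 1},
        "Kogge-Stone": {**ideal, "tracks": n // 2},
    }
```

For N=16, the formulas give 8 stages for Brent-Kung, a fanout of 9 for Sklansky, and 8 horizontal tracks for Kogge-Stone, against an ideal of 4 stages, a fanout of 2, and 1 track.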
Parallel prefix networks are distinguished by the arrangement of prefix cells. FIGS. 2A–2G illustrate seven such networks for N=16. The upper box performs the precomputation and the lower box performs the postcomputation. In the middle, black cells (cross-hatched), gray cells (single-hatched), and white buffers comprise the prefix network. Black cells perform the full prefix operation, as given in equation (2). In certain cases (represented by gray cells), only part of the intermediate variable is required. For example, in many adder cells, only the Gi:0 signal is required, and the Pi:0 signal may be discarded. Such gray cells have lower input capacitance. White buffers are used to reduce the loading of later non-critical stages on the critical path. In FIGS. 2A–2G, the span of bits covered by each cell output is listed near the cell's output. Moreover, the critical path is indicated with a heavy line.
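The difference between black and gray cells can be sketched for the adder case as follows. The function names `black_cell` and `gray_cell` and the tuple layout (generate, propagate) are assumptions for illustration; the logic follows the prefix equation (2).

```python
from typing import Tuple

GP = Tuple[int, int]  # (generate, propagate) for a span of bits


def black_cell(upper: GP, lower: GP) -> GP:
    """Full prefix operation: produces both the group G and the group P."""
    g_u, p_u = upper
    g_l, p_l = lower
    return (g_u | (p_u & g_l), p_u & p_l)


def gray_cell(upper: GP, lower_g: int) -> int:
    """Partial prefix operation: produces only the group G.

    The lower propagate input is not needed, which is why a gray cell
    presents less input capacitance than a black cell.
    """
    g_u, p_u = upper
    return g_u | (p_u & lower_g)
```

For any inputs, the gray cell output matches the generate output of the black cell; it simply omits the propagate computation.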
The prefix networks in FIGS. 2A–2G illustrate the tradeoffs in each network among the number of logic levels, fanout, and horizontal wiring tracks. All three of these factors affect latency. For example, Huang and Ercegovac have shown that networks with a large number of wiring tracks have increased wiring capacitance because the tracks are packed on a tight pitch to achieve reasonable area (see Z. Huang and M. Ercegovac, “Effect of wire delay on the design of prefix adders in deep submicron technology,” Proc. 34th Asilomar Conf. Signals, Systems, and Computers, vol. 2, pp. 1713–1717, 2000.).
Observe that the Brent-Kung, Han-Carlson, and Ladner-Fischer designs never have more than one black (cross-hatched) or gray (single-hatched) cell in each pair of bits on any given row. This suggests that the datapath layout may use half as many columns, which saves area and reduces wire length.
Also note that when the Knowles network is used for addition, propagate must be defined with an OR rather than an XOR. We can see this by considering the gray cell computing $G_{8:0} = G_{8:1} + P_{8:1} \cdot G_{1:0}$. If $A_1 = B_1 = 1$, the logic is correct for $P_1 = A_1 + B_1$ but not for $P_1 = A_1 \oplus B_1$.
Although the above-described parallel prefix networks generally make reasonable tradeoffs between logic levels, fanout and number of horizontal wiring tracks between logic levels, they do not cover all possible points in the design space. Hence, they do not provide optimal parallel prefix networks under certain assumptions for relative costs between logic levels, fanout and wiring tracks.