1. Field of the Invention
The present invention relates to digital circuitry for performing an addition operation. More specifically, the present invention relates to a method and an apparatus for performing an addition operation using a carry circuit with a regular structure.
2. Related Art
Binary addition of two n-bit numbers and a carry-in bit involves computing the sum
$s_i = a_i \oplus b_i \oplus c_i$  (EQ 1)
for each bit $i$ ($n \ge i \ge 1$). The inputs $a_i$ and $b_i$ are given, but the carry-in $c_i$ to each bit must be computed from all of the less significant bits and the carry-in bit (bit 0). Therefore, the fundamental problem of addition is computing the carries for each bit.
There are a multitude of approaches to carry generation offering tradeoffs among speed, area, and ease of layout. Microprocessors generally require maximum speed and thus employ some form of prefix computation to produce all of the carry signals in parallel. Such computations are based on the notion of generate, propagate, and kill (g, p, k) defined for a bit or group of bits. For brevity, we will use the term group, with the understanding that a bit is a group of one. Each group of bits may receive a carry-in signal. We would like to know if the group will produce a carry-out signal. Generate means that the group will produce a carry-out signal independent of whether a carry-in signal arrives. Propagate means the group will produce a carry-out signal if and only if a carry-in signal arrives. Kill means that the group will not produce a carry-out signal even if a carry-in signal arrives.
We define a group with the subscript i:j to span bits i . . . j inclusive ($n \ge i \ge j \ge 0$). A single-bit group corresponding to bit i is given the name i:i. We first define g, p, and k for single-bit groups in terms of the inputs a and b:
$g_{i:i} = a_i \cdot b_i$  (EQ 2)
$p_{i:i} = a_i \oplus b_i$  (EQ 3)
$k_{i:i} = \overline{a_i} \cdot \overline{b_i}$  (EQ 4)
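As an illustration of (EQ 2)-(EQ 4), the following minimal sketch evaluates the single-bit signals for all four input combinations. The function name `bit_gpk` is hypothetical and is introduced here only for illustration.

```python
def bit_gpk(a, b):
    """Return (g, p, k) for one bit position, per (EQ 2)-(EQ 4)."""
    g = a & b                # generate: both inputs are 1
    p = a ^ b                # propagate: exactly one input is 1
    k = (1 - a) & (1 - b)    # kill: both inputs are 0
    return g, p, k


for a in (0, 1):
    for b in (0, 1):
        g, p, k = bit_gpk(a, b)
        # Exactly one of generate, propagate, and kill holds for any bit.
        assert g + p + k == 1
        print(a, b, (g, p, k))
```

The assertion reflects that the three conditions are mutually exclusive and exhaustive for a single bit.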
We now define g, p, and k for multibit groups recursively in terms of shorter groups spanning bits i:m and (m-1):j ($i \ge m > j$). We define the more significant group i:m to be the top part and the less significant group (m-1):j to be the bottom part. A group generates a carry if the top part of the group generates or if the top part propagates and the bottom part generates. A group propagates a carry if both the top and bottom parts of the group propagate. The group kills the carry if the top part kills or if the top part propagates and the bottom part kills.
$g_{i:j} = g_{i:m} + p_{i:m} \cdot g_{(m-1):j}$  (EQ 5)
$p_{i:j} = p_{i:m} \cdot p_{(m-1):j}$  (EQ 6)
$k_{i:j} = k_{i:m} + p_{i:m} \cdot k_{(m-1):j}$  (EQ 7)
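The group recurrence of (EQ 5)-(EQ 7) can be sketched as a single function that merges a top part with a bottom part. The name `combine` and the tuple encoding `(g, p, k)` are assumptions made for this illustration.

```python
def combine(top, bottom):
    """Merge the (g, p, k) of top group i:m with bottom group (m-1):j."""
    g_t, p_t, k_t = top
    g_b, p_b, k_b = bottom
    g = g_t | (p_t & g_b)    # (EQ 5)
    p = p_t & p_b            # (EQ 6)
    k = k_t | (p_t & k_b)    # (EQ 7)
    return g, p, k


GEN, PROP, KILL = (1, 0, 0), (0, 1, 0), (0, 0, 1)

# A propagating top part passes the bottom part's behavior through.
assert combine(PROP, GEN) == GEN
assert combine(PROP, KILL) == KILL
assert combine(PROP, PROP) == PROP
# A generating or killing top part decides the result regardless of the bottom part.
assert combine(GEN, KILL) == GEN
assert combine(KILL, GEN) == KILL
```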
We can consider bit 0 to describe the carry-in signal ($c_{in}$) to the adder. Bit 0 will generate if the adder receives a carry-in and kill otherwise. Therefore, we define the base case:
$g_0 = c_{in}$  (EQ 8)
$p_0 = 0$  (EQ 9)
$k_0 = \overline{c_{in}}$  (EQ 10)
From these equations, we can determine the carry-in to each bit of the adder:
$c_i = g_{(i-1):0}$  (EQ 11)
In other words, we receive a carry-in signal to the ith bit if and only if the less significant bits of the adder, including $c_{in}$, collectively generate a carry. Observe that for all i, the prefix propagate $p_{i:0}$ is 0 because bit 0 always either generates or kills and a group propagates only if all the bits in the group propagate. Therefore, the generate and kill signals $g_{i:0}$ and $k_{i:0}$ are complementary. This may be useful because CMOS implementations of the final sum XOR gate require true and complementary versions of $c_i$.
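The observation that every prefix group i:0 has a propagate signal of 0, and therefore complementary generate and kill signals, can be checked exhaustively for a small adder. The sketch below assumes a 3-bit adder and uses hypothetical helper names (`combine`, `prefixes`); it computes the prefixes serially rather than with any particular tree.

```python
from itertools import product


def combine(top, bottom):
    """Group recurrence of (EQ 5)-(EQ 7) on (g, p, k) tuples."""
    g_t, p_t, k_t = top
    g_b, p_b, k_b = bottom
    return g_t | (p_t & g_b), p_t & p_b, k_t | (p_t & k_b)


def prefixes(a_bits, b_bits, cin):
    """Return the (g, p, k) of groups i:0 for i = 0..n; a_bits[0] is bit 1."""
    group = (cin, 0, 1 - cin)                 # bit 0, per (EQ 8)-(EQ 10)
    out = [group]
    for a, b in zip(a_bits, b_bits):          # bits 1..n, least significant first
        bit = (a & b, a ^ b, (1 - a) & (1 - b))
        group = combine(bit, group)
        out.append(group)
    return out


n = 3  # assumed width, small enough to enumerate every input
for bits in product((0, 1), repeat=2 * n + 1):
    a_bits, b_bits, cin = bits[:n], bits[n:2 * n], bits[-1]
    for g, p, k in prefixes(a_bits, b_bits, cin):
        assert p == 0 and k == 1 - g
```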
In summary, in a prefix adder we perform three operations:
(1) compute the pgk terms for each bit using (EQ 2)-(EQ 4), (EQ 8)-(EQ 10);
(2) compute the carry-in for each bit using (EQ 5)-(EQ 7), (EQ 11); and
(3) compute the sum for each bit with an XOR gate, per (EQ 1).
Steps 1 and 3 are trivial, so we will focus on step 2. The recursive nature of the group pgk equations leads to circuits in the form of trees.
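The three steps can be sketched end to end as follows. This is a minimal illustration, not one of the tree structures discussed in this document: step 2 is evaluated serially, whereas a prefix tree would evaluate the same recurrence in parallel. The names `combine` and `prefix_add` are hypothetical.

```python
def combine(top, bottom):
    """Group recurrence of (EQ 5)-(EQ 7) on (g, p, k) tuples."""
    g_t, p_t, k_t = top
    g_b, p_b, k_b = bottom
    return g_t | (p_t & g_b), p_t & p_b, k_t | (p_t & k_b)


def prefix_add(a, b, cin, n):
    """Add two n-bit integers and cin; return (sum mod 2**n, carry_out)."""
    # Step 1: single-bit pgk terms; index 0 holds the carry-in as bit 0.
    gpk = [(cin, 0, 1 - cin)]
    for i in range(n):
        ai, bi = (a >> i) & 1, (b >> i) & 1
        gpk.append((ai & bi, ai ^ bi, (1 - ai) & (1 - bi)))

    # Step 2: prefix groups i:0, computed serially in this sketch.
    prefix = [gpk[0]]
    for i in range(1, n + 1):
        prefix.append(combine(gpk[i], prefix[i - 1]))

    # Step 3: s_i = p_{i:i} xor c_i, where c_i = g_{(i-1):0} per (EQ 11).
    total = 0
    for i in range(1, n + 1):
        s = gpk[i][1] ^ prefix[i - 1][0]
        total |= s << (i - 1)
    return total, prefix[n][0]


# Exhaustive check against integer addition for an assumed 4-bit width.
for a in range(16):
    for b in range(16):
        for cin in (0, 1):
            s, cout = prefix_add(a, b, cin, 4)
            assert s + (cout << 4) == a + b + cin
```

Step 3 uses the single-bit propagate signal because it equals the XOR of the two input bits, so the sum matches (EQ 1).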
There are a variety of known trees offering tradeoffs between speed, gate count, and bisection width, i.e., the number of wires crossing the middle of the adder. The Brent-Kung tree is shown in FIG. 1B; an Elementary Prefix tree is shown in FIG. 2; and the Kogge-Stone tree is shown in FIG. 3. In each of these adder tree diagrams, a line indicates a bus carrying the three pgk signals. Rounded rectangles indicate logic blocks performing the logic of (EQ 5)-(EQ 7). Each block is labeled with the index of the group pgk signals it computes. Triangles represent buffers and may be omitted altogether if fine-grained pipelining is not required. The inputs at the bottom of each figure are the single-bit pgk signals produced in step 1. The outputs at the top of each figure are the full prefix pgk signals, including $c_i = g_{(i-1):0}$, used as the carry-in to the step 3 XOR.
Table 1 compares the three carry trees as a function of the number of bits n. Delay is measured in number of stages; this is an oversimplification because long wires or large fanouts will increase the delay of each stage. The total number of logic blocks is related to the number of transistors in the tree; buffers are not considered. We define fanout as the number of blocks receiving a signal divided by the number of ganged drivers of that signal. These prior art designs all have a single driver for each signal, so the fanout is simply the number of receivers. The lateral tracks row of the table describes the maximum number of busses running between bits of the tree.
The Brent-Kung tree has the worst delay but the fewest logic blocks. The Elementary Prefix adder uses fewer levels of logic. Unfortunately, the fanout between levels grows with the number of bits being added. This increases the delay of the adder and requires some cells to use wider transistors than others to drive the greater loads. The irregular widths increase the number of unique cells that must be laid out and verified. The Kogge-Stone adder solves the fanout problem by distributing computations, achieving good delay and constant fanout at the expense of more logic blocks than the Brent-Kung tree. However, the distributed computation leads to a number of lateral tracks that increases with n. These long wires occupy much area and consume more power when driven. All of the trees also involve driving wires of different lengths at different stages. The capacitance, and sometimes the resistance, of the long wires dominates the stage delay and requires larger drivers in some cells, reducing the regularity or performance of the design.
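To make the stage count versus block count tradeoff concrete, the following sketch evaluates the commonly taught Kogge-Stone recurrence, in which each stage merges every position with the position a power-of-two distance below it. It is an assumption-laden illustration: it is not derived from FIG. 3, its counts are not the entries of Table 1, and the names `combine` and `kogge_stone_prefixes` are hypothetical.

```python
def combine(top, bottom):
    """Group recurrence of (EQ 5)-(EQ 7) on (g, p, k) tuples."""
    g_t, p_t, k_t = top
    g_b, p_b, k_b = bottom
    return g_t | (p_t & g_b), p_t & p_b, k_t | (p_t & k_b)


def kogge_stone_prefixes(gpk):
    """gpk[i] is the single-bit (g, p, k) of bit i.

    Returns (prefixes, stages, blocks), where prefixes[i] describes group i:0.
    """
    cur, d, stages, blocks = list(gpk), 1, 0, 0
    while d < len(cur):
        nxt = list(cur)
        for i in range(d, len(cur)):
            # cur[i] spans d bits ending at i; cur[i - d] spans the d bits below.
            nxt[i] = combine(cur[i], cur[i - d])
            blocks += 1
        cur, d, stages = nxt, 2 * d, stages + 1
    return cur, stages, blocks


GEN, PROP, KILL = (1, 0, 0), (0, 1, 0), (0, 0, 1)
bits = [KILL, GEN, PROP, PROP, KILL, PROP, GEN, PROP]   # bit 0 first
prefixes, stages, blocks = kogge_stone_prefixes(bits)
print(stages, blocks)   # stage count grows logarithmically, block count faster
```

In this sketch, each stage reads values from positions a distance d away, which is the source of the long lateral wires noted above.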
What is needed is a method and an apparatus for performing a fast addition operation with a limited fanout for logic blocks and with a limited number of lateral tracks between successive stages of logic blocks.
One embodiment of the present invention provides an apparatus for facilitating an addition operation between two N-bit numbers, wherein the apparatus has a regular structure. The apparatus includes a carry circuit for generating at least one carry signal for the addition operation, wherein the carry circuit includes a plurality of logic blocks organized into rows that form approximately log N successive stages of logic blocks. Each of these logic blocks provides current for at most a constant number of inputs in a successive stage of logic blocks. Additionally, within a given stage of logic blocks, outputs from multiple logic blocks are ganged together to drive a signal line that feeds multiple inputs in a successive stage of logic blocks. Furthermore, there are at most a constant number of lateral tracks in a planar layout of signal lines between the successive stages of logic blocks. Hence, the present invention can reduce layout and design effort, while producing a regularized layout that takes up a small amount of space on a semiconductor chip.
Note that this embodiment of the present invention offers the minimal number of stages, like the Kogge-Stone or Elementary Prefix adder, but a constant number of lateral tracks, like the Brent-Kung adder, while preserving constant fanout.
One embodiment of the present invention uses two types of gates within the prefix tree, providing a simple and regular layout.
One embodiment of the present invention additionally includes a plurality of XOR gates coupled to the carry circuit in order to perform the addition operation.
One embodiment of the present invention additionally includes a plurality of buffers located within the successive stages to facilitate pipelining between the successive stages, so that multiple addition operations can flow through the apparatus at the same time.
One embodiment of the present invention additionally includes an asynchronous control mechanism that facilitates an asynchronous transfer of data between successive stages of logic blocks.
In one embodiment of the present invention, outputs of the plurality of logic blocks have drivers of the same size.
In one embodiment of the present invention, there are $(3N/2)\log_2 N - N/2$ logic blocks excluding buffers. In this embodiment, the maximum fanout of any output from a logic block is three, and the maximum number of lateral tracks between the successive stages of logic blocks is two.
In one embodiment of the present invention, there are $N(\log_2 N - 1/2)$ logic blocks. In this embodiment, the maximum fanout of any output from a logic block is four, and the maximum number of lateral tracks between the successive stages of logic blocks is two. In a variation on this embodiment, buffers are used to pipeline early results.
In one embodiment of the present invention, bits of the apparatus are folded so that bit $N-i$ is adjacent to bit $i$ in order to reduce resistance caused by long wires.
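For a sense of scale, the two logic block counts stated above can be evaluated for the adder widths named elsewhere in this document (16, 32, 64, and 128). The sketch assumes N is a power of two so that the base-2 logarithm is an integer; the function names are hypothetical.

```python
def log2_exact(n):
    """Base-2 logarithm of n, assuming n is a power of two."""
    assert n > 0 and n & (n - 1) == 0, "this sketch assumes N is a power of two"
    return n.bit_length() - 1


def blocks_fanout_three(n):
    """(3N/2) log2 N - N/2, the count stated for the fanout-three embodiment."""
    return (3 * n // 2) * log2_exact(n) - n // 2


def blocks_fanout_four(n):
    """N (log2 N - 1/2), the count stated for the fanout-four embodiment."""
    return n * log2_exact(n) - n // 2


for n in (16, 32, 64, 128):
    print(n, blocks_fanout_three(n), blocks_fanout_four(n))
```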
In one embodiment of the present invention, each logic block generates the following signals: $g\_h_{i:j} = g\_h_{i:m} + g\_h_{(m-1):j} \cdot k\_l_{i:m}$; $g\_l_{i:j} = k\_h_{i:m} + g\_l_{(m-1):j} \cdot g\_l_{i:m}$; $k\_h_{i:j} = k\_h_{i:m} + k\_h_{(m-1):j} \cdot g\_l_{i:m}$; and $k\_l_{i:j} = g\_h_{i:m} + k\_l_{(m-1):j} \cdot k\_l_{i:m}$.
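One way to read these four equations is as a dual-rail form of the generate and kill recurrences, with each _l signal being the complement of the corresponding _h signal. That reading is an assumption of the sketch below, not a statement from this document. Under that assumption, the sketch checks all nine combinations of top and bottom parts against the three-signal recurrence $g_{i:j} = g_{i:m} + p_{i:m} \cdot g_{(m-1):j}$ and $k_{i:j} = k_{i:m} + p_{i:m} \cdot k_{(m-1):j}$. All function names are hypothetical.

```python
from itertools import product


def single_rail(top, bottom):
    """Generate and kill of the merged group from (g, p, k) tuples."""
    g_t, p_t, k_t = top
    g_b, p_b, k_b = bottom
    return g_t | (p_t & g_b), k_t | (p_t & k_b)


def dual_rail(top, bottom):
    """The four dual-rail equations on (g_h, g_l, k_h, k_l) tuples."""
    gh_t, gl_t, kh_t, kl_t = top
    gh_b, gl_b, kh_b, kl_b = bottom
    g_h = gh_t | (gh_b & kl_t)
    g_l = kh_t | (gl_b & gl_t)
    k_h = kh_t | (kh_b & gl_t)
    k_l = gh_t | (kl_b & kl_t)
    return g_h, g_l, k_h, k_l


def to_rails(gpk):
    """Assumed encoding: each _l rail is the complement of its _h rail."""
    g, _, k = gpk
    return g, 1 - g, k, 1 - k


STATES = [(1, 0, 0), (0, 1, 0), (0, 0, 1)]   # generate, propagate, kill
for top, bottom in product(STATES, repeat=2):
    g, k = single_rail(top, bottom)
    assert dual_rail(to_rails(top), to_rails(bottom)) == (g, 1 - g, k, 1 - k)
```

Under this encoding no separate propagate signal is needed, since a group propagates exactly when both of its _l rails are 1.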
In one embodiment of the present invention, each logic block generates the following signals: $g_{i:j} = g_{i:m} + p_{i:m} \cdot g_{(m-1):j}$; $p_{i:j} = p_{i:m} \cdot p_{(m-1):j}$; and $k_{i:j} = k_{i:m} + p_{i:m} \cdot k_{(m-1):j}$.
In one embodiment of the present invention, N equals one of 16, 32, 64, and 128.
In one embodiment of the present invention, the carry circuit has a radix higher than two.