1. Field of the Invention
The present invention relates to a binary digital arithmetic unit, and more particularly to a tree circuit used in a parallel multiplier circuit which multiplies a multiplicand by a multiplier, each being a signed number in the two's complement representation, to obtain a product that is a signed number in the two's complement representation.
2. Description of the Background Art
In general, recent microprocessors and DSPs (Digital Signal Processors) are equipped with a parallel multiplier for fast execution of multiplication instructions. The parallel multiplier circuit generates a plurality of partial products from the multiplier and the multiplicand given as input operands and adds up these partial products to obtain the multiplication result, i.e., a product. Accordingly, techniques for speeding up the operation of the parallel multiplier circuit have been proposed to attain the following two objects.
The first object is to reduce the number of partial products to be generated. To attain this object, the Booth algorithm, especially the secondary Booth algorithm, is typically used. The second object is to perform fast addition of a plurality of the partial products. To attain this object, a circuit system to achieve a parallel operation of fast adder circuits is required.
A background-art fast multiplier circuit will be discussed, taking as an example a circuit for performing multiplication of a 32-bit signed multiplicand X in the two's complement representation by a 32-bit signed multiplier Y in the two's complement representation to obtain a 64-bit signed product Z in the two's complement representation (the circuit is referred to simply as "32.times.32 multiplier" hereinafter).
If a partial product were generated for each bit of the multiplier Y, thirty-two partial products would be generated and would all need to be added up. According to the secondary Booth algorithm, however, a set of adjacent three bits of the multiplier Y is dealt with as a unit to reduce the number of partial products. Thus, the first object is attained.
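Specifically, assuming that y.sub.i (i=0 to 31) is 0 or 1, the multiplier Y, a 32-bit signed number in the two's complement representation, is expressed as

$$Y = -y_{31}\cdot 2^{31} + \sum_{i=0}^{30} y_i\cdot 2^{i} = \sum_{j=0}^{15}\left(y_{2j-1} + y_{2j} - 2y_{2j+1}\right)\cdot 2^{2j} \qquad (1)$$

(where y.sub.-1 .ident.0), and the product Z is expressed as

$$Z = X\cdot Y = \sum_{j=0}^{15} P_j, \qquad P_j = \left(y_{2j-1} + y_{2j} - 2y_{2j+1}\right)\cdot X\cdot 2^{2j} \qquad (2)$$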
Thus, to obtain the product Z, it suffices to add up sixteen partial products P.sub.j (j=0 to 15).
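The recoding described above can be checked numerically. The following Python sketch is illustrative only and is not part of the circuit described; the function names `booth_digits` and `from_digits` are hypothetical. It assumes the standard form of the secondary Booth recoding, in which the j-th digit is y.sub.2j-1 + y.sub.2j - 2y.sub.2j+1 with y.sub.-1 = 0, and confirms that sixteen such digits reproduce a 32-bit signed multiplier.

```python
def booth_digits(y: int, width: int = 32) -> list[int]:
    """Recode a signed `width`-bit multiplier into width/2 radix-4 Booth digits."""
    pattern = y & ((1 << width) - 1)          # two's complement bit pattern of y

    def bit(i: int) -> int:
        return 0 if i < 0 else (pattern >> i) & 1   # y_{-1} is defined as 0

    # Each digit is y_{2j-1} + y_{2j} - 2*y_{2j+1}, a value in {-2, -1, 0, 1, 2}.
    return [bit(2 * j - 1) + bit(2 * j) - 2 * bit(2 * j + 1)
            for j in range(width // 2)]


def from_digits(digits: list[int]) -> int:
    """Rebuild the signed value: digit j carries the weight 2^(2j)."""
    return sum(d * 4 ** j for j, d in enumerate(digits))


if __name__ == "__main__":
    for y in (0, 1, -1, 2**31 - 1, -2**31, 123456789):
        assert from_digits(booth_digits(y)) == y
```

Multiplying each digit by X and by its weight 2.sup.2j gives the corresponding partial product, so sixteen additions replace thirty-two.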
Table 1 shows a truth table of the secondary Booth algorithm.
TABLE 1
______________________________________
y.sub.2j+1  y.sub.2j  y.sub.2j-1  P.sub.j                    pp.sub.j   pc.sub.j
______________________________________
0           0         0           0                          0          0
0           0         1           +X .multidot. 2.sup.2j     X          0
0           1         0           +X .multidot. 2.sup.2j     X          0
0           1         1           +X .multidot. 2.sup.2j+1   2X         0
1           0         0           -X .multidot. 2.sup.2j+1   .about.2X  1
1           0         1           -X .multidot. 2.sup.2j     .about.X   1
1           1         0           -X .multidot. 2.sup.2j     .about.X   1
1           1         1           0                          0          0
______________________________________
In Table 1, ".about." denotes logical inversion, and there are eight combinations of possible values of the adjacent three bits of the multiplier Y. Accordingly, the partial product P.sub.j takes one of 0, +X.multidot.2.sup.2j, +X.multidot.2.sup.2j+1, -X.multidot.2.sup.2j and -X.multidot.2.sup.2j+1. In binary digital arithmetic operation using the two's complement representation system, "multiplication of data by two" is achieved by shifting the whole data upward by one bit, and "sign-inversion" is achieved by inverting all the bits of the data (by which the value of the first element pp.sub.j of the partial product is inverted) and adding 1 to the least significant bit (by which the second element pc.sub.j of the partial product takes "1"). Then, the partial product P.sub.j is expressed as

$$P_j = (\mathrm{pp}_j + \mathrm{pc}_j)\cdot 2^{2j} \qquad (3)$$
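The split of each partial product into a first element pp.sub.j and a second element pc.sub.j can be illustrated as follows. This is a minimal sketch, not the circuit itself: the names `booth_elements` and `booth_multiply` are hypothetical, and Python's unbounded integers are assumed to stand in for sign-extended two's complement data, so that inverting all bits and adding 1 yields the negated value.

```python
def booth_elements(x: int, y_hi: int, y_mid: int, y_lo: int) -> tuple[int, int]:
    """Return (pp_j, pc_j) for the bit triple (y_{2j+1}, y_{2j}, y_{2j-1})."""
    table = {
        (0, 0, 0): (0, 0),
        (0, 0, 1): (x, 0),
        (0, 1, 0): (x, 0),
        (0, 1, 1): (2 * x, 0),        # shift up by one bit
        (1, 0, 0): (~(2 * x), 1),     # invert all bits of 2X; pc_j supplies the +1
        (1, 0, 1): (~x, 1),
        (1, 1, 0): (~x, 1),
        (1, 1, 1): (0, 0),
    }
    return table[(y_hi, y_mid, y_lo)]


def booth_multiply(x: int, y: int, width: int = 32) -> int:
    """Sum the width/2 partial products P_j = (pp_j + pc_j) * 2^(2j)."""
    pattern = y & ((1 << width) - 1)

    def bit(i: int) -> int:
        return 0 if i < 0 else (pattern >> i) & 1

    z = 0
    for j in range(width // 2):
        pp, pc = booth_elements(x, bit(2 * j + 1), bit(2 * j), bit(2 * j - 1))
        z += (pp + pc) << (2 * j)
    return z


if __name__ == "__main__":
    assert booth_multiply(12345, -6789) == 12345 * -6789
```

In hardware the additions of pp.sub.j and pc.sub.j are not performed one after another as in this loop; they are distributed over the tree circuit discussed below.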
Accordingly, to add up the sixteen partial products P.sub.0 to P.sub.15 generated according to the secondary Booth algorithm, two elements need to be added for each j (j=0 to 15): the first element pp.sub.j of the partial product, which has thirty-three bits with its least significant bit on the 2j-th bit position, and the second element pc.sub.j of the partial product, which has one bit on the 2j-th bit position. The bit positions of the first element pp.sub.j range from 2j to 2j+32; the position one bit higher than the thirty-second bit position is needed because the 32-bit data may be multiplied by 2. In other words, 2.sup.2j represents the scale of the first and second elements.
To attain the second object, specifically, to perform fast addition of partial products, a carry-save technique, a Wallace-Tree technique and the like are typically used as the circuit system to achieve the parallel operation of the fast adder circuits. Using any one of the techniques, a plurality of (sixteen here) intermediate sums are added in the form of a tournament while being compressed, to ultimately provide two intermediate sums (the sums are referred to as "eventual intermediate sums" hereinafter). Carry signals generated during the process of obtaining the eventual intermediate sums are postponed to the subsequent-stage addition. Propagation of the carry signals to higher-order bits is thus performed in parallel and the critical path (the path that limits the operating speed of the circuit) is shortened on the whole, to ensure fast addition.
Final addition of the two eventual intermediate sums provides the product of the multiplicand and the multiplier. The final addition is performed at high speed on the two sums, each consisting of a plurality of bits, by using, e.g., a carry-lookahead system. The final addition will not be discussed since it is a well-known technique.
The technique, used to attain the second object, of adding a plurality of partial products in the form of a tournament while sequentially compressing them to eventually generate the two eventual intermediate sums will be examined in detail, and a problem of the background art will then be presented.
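The idea of postponing carries can be sketched in software. The following Python fragment is a simplified illustration under stated assumptions, not the tree circuit of the figures: `carry_save_add` and `reduce_to_two` are hypothetical names, operands are non-negative integers of unbounded width, and operands are grouped three at a time per stage without regard to bit position.

```python
def carry_save_add(a: int, b: int, c: int) -> tuple[int, int]:
    """Compress three operands into a sum word and a carry word.

    No carry ripples between bit positions: every bit is computed
    independently, and the carry word is simply shifted up by one bit.
    """
    sum_word = a ^ b ^ c
    carry_word = ((a & b) | (b & c) | (a & c)) << 1
    return sum_word, carry_word


def reduce_to_two(operands: list[int]) -> list[int]:
    """Repeat carry-save stages until at most two operands remain."""
    ops = list(operands)
    while len(ops) > 2:
        nxt = []
        for k in range(0, len(ops) - 2, 3):       # full groups of three
            nxt.extend(carry_save_add(*ops[k:k + 3]))
        nxt.extend(ops[len(ops) - len(ops) % 3:])  # leftovers pass through
        ops = nxt
    return ops


if __name__ == "__main__":
    partials = [3 * k + 1 for k in range(16)]
    final_two = reduce_to_two(partials)
    assert sum(final_two) == sum(partials)
```

Only the addition of the last two words requires carry propagation across the full width, which is why a fast final adder is applied once at the end.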
FIG. 13 is a block diagram of the background art to implement the Wallace-Tree technique. In this figure, 4-input 2-output addition blocks 22a to 22g are interconnected in a tree structure. Further, a 3-input 2-output addition block 24a is provided to receive an output of the 4-input 2-output addition block 22g.
FIGS. 14A to 14C are block diagrams cooperatively showing the detail of FIG. 13. FIG. 14 is a schematic diagram showing the connection between FIGS. 14A to 14C. FIG. 14A is continuous with FIG. 14B at a virtual line Q19--Q19 and FIG. 14B is continuous with FIG. 14C at a virtual line Q20--Q20. The width of each addition block corresponds to the bit width thereof and the position in a horizontal direction corresponds to the bit position.
FIGS. 15A to 15C are block diagrams cooperatively illustrating a configuration of the 4-input 2-output addition block 22a. FIG. 15A is continuous with FIG. 15B at a virtual line Q22--Q22 and FIG. 15B is continuous with FIG. 15C at a virtual line Q23--Q23. The 4-input 2-output addition block 22a consists of thirty-five 4-input 2-output adders 200, each for one bit, which are connected in series. A carry-out Co of the 4-input 2-output adder 200 on each bit position becomes a carry-in Ci of the 4-input 2-output adder 200 on the next higher bit position. If the carry-out Co is not dependent on the carry-in Ci in a 4-input 2-output adder for one bit, the carry-out Co is not propagated beyond the next bit within the 4-input 2-output addition block consisting of the 4-input 2-output adders connected in series.
In the background-art addition of partial products shown in FIGS. 14A to 14C, the 3-input 2-output addition block 24a in the fourth stage is needed only for adding the second element pc.sub.15 having one bit of the partial product to the lower output so.sub.27 and the upper output co.sub.27 of the 4-input 2-output addition block 22g. The second element pc.sub.15 of the partial product is an obstacle to the speeding-up of the multiplier and the 3-input 2-output addition block 24a is an obstacle to high integration of the circuit.
Specifically, the speed of the multiplier is estimated as follows. The 3-input 2-output addition block 24a consists of 3-input 2-output adders 400 each for one bit connected in series, and one of them is shown in a circuit diagram of FIG. 16. The truth table of the 3-input 2-output adder 400 is shown in Table 2.
TABLE 2
______________________________________
A    B    C    SO   CO
______________________________________
0    0    0    0    0
0    0    1    1    0
0    1    0    1    0
0    1    1    0    1
1    0    0    1    0
1    0    1    0    1
1    1    0    0    1
1    1    1    1    1
______________________________________
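Table 2 is the truth table of a one-bit full adder, which can be confirmed with a short check. The sketch below is illustrative; the name `adder_3_2` is hypothetical, and the XOR/majority expressions are one common realization assumed here, not necessarily the gate arrangement of FIG. 16.

```python
from itertools import product


def adder_3_2(a: int, b: int, c: int) -> tuple[int, int]:
    """One-bit 3-input 2-output adder: returns (SO, CO)."""
    so = (a ^ b) ^ c                       # two XOR stages in series
    co = (a & b) | (b & c) | (a & c)       # majority of the three inputs
    return so, co


def satisfies_table_2() -> bool:
    """True if A + B + C == SO + 2*CO for all eight input combinations."""
    return all(a + b + c == so + 2 * co
               for a, b, c in product((0, 1), repeat=3)
               for so, co in [adder_3_2(a, b, c)])


if __name__ == "__main__":
    assert satisfies_table_2()
```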
In general, the delay time of one stage of exclusive OR gate (referred to as "XOR" hereinafter) is larger than that of an AND gate, an OR gate or a compound gate, and is equivalent to about two stages of such gates. For example, the critical path of the 3-input 2-output adder 400 of FIG. 16 goes through two stages of XORs.
FIG. 17 is a block diagram showing a configuration of the 4-input 2-output adder 200. The 4-input 2-output adder 200 for one bit can be constituted of two 3-input 2-output adders 400 for one bit. In this case, the critical path of the adder goes through four stages of XORs.
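The composition of a 4-input 2-output adder from two 3-input 2-output adders can be sketched as follows. The wiring shown is an assumption (the exact connections of FIG. 17 are not reproduced here), and `full_adder` and `adder_4_2_from_two_3_2` are hypothetical names. The carry of the first adder is passed to the next bit as Co, and the carry-in Ci enters only the second adder, so Co does not depend on Ci.

```python
from itertools import product


def full_adder(a: int, b: int, c: int) -> tuple[int, int]:
    """One-bit 3-input 2-output adder: returns (sum, carry)."""
    return a ^ b ^ c, (a & b) | (b & c) | (a & c)


def adder_4_2_from_two_3_2(a: int, b: int, c: int, d: int, ci: int):
    """Return (SO, Co, CO) using two full adders in series."""
    s1, co_next = full_adder(a, b, c)      # first adder: 2 XOR stages, no Ci
    so, co_up = full_adder(s1, d, ci)      # second adder: 2 more XOR stages
    return so, co_next, co_up


def preserves_sum() -> bool:
    """Check A+B+C+D+Ci == SO + 2*(Co + CO) for all 32 input combinations."""
    return all(a + b + c + d + ci == so + 2 * (co_next + co_up)
               for a, b, c, d, ci in product((0, 1), repeat=5)
               for so, co_next, co_up in [adder_4_2_from_two_3_2(a, b, c, d, ci)])


if __name__ == "__main__":
    assert preserves_sum()
```

The output SO passes through both adders, which accounts for the four XOR stages on the critical path. The way the carry weight is divided between Co and CO in this arrangement can differ from other realizations while the arithmetic sum stays the same.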
With a suitably devised configuration of the 4-input 2-output adder 200, the delay time can be reduced further. Table 3 shows a truth table of an exemplary function that the 4-input 2-output adder 200 should satisfy.
TABLE 3
______________________________________
A    B    C    D    SO         Co   CO
______________________________________
0    0    0    0    Ci         0    0
0    0    0    1    .about.Ci  0    Ci
0    0    1    0    .about.Ci  0    Ci
0    0    1    1    Ci         0    1
0    1    0    0    .about.Ci  0    Ci
0    1    0    1    Ci         1    0
0    1    1    0    Ci         1    0
0    1    1    1    .about.Ci  1    Ci
1    0    0    0    .about.Ci  0    Ci
1    0    0    1    Ci         1    0
1    0    1    0    Ci         1    0
1    0    1    1    .about.Ci  1    Ci
1    1    0    0    Ci         0    1
1    1    0    1    .about.Ci  1    Ci
1    1    1    0    .about.Ci  1    Ci
1    1    1    1    Ci         1    1
______________________________________
FIG. 18 is a circuit diagram of an exemplary circuit which satisfies the truth table of Table 3. The critical path is the path to obtain an output SO of the 4-input 2-output adder 200. The output SO is an exclusive OR of five signals, i.e., inputs A, B, C, D and carry-in signal Ci. As can be seen from FIG. 18, the exclusive OR of the inputs A and B and the exclusive OR of the inputs C and D are processed in parallel, so that the critical path goes through only three stages of XORs. For convenience, the discussion below assumes that the delay of the 3-input 2-output adder 400 is two stages of XORs and that of the 4-input 2-output adder 200 is three stages of XORs.
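One set of Boolean expressions that reproduces Table 3 is sketched below. These expressions are derived here from the truth table and are an assumption; they are not taken from FIG. 18, and the names `adder_4_2` and `check_cell` are hypothetical. The sketch checks the two properties relied on in the text: the arithmetic sum is preserved, and the carry-out Co does not depend on the carry-in Ci.

```python
from itertools import product


def adder_4_2(a: int, b: int, c: int, d: int, ci: int) -> tuple[int, int, int]:
    """One-bit 4-input 2-output adder: returns (SO, Co, CO) as in Table 3."""
    x1 = a ^ b                          # first XOR stage
    x2 = c ^ d                          # computed in parallel with x1
    sel = x1 ^ x2                       # second XOR stage
    so = sel ^ ci                       # third XOR stage: SO = A^B^C^D^Ci
    co_next = (a | b) & (c | d)         # Co to the next bit; Ci is not used
    co_up = ci if sel else (a & b) | (c & d)   # CO, selected by sel
    return so, co_next, co_up


def check_cell() -> bool:
    """Verify sum preservation and the independence of Co from Ci."""
    for a, b, c, d in product((0, 1), repeat=4):
        outs = [adder_4_2(a, b, c, d, ci) for ci in (0, 1)]
        if outs[0][1] != outs[1][1]:
            return False
        for ci, (so, co_next, co_up) in enumerate(outs):
            if a + b + c + d + ci != so + 2 * (co_next + co_up):
                return False
    return True


if __name__ == "__main__":
    assert check_cell()
```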
Since the carry-out Co is not propagated across the next bit as mentioned above, the delay times of the addition blocks 22a to 22g and 24a directly depend on the delay times of the adders 200 and 400.
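Because Co of each bit reaches only the next bit, a whole addition block can be modelled with word-wide bitwise operations and a single one-bit shift. The sketch below is illustrative and rests on assumptions: `block_4_2` is a hypothetical name, the operands are non-negative integers of unbounded width, and the bit-level expressions are ones consistent with Table 3 rather than those of any particular figure.

```python
def block_4_2(a: int, b: int, c: int, d: int) -> tuple[int, int]:
    """Compress four words into a lower output SO and an upper output CO.

    The result satisfies a + b + c + d == SO + 2 * CO.
    """
    x1 = a ^ b
    x2 = c ^ d
    sel = x1 ^ x2
    ci = ((a | b) & (c | d)) << 1       # Co of bit i is the Ci of bit i+1
    so = sel ^ ci
    # Per bit: CO = Ci where sel is 1, otherwise (A AND B) OR (C AND D).
    co_up = (sel & ci) | (~sel & ((a & b) | (c & d)))
    return so, co_up


if __name__ == "__main__":
    so, co = block_4_2(0x12345678, 0xFFFFFFFF, 0x0F0F0F0F, 0xDEADBEEF)
    assert so + 2 * co == 0x12345678 + 0xFFFFFFFF + 0x0F0F0F0F + 0xDEADBEEF
```

The single shift is the only interaction between neighbouring bit positions, so the delay of the block equals that of one bit cell regardless of the word width.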
Provided that the multiplicand X and the multiplier Y as inputs of the multiplier are inputted at the same time, the values of the first elements pp.sub.0 to pp.sub.15 and the second elements pc.sub.0 to pc.sub.15 of the partial product generated according to the secondary Booth algorithm are determined at the same time.
In these addition blocks of FIGS. 14A to 14C, the addition is performed in the order of the first stage of the tree circuit (the 4-input 2-output addition blocks 22a to 22d), the second stage (the 4-input 2-output addition blocks 22e and 22f), the third stage (the 4-input 2-output addition block 22g) and the fourth stage (the 3-input 2-output addition block 24a). Accordingly, the delay time from the determination of the first elements pp.sub.0 to pp.sub.15 and the second elements pc.sub.0 to pc.sub.15 of the partial product to the determination of the lower output so.sub.28 and the upper output co.sub.28 of the 3-input 2-output addition block 24a as the two eventual intermediate sums is eleven (=3.times.3+2) stages of XORs since the critical path goes through three stages of 4-input 2-output addition blocks and one stage of 3-input 2-output addition block.
The tree circuit for adding up sixteen partial products to generate the two eventual intermediate sums, which is constituted mainly of 4-input 2-output addition blocks in FIG. 14A, may be constituted of 3-input 2-output addition blocks also in stages other than the final stage of the tree circuit.
FIG. 19 is a block diagram showing a configuration of a tree circuit where the 3-input 2-output addition blocks account for larger part. FIGS. 20A to 20D are block diagrams cooperatively showing the detail of FIG. 19. FIG. 20 is a schematic diagram showing the connection between FIGS. 20A to 20D. FIG. 20A is continuous with FIG. 20B at a virtual line Q29--Q29 and FIG. 20B is continuous with FIG. 20C at a virtual line Q31--Q31. Like FIGS. 14A to 14C, the width of each addition block corresponds to the bit width thereof and the position in a horizontal direction corresponds to the bit position.
Outputs from 4-input 2-output addition blocks 32a to 32d are inputted to 3-input 2-output addition blocks 34a to 34c, outputs from the 3-input 2-output addition blocks 34a to 34c are inputted to 3-input 2-output addition blocks 34d and 34e, and outputs from the 3-input 2-output addition blocks 34d and 34e are inputted to a 4-input 2-output addition block 32e. The 4-input 2-output addition block 32e outputs a lower output so.sub.40 and an upper output co.sub.40 as the two eventual intermediate sums.
Unlike the tree circuit of FIG. 13, the second elements pc.sub.j are collected in the order of j and inputted to the 4-input 2-output addition block 32a as ppc. That is expressed as

$$\mathrm{ppc} = \sum_{j=0}^{15} \mathrm{pc}_j\cdot 2^{2j}$$
In FIGS. 19 and 20A to 20D, the addition is performed in the order of the first stage of the tree circuit (the 4-input 2-output addition blocks 32a to 32d), the second stage (the 3-input 2-output addition blocks 34a to 34c), the third stage (the 3-input 2-output addition blocks 34d and 34e) and the fourth stage (the 4-input 2-output addition block 32e). Accordingly, the delay time from the determination of the first elements pp.sub.0 to pp.sub.15 and the second elements pc.sub.0 to pc.sub.15 of the partial product to the determination of the lower output so.sub.40 and the upper output co.sub.40 of the 4-input 2-output addition block 32e as the two eventual intermediate sums is ten (=3.times.2+2.times.2) stages of XORs since the critical path goes through two stages of 4-input 2-output addition blocks and two stages of 3-input 2-output addition blocks. Thus, the delay time is improved in this configuration as compared with that of FIGS. 13 and 14A to 14C.
However, this configuration has the disadvantages that the number of addition blocks increases by two and the circuit scale is enlarged. This is because the 3-input 2-output addition block, though its delay time is shorter than that of the 4-input 2-output addition block, processes one fewer input in parallel.
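The trade-off between the two tree circuits can be tabulated with a small calculation. This sketch only restates the figures given in the text under the stated convention of three XOR stages per 4-input 2-output block and two per 3-input 2-output block; the names are hypothetical.

```python
XOR_STAGES = {"4-2": 3, "3-2": 2}     # delay convention adopted in the text


def critical_path(stages: tuple[str, ...]) -> int:
    """Total delay, in XOR stages, of a path through the listed block types."""
    return sum(XOR_STAGES[kind] for kind in stages)


# Block types along the critical path, first stage to last stage.
FIG13_PATH = ("4-2", "4-2", "4-2", "3-2")
FIG19_PATH = ("4-2", "3-2", "3-2", "4-2")

# Block counts read from the reference numerals in the text.
FIG13_BLOCKS = 7 + 1                   # 22a to 22g, plus 24a
FIG19_BLOCKS = 5 + 5                   # 32a to 32e, plus 34a to 34e

if __name__ == "__main__":
    print(critical_path(FIG13_PATH), FIG13_BLOCKS)   # prints "11 8"
    print(critical_path(FIG19_PATH), FIG19_BLOCKS)   # prints "10 10"
```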
FIG. 21 is a block diagram of a tree circuit for generating the eventual intermediate sums in a circuit which multiplies a multiplicand by a multiplier, each being a 24-bit signed number in the two's complement representation, to obtain a product that is a 48-bit signed number in the two's complement representation. In this circuit, twelve partial products are generated according to the secondary Booth algorithm and added up in the form of a tournament while being compressed, to eventually provide the two eventual intermediate sums.
The tree circuit is constituted of 4-input 2-output addition blocks 42a to 42e and a 3-input 2-output addition block 44a. In these addition blocks, the addition is performed in the order of the first stage of the tree circuit (the 4-input 2-output addition blocks 42a to 42c), the second stage (the 4-input 2-output addition block 42d and the 3-input 2-output addition block 44a) and the third stage (the 4-input 2-output addition block 42e), and ppc is expressed as

$$\mathrm{ppc} = \sum_{j=0}^{11} \mathrm{pc}_j\cdot 2^{2j}$$
A path through three stages of 4-input 2-output addition blocks (through the addition blocks 42a (or 42b), 42d and 42e) has a longer delay time than a path through two stages of 4-input 2-output addition blocks and one stage of 3-input 2-output addition block (through the addition blocks 42c, 44a and 42e). Accordingly, the delay time from the determination of the first elements pp.sub.0 to pp.sub.11 and the second elements pc.sub.0 to pc.sub.11 of the partial product to the determination of the lower output so.sub.46 and the upper output co.sub.46 of the 4-input 2-output addition block 42e as the two eventual intermediate sums is nine stages of XORs, which corresponds to three stages of 4-input 2-output addition blocks.
As discussed above, the times at which the input data of the 4-input 2-output addition block 42e are determined are not the same for the two paths. Specifically, the lower output so.sub.45 and the upper output co.sub.45 of the 3-input 2-output addition block 44a are determined earlier than the lower output so.sub.44 and the upper output co.sub.44 of the 4-input 2-output addition block 42d by one stage of XOR. Further, the first element pp.sub.11, which is one of the input data of the 3-input 2-output addition block 44a, is determined earlier than the lower output so.sub.43 and the upper output co.sub.43 of the 4-input 2-output addition block 42c by three stages of XORs.
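The timing mismatch in the tree circuit of FIG. 21 can be reproduced with a simple arrival-time model. The sketch assumes that all partial-product elements are determined at time 0, that each block's outputs are determined when its latest input has arrived plus the block delay, and that blocks 42a and 42b feed block 42d while block 42c and pp.sub.11 feed block 44a, as the paths named in the text indicate. The function and variable names are hypothetical.

```python
DELAY = {"4-2": 3, "3-2": 2}          # XOR stages per block, as in the text


def arrival(kind: str, *input_times: int) -> int:
    """Time at which a block's outputs are determined."""
    return max(input_times) + DELAY[kind]


T_INPUT = 0                                        # pp_j and pc_j determined
t_42a = t_42b = t_42c = arrival("4-2", T_INPUT)    # first stage
t_42d = arrival("4-2", t_42a, t_42b)               # second stage, 4-2 block
t_44a = arrival("3-2", t_42c, T_INPUT)             # second stage; pp_11 at time 0
t_42e = arrival("4-2", t_42d, t_44a)               # third stage

if __name__ == "__main__":
    print(t_42e)              # prints 9: total delay in XOR stages
    print(t_42d - t_44a)      # prints 1: 44a is ready one XOR stage before 42d
    print(t_42c - T_INPUT)    # prints 3: pp_11 waits three XOR stages at 44a
```

The nonzero differences are idle time: a block whose inputs arrive at different times cannot start its addition until the latest one is determined.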
In the background art, the circuit operation of the tree circuit is performed with low parallelism in some cases depending on the bit width of the input data for multiplication. In other words, disadvantageously, speeding-up of the multiplier is not achieved because the timing of determining the input data of the circuit blocks constituting the tree circuit is not uniform.