1. Field of the Invention
The present invention generally relates to data processing systems, more specifically to logic circuits which perform arithmetic functions, and particularly multipliers that use Booth encoders.
2. Description of the Related Art
Data processing systems such as computers or digital signal processors use electronic logic circuits to perform various arithmetic functions including addition, multiplication, etc. These calculations are carried out by an arithmetic logic unit using binary arithmetic. Two common arithmetic logic units are the fixed-point unit and the floating-point unit. A fixed-point unit is used for integer operations, i.e., the source operands for a fixed-point unit are both integers. A floating-point unit is used for non-integer multiplication and division. In order to understand how an arithmetic logic unit (ALU) carries out binary arithmetic, it is useful to understand floating-point notation. Floating-point notation (also referred to as exponential notation) can be used to represent both very large and very small numbers, and thus makes it easier to handle a wide range of arithmetic operations.
A floating-point notation has three parts: a mantissa (or significand), an exponent, and a sign (positive or negative). The mantissa specifies the digits of the number, and the exponent specifies the magnitude of the number, i.e., the power of the base which is to be multiplied with the mantissa to generate the number. For example, using base 10, the number 28330000 would be written as 2833E+4, and the number 0.054565 would be represented as 54565E−6. However, since processors use binary values, an ALU uses 2 as the base (radix). Thus, a number may generally be expressed in binary terms according to the form
        n = (−1)^S × 1.F × 2^E,
where n is the floating-point number (in base 10), S is the sign of the number (0 for positive or 1 for negative), F is the fractional component of the mantissa (in base 2), and E is the exponent of the radix.
In most conventional computer systems, an integer number is represented with 32 bits or 64 bits. Similarly, a “single-precision” floating-point number is represented with 32 bits (one word), and a “double-precision” floating-point number is represented using a 64-bit (two word) field. In accordance with the Institute of Electrical and Electronics Engineers (IEEE) standard 754, a single-precision floating-point number uses the 32 bits as follows: the first bit indicates the sign (S), the next eight bits indicate the exponent offset by a bias amount of 127 (E+bias), and the last 23 bits indicate the fractional portion (F) of the mantissa. So, for example, the decimal number ten would be represented by the 32-bit binary value
        0 10000010 01000000000000000000000
as this corresponds to (−1)^0 × 1.01 (base 2) × 2^(130−127) = 1.25 × 2^3 = 10.
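The field layout described above can be checked with a short Python sketch. The helper names float_to_bits and decode_single are illustrative only, and the decoder handles normalized numbers only (zero, denormals, infinities and NaNs are outside its scope).

```python
import struct

def float_to_bits(x: float) -> int:
    """Return the 32-bit IEEE 754 single-precision pattern of x as an integer."""
    return struct.unpack(">I", struct.pack(">f", x))[0]

def decode_single(bits: int):
    """Split a 32-bit pattern into (S, E+bias, F) and rebuild the value.

    Assumes a normalized number, i.e. an implicit leading 1 in the mantissa.
    """
    sign = (bits >> 31) & 1              # first bit: S
    biased_exponent = (bits >> 23) & 0xFF  # next eight bits: E + 127
    fraction = bits & 0x7FFFFF           # last 23 bits: F
    value = (-1) ** sign * (1 + fraction / 2**23) * 2.0 ** (biased_exponent - 127)
    return sign, biased_exponent, fraction, value

bits = float_to_bits(10.0)
text = f"{bits:032b}"
print(text[0], text[1:9], text[9:])  # 0 10000010 01000000000000000000000
print(decode_single(bits))
```

The biased exponent of 130 and the fraction bits 0100... reproduce the 1.25 × 2^3 decomposition given in the text.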
A typical floating-point unit (FPU) has three inputs for receiving three input operands A, B and C, respectively, expressed as floating-point numbers in accordance with the foregoing discussion. The FPU then uses these operands to perform a “multiply-add” instruction, which executes the arithmetic operation ±[(A×C)±B]. The exponent portions of operands A, B and C are provided to an exponent calculator. The mantissa portions of operands A and C are provided to a multiplier unit, while the mantissa portion of operand B is provided to an alignment shifter. A typical fixed-point unit (FXU) operates in a similar manner and also uses a multiplier unit to carry out integer multiplication.
A conventional multiplier unit uses the mantissas of operands A and C and reduces the arithmetic function (A×C) to two intermediate results, known as “sum” and “carry.” These intermediate results are passed to an adder/incrementer in the arithmetic logic unit (i.e., FXU or FPU). The sum and carry results operate with the exponent calculator and alignment shifter to combine the values with operand B and yield the final result of the multiply-add instruction.
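The idea of reducing a product to a “sum” and a “carry” can be illustrated with carry-save addition, a common technique for this purpose. The following Python sketch is an illustration under that assumption, not a description of any particular multiplier unit; the function names are hypothetical, and the operands are taken to be non-negative integers.

```python
def carry_save_add(x: int, y: int, z: int):
    """Compress three addends into two (sum, carry) without propagating carries."""
    s = x ^ y ^ z                              # bitwise sum, carries ignored
    c = ((x & y) | (x & z) | (y & z)) << 1     # carries, moved one position left
    return s, c

def reduce_to_sum_carry(terms):
    """Repeatedly apply 3-to-2 compression until only two values remain."""
    terms = list(terms)
    while len(terms) > 2:
        s, c = carry_save_add(terms.pop(), terms.pop(), terms.pop())
        terms += [s, c]
    while len(terms) < 2:
        terms.append(0)
    return terms[0], terms[1]

def multiply(a: int, c: int, width: int = 8) -> int:
    """Multiply non-negative a by a width-bit non-negative c."""
    # One partial product per multiplier bit (no Booth recoding in this sketch).
    partials = [(a << i) if (c >> i) & 1 else 0 for i in range(width)]
    s, carry = reduce_to_sum_carry(partials)
    return s + carry   # the single carry-propagating addition at the end

print(multiply(13, 11))  # 143
```

Only the final addition propagates carries across the full word; every earlier step works on each bit position independently.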
With the current imperatives of energy efficiency and computation throughput, most reduced instruction set computers (RISCs) and digital signal processors (DSPs) use Booth-encoder multipliers to simultaneously address the low-power and high-speed design issues. In general, there are two basic approaches to enhance the speed of multipliers: one is the Booth algorithm, and the other is the Wallace tree compressor or counter. The Booth algorithm operates on the principle that multiplication can be considered as a series of repeated additions. The number to be added is the multiplicand, the number of times that it is added is the multiplier, and the result is the product. Each step of the addition process generates a partial product. It is possible to decompose multipliers into two parts. The first part is dedicated to the generation of partial products, and the second one collects and adds them. The Booth algorithm (or a modified Booth algorithm) is used because it reduces the number of partial products. The set of partial product bits is then collected using a Wallace tree.
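The reduction in the number of partial products can be seen in a small Python sketch of radix-4 (modified) Booth recoding. This is an arithmetic illustration only; the function names, the default 8-bit width, and the treatment of the multiplier as a two's-complement value are assumptions made for the example.

```python
def booth_radix4_digits(multiplier: int, width: int):
    """Recode a width-bit two's-complement multiplier into digits in {-2,...,2}.

    Digits are returned least-significant first; digit i has weight 4**i.
    """
    assert width % 2 == 0, "this sketch assumes an even bit width"
    digits = []
    prev = 0  # implicit 0 to the right of the least-significant bit
    for i in range(0, width, 2):
        b0 = (multiplier >> i) & 1
        b1 = (multiplier >> (i + 1)) & 1
        digits.append(-2 * b1 + b0 + prev)  # overlapping three-bit window
        prev = b1
    return digits

def booth_multiply(multiplicand: int, multiplier: int, width: int = 8):
    """Return (product, number of partial products)."""
    digits = booth_radix4_digits(multiplier, width)
    partials = [d * multiplicand * 4**i for i, d in enumerate(digits)]
    return sum(partials), len(partials)

print(booth_radix4_digits(11, 8))  # [-1, -1, 1, 0]
print(booth_multiply(13, 11))      # (143, 4)
```

An 8-bit multiplier yields four partial products here, compared with eight when one partial product is formed per multiplier bit.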
A typical prior art hardware multiplier 10 is illustrated in FIG. 1. A Booth encoder 12 receives the mantissa portion of operand C, and encodes the number according to the Booth algorithm. The encoded values are then forwarded to a multiplexer or Booth selector 14, which also has as an input the mantissa portion of operand A. Multiplexer 14 calculates the partial product bits for A×C, and then passes them on to a Wallace tree 16 which combines the bits to generate partial products. The output of Wallace tree 16 feeds a ripple carry adder 18 which uses the partial products to figure the sum and carry results.
An exemplary Booth encoder operates sequentially on three bits of the multiplier operand C in two-bit steps, i.e., C(0 . . . 2), C(2 . . . 4), C(4 . . . 6), and so on. The Booth encoder recodes these three bits into an output value. In one known method, the output is represented as four signals named “1x,” “−1x,” “2x” and “−2x”. Table 1 shows the cell encoding for multiplier bits C(0 . . . 2) as inputs according to this Booth algorithm:
TABLE 1
Input            Output
C(0 . . . 2)     −2x    −1x    1x    2x
100               1      0     0     0
101               0      1     0     0
110               0      1     0     0
111               0      0     0     0
000               0      0     0     0
001               0      0     1     0
010               0      0     1     0
011               0      0     0     1
Those skilled in the art will appreciate that the output values shown in Table 1 are only for the particular three input bits C(0 . . . 2) and that the outputs from the Booth encoder cells whose inputs are C(2 . . . 4), C(4 . . . 6), etc., are determined by other tables similar to Table 1.
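The encoding of Table 1 can be written as a lookup, sketched here in Python. The names TABLE_1, booth_encode and encoded_multiple are illustrative; the three input bits are given as a string with C(0) first, matching the order in which the table lists them.

```python
# Table 1 as a lookup: input bits C(0..2) -> signals (-2x, -1x, 1x, 2x).
TABLE_1 = {
    "100": (1, 0, 0, 0),
    "101": (0, 1, 0, 0),
    "110": (0, 1, 0, 0),
    "111": (0, 0, 0, 0),
    "000": (0, 0, 0, 0),
    "001": (0, 0, 1, 0),
    "010": (0, 0, 1, 0),
    "011": (0, 0, 0, 1),
}
SIGNAL_NAMES = ("-2x", "-1x", "1x", "2x")
SIGNAL_WEIGHTS = (-2, -1, 1, 2)

def booth_encode(bits: str) -> dict:
    """Return the four encoder output signals for one three-bit input group."""
    return dict(zip(SIGNAL_NAMES, TABLE_1[bits]))

def encoded_multiple(bits: str) -> int:
    """Return the multiple of the multiplicand that the active signal selects."""
    return sum(w * s for w, s in zip(SIGNAL_WEIGHTS, TABLE_1[bits]))

print(booth_encode("011"))      # {'-2x': 0, '-1x': 0, '1x': 0, '2x': 1}
print(encoded_multiple("100"))  # -2
```

At most one signal is active per row, and the selected multiple equals −2·C(0) + C(1) + C(2) for every input pattern, which is the usual radix-4 Booth digit.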
The encoded data (e.g., from Table 1) is used by selector 14 to generate partial sums. Selector 14 is typically arranged in two-dimensional arrays, with a given multiplexer cell in the array receiving one set of the output signals from a particular Booth encoder, and also receiving two bits of multiplicand operand A in one-bit steps, i.e., A(0 . . . 1), A(1 . . . 2), A(2 . . . 3), and so on. For example, one multiplexer cell corresponding to multiplier bits C(0 . . . 2) uses multiplicand operand bits A(1) and A(2), and the four output signals “1x,” “2x,” “−1x” and “−2x” from Table 1. The output signal for that particular multiplexer cell is determined by Table 2:
TABLE 2
Input                        Output
−2x    −1x    1x    2x       S
 1      0     0     0        NOT A(2)
 0      1     0     0        NOT A(1)
 0      0     0     0        0
 0      0     1     0        A(1)
 0      0     0     1        A(2)
Again, the output values in Table 2 are only for the signals representing the particular three multiplier input bits C(0 . . . 2) and two multiplicand input bits A(1 . . . 2), and similar tables will be applied for other multiplexer cells in selector 14.
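A single multiplexer cell of the kind described by Table 2 can be sketched as a small Python function. The function name and argument order are illustrative, and the sketch assumes that the “−2x” and “−1x” rows output the complement of the selected multiplicand bit, as is usual for Booth selectors that form negative multiples by inversion.

```python
def selector_cell(signals, a1: int, a2: int) -> int:
    """Return output bit S for one cell.

    signals: tuple (-2x, -1x, 1x, 2x) from the Booth encoder, at most one set.
    a1, a2:  multiplicand bits A(1) and A(2), each 0 or 1.
    """
    neg2, neg1, pos1, pos2 = signals
    if neg2:
        return 1 - a2   # assumed: complement of A(2) for the -2x multiple
    if neg1:
        return 1 - a1   # assumed: complement of A(1) for the -1x multiple
    if pos1:
        return a1       # 1x passes A(1) through
    if pos2:
        return a2       # 2x selects the neighbouring bit A(2)
    return 0            # no signal active: zero multiple

print(selector_cell((0, 0, 1, 0), a1=1, a2=0))  # 1
print(selector_cell((0, 0, 0, 1), a1=1, a2=0))  # 0
```

The “2x” and “−2x” signals pick the adjacent multiplicand bit rather than the cell's own bit, which is how the doubling is realized as a one-position shift.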
The construction of multiplier 10 is “pipelined,” meaning that one component of the multiplier (e.g., Wallace tree 16) can be operating on a set of data from a first multiplication operation while other components near the front of the pipeline (e.g., Booth encoder 12 and Booth selector 14) are operating on a set of data from a subsequent multiplication operation, during the same processing cycle. While pipelining improves computational efficiency, Booth encoder 12 is still in the critical path of the multiplier dataflow, and can often slow down the effective clock frequency of the processor and increase computation latency. A bottleneck can result when the encoded data from a Booth encoder must be distributed to the multiplexer. This problem becomes more critical as the clock frequency increases, particularly above a few gigahertz.
Booth-encoder multipliers are also poorly fashioned in terms of physical layout, i.e., circuit design of an integrated circuit (IC) chip. As illustrated in the floor plan of FIG. 2A, a typical Booth encoder design puts a gap in the multiplier layout, creating wires of uneven length (in the illustrated design, the Booth encoder is adapted for double-precision floating-point operations, and utilizes 53 bits). An alternative layout illustrated in FIG. 2B has a column of Booth encoder cells at one side of the design; in this layout, the Booth encoder must further rely on additional transistor circuitry to drive long wires on the IC chip.
In light of the foregoing, it would be desirable to devise an improved Booth encoder multiplier which could reduce the latency between the encoding and multiplexer stages of the multiplier. It would be further advantageous if the multiplier could eliminate the need for a gain tree after the Booth encoding that is currently required to drive long interconnection wires.