The present invention relates to automated multiplication, and more particularly to efficient automated multiplication that is especially well suited for multiplication of large numbers.
Hardware-implemented techniques for multiplying two numbers together are well known. In many processing system architectures, it is adequate to accomplish multiplication by iteratively instructing generic logic, such as an arithmetic logic unit (ALU), to perform suitable add and shift operations to generate the final product. However, it is often desirable to make available very fast multiplication operations, and to this end specialized multiplication logic is often provided. Such logic is often separate and apart from the central processing unit (CPU).
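By way of illustration only, the add-and-shift approach mentioned above can be sketched in a few lines of Python. The function name `shift_and_add_multiply` is hypothetical and the sketch is restricted to non-negative operands; it is a minimal model of the technique, not a description of any particular ALU.

```python
def shift_and_add_multiply(a: int, b: int) -> int:
    """Multiply two non-negative integers using only shifts and adds."""
    if a < 0 or b < 0:
        raise ValueError("this sketch handles non-negative operands only")
    product = 0
    while b:
        if b & 1:      # lowest multiplier bit is set: add the shifted multiplicand
            product += a
        a <<= 1        # move the multiplicand up one bit position
        b >>= 1        # consume one multiplier bit
    return product
```

One add and two shifts are performed per multiplier bit, which is why such iteration is adequate in some architectures but too slow where very fast multiplication is required.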
Hardware-mapped multiplier units are very useful so long as the size (i.e., word length) of the input operands is comparable to the size of the computational data paths for communicating those operands. However, in many applications (e.g., cryptographic algorithms) it is necessary to multiply together operands that are much larger than the size of the computational data path. In such cases, it is impractical to implement the desired multiplication using a hardware-mapped multiplication unit. Instead, one or both of the operands are broken up into parts, and the hardware data path is conventionally reused in a time-multiplexed fashion, operating on the parts, or words, of the input numbers. Hardware reuse also occurs in software implementations on standard microprocessors having a fixed word length data path.
The operation of carrying out a part of the multiplication for each word is denoted “partial product generation.” In order to have a fast execution time, the number of iterations is minimized by using a large word length (also denoted “high radix”) for the partial product generation. Unfortunately, higher radices imply longer carry chains and intermediate carry signals with larger word length, thereby slowing down operation and increasing power consumption. This can be seen from the following analysis:
A positive integer N-bit number a can be written as a sequence of W-bit words $a_i$ as

$$a = \sum_{i=0}^{N/W-1} a_i \, 2^{Wi}.$$

The generalization to negative and fractional numbers is straightforward, but not included in the calculations for the sake of simplicity. The multiplication of two numbers, $x = ab$, may be calculated by generating partial products from the W-bit words $a_i$ and $b_j$, and combining the partial products. More specifically, the product x may be calculated according to
$$x = \sum_{i=0}^{N/W-1} \sum_{j=0}^{N/W-1} a_i b_j \, 2^{W(i+j)},$$

where the partial product $x_{i,j}$ is generated from two W-bit numbers, $a_i$ and $b_j$, as

$$x_{i,j} = a_i b_j.$$

For a word length W, the partial products are simply calculated as W×W multiplications as indicated in the equation above. To calculate the complete product $x = ab$, all partial products are generated and added together according to their significance.
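By way of illustration only, the word decomposition and the double sum above can be modeled in Python as follows. The helper names (`split_words`, `join_words`, `multiply_by_partial_products`) and the least-significant-word-first ordering are assumptions made for this sketch.

```python
def split_words(value: int, n_bits: int, w: int) -> list[int]:
    """Split an n_bits-wide non-negative integer into n_bits // w words, least significant first."""
    if n_bits % w != 0:
        raise ValueError("w must divide n_bits")
    if value < 0 or value >> n_bits:
        raise ValueError("value must fit in n_bits")
    mask = (1 << w) - 1
    return [(value >> (w * i)) & mask for i in range(n_bits // w)]


def join_words(words: list[int], w: int) -> int:
    """Recombine words as the sum of a_i * 2^(W*i)."""
    return sum(word << (w * i) for i, word in enumerate(words))


def multiply_by_partial_products(a: int, b: int, n_bits: int, w: int) -> int:
    """Form x = ab as the weighted sum of all W x W partial products."""
    a_words = split_words(a, n_bits, w)
    b_words = split_words(b, n_bits, w)
    x = 0
    for i, a_i in enumerate(a_words):
        for j, b_j in enumerate(b_words):
            x_ij = a_i * b_j              # one W x W partial product
            x += x_ij << (w * (i + j))    # weight it by 2^(W(i+j))
    return x
```

For example, `multiply_by_partial_products(0xBEEF, 0xCAFE, 16, 4)` forms sixteen 4×4 partial products and returns the same value as `0xBEEF * 0xCAFE`.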
One partial product slice 101 is shown in FIG. 1. The rhombic shape is due to the significance of each of the partial product bits; significance increases when going from right to left in the figure. FIG. 2 is a diagram depicting how all of the required partial products are mathematically combined to generate the complete product. It is apparent from the figure that the computed result from one slice should be combined with the result from the neighboring slices to the left and right, and that these combination results are also accumulated with the values generated by the slices above and below.
The partial product $x_{i,j}$ is 2W bits wide, and is conventionally divided into two W-bit wide words, herein denoted carry ($c_{i,j}$) and lower partial product ($p_{i,j}$), as

$$x_{i,j} = 2^W c_{i,j} + p_{i,j},$$

or

$$c_{i,j} = \operatorname{int}(x_{i,j} / 2^W) \quad \text{and} \quad p_{i,j} = x_{i,j} \bmod 2^W,$$

where “int( )” is a function that generates the integer part of a number, and “mod n” indicates modulo n arithmetic.
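By way of illustration only, the division of a partial product into its carry and lower partial product words may be sketched as follows. The function name `split_partial_product` is hypothetical, and the inputs are assumed to be valid W-bit words.

```python
def split_partial_product(a_i: int, b_j: int, w: int) -> tuple[int, int]:
    """Return (c_ij, p_ij) for the 2W-bit product of two W-bit words."""
    x_ij = a_i * b_j
    c_ij = x_ij >> w                # int(x_ij / 2^W)
    p_ij = x_ij & ((1 << w) - 1)    # x_ij mod 2^W
    return c_ij, p_ij
```

With W = 8 and both words equal to 0xFF, the partial product is 0xFE01, so the carry word is 0xFE and the lower partial product is 0x01.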
Assume that it is desired to multiply two large numbers A and B, each of word length N, stored in a memory of word length W. Then, each number consists of N/W words (assuming that W is a factor of N). Let T be a storage area of 2N bits, or equivalently 2N/W words denoted $t_0, t_1, \ldots, t_{(2N/W)-1}$. T is used as a working storage area in which carry and lower partial product terms are accumulated until the final product, x, is generated. The final values for $t_0, t_1, \ldots, t_{(2N/W)-1}$ are efficiently generated from carry terms, previously-generated lower partial products, and interim values of $t_0, t_1, \ldots, t_{(2N/W)-1}$ as follows (where the symbol “:=” denotes a processing operation whereby already-existing (“old”) values of terms are combined as indicated on the right side of the symbol, with the result being assigned to the indicated “new” term on the left side of the symbol):
$$\begin{aligned}
t_0 &:= p_{0,0} \\
t_1 &:= c_{0,0} + p_{1,0} \\
t_2 &:= c_{1,0} + p_{2,0} \\
&\;\;\vdots \\
t_1 &:= t_1 + p_{0,1} \\
t_2 &:= t_2 + c_{0,1} + p_{1,1} \\
t_3 &:= t_3 + c_{1,1} + p_{2,1} \\
&\;\;\vdots
\end{aligned}$$

That is, operations take place in a right-to-left, top-to-bottom order, starting with the horizontal direction first, as illustrated in FIG. 3. For example, in FIG. 3 it can be seen that the word $b_0$ is first applied against each of the words $a_0 \ldots a_{(N/W)-1}$ to generate corresponding lower partial products $p_{0,0} \ldots p_{(N/W)-1,0}$, carry terms $c_{0,0} \ldots c_{(N/W)-1,0}$, and words $t_0 \ldots t_{(N/W)-1}$ before the next word $b_1$ is applied against the words $a_0 \ldots a_{(N/W)-1}$, and so on.
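By way of illustration only, the row-by-row accumulation order described above may be modeled as follows. The function name `multiply_row_by_row` is hypothetical. Two simplifications are assumed. First, each carry term is added to its destination word as soon as it is produced rather than when the next word is processed, which yields the same sums. Second, Python's unbounded integers stand in for the adders, so entries of T may temporarily exceed W bits and are normalized only when the final value is assembled.

```python
def multiply_row_by_row(a_words: list[int], b_words: list[int], w: int) -> int:
    """Accumulate lower partial products and carries into T, one row per word of b."""
    n = len(a_words)                        # N/W words per operand
    if len(b_words) != n:
        raise ValueError("both operands must have the same number of words")
    mask = (1 << w) - 1
    t = [0] * (2 * n)                       # working storage t_0 .. t_(2N/W)-1
    for j, b_j in enumerate(b_words):       # top to bottom: one row per word of b
        for i, a_i in enumerate(a_words):   # right to left: least significant word of a first
            x_ij = a_i * b_j
            t[i + j] += x_ij & mask         # lower partial product p_ij
            t[i + j + 1] += x_ij >> w       # carry c_ij lands one word higher
    # Weighting each t_k by 2^(W*k) and summing absorbs any excess above W bits.
    return sum(t_k << (w * k) for k, t_k in enumerate(t))
```

Words are supplied least significant first, so `multiply_row_by_row([0xF, 0xE, 0xE, 0xB], [0xE, 0xF, 0xA, 0xC], 4)` multiplies 0xBEEF by 0xCAFE.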
FIG. 4 is a logic diagram illustrating conventional logic of an exemplary row 301 for implementing multiplication as illustrated in FIG. 3. The first row 303 can be considered a special case in which the values of $t_0 \ldots t_{(N/W)-1}$ have each been initialized to zero. If the first row 303 is physically implemented by logic as depicted in FIG. 4, it can be efficiently realized by merely omitting the $t_k$ inputs from each indicated adder ($k = 0 \ldots (N/W)-1$).
FIG. 5 is a logic diagram of a generic one of the conventional partial product generators illustrated in FIG. 4. Mathematically, the outputs from the partial product generator are related to the inputs as follows:

$$2^W c_{\mathrm{OUT}} + t_{\mathrm{NEW}} = t_{\mathrm{OLD}} + p_{i,j} + c_{\mathrm{IN}},$$

where $p_{i,j}$ is a lower partial product. (It will be noted that, in order to ease the notational burden in this description, the carry term supplied to a partial product generator is henceforth referred to as “$c_{\mathrm{IN}}$”, and the carry term provided as an output from the partial product generator is henceforth referred to as “$c_{\mathrm{OUT}}$”.) While $t_{\mathrm{NEW}}$ and $t_{\mathrm{OLD}}$ may be maintained separately, in practical embodiments it is often most efficient to maintain a single value of t in storage, with $t_{\mathrm{OLD}}$ being the value read out of storage, and $t_{\mathrm{NEW}}$ being the value to be written back.
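By way of illustration only, the input/output relation above may be modeled as a single cell that is reused across each row. The names `ppg_cell` and `multiply_with_cells` are hypothetical. The sketch assumes that the product term entering the cell is the full product $a_i b_j$, as in the carry word length derivation in this description, and that a single value of t is read and written back in place.

```python
def ppg_cell(a_i: int, b_j: int, t_old: int, c_in: int, w: int) -> tuple[int, int]:
    """Return (c_out, t_new) such that 2^W * c_out + t_new = t_old + a_i * b_j + c_in."""
    total = t_old + a_i * b_j + c_in
    return total >> w, total & ((1 << w) - 1)


def multiply_with_cells(a_words: list[int], b_words: list[int], w: int) -> list[int]:
    """Row-by-row multiplication in which every stored word of T stays W bits wide."""
    n = len(a_words)
    if len(b_words) != n:
        raise ValueError("both operands must have the same number of words")
    t = [0] * (2 * n)
    for j, b_j in enumerate(b_words):
        carry = 0                           # no carry enters the first cell of a row
        for i, a_i in enumerate(a_words):
            carry, t[i + j] = ppg_cell(a_i, b_j, t[i + j], carry, w)
        t[j + n] = carry                    # this word has not been written by earlier rows
    return t                                # product words, least significant first
```

Unlike a model that lets stored words grow beyond W bits, every value written back to T here fits in W bits, because the excess is passed along the row as the carry.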
It will now be shown how this expression can be used to derive the minimum word length of the carry signal. Since $p_{i,j} = a_i b_j$, and $a_i, b_j \le 2^W - 1$, it follows that

$$p_{i,j} \le (2^W - 1)^2.$$

Furthermore, the word length of t is W bits, and thus $t \le 2^W - 1$. Thus, if we collect the carry terms on the left side of the relationship, and collect the t terms on the right side of the relationship, we find that

$$2^W c_{\mathrm{OUT}} - c_{\mathrm{IN}} = t_{\mathrm{OLD}} - t_{\mathrm{NEW}} + p_{i,j}.$$

The right side of the equation can be set to its maximum value by letting $t_{\mathrm{NEW}}$ be set to zero (i.e., its minimum value), and by letting $t_{\mathrm{OLD}}$ and $p_{i,j}$ each be set to their respective maximum values. This yields the following relationship:
$$2^W c_{\mathrm{OUT}} - c_{\mathrm{IN}} \le (2^W - 1) + (2^W - 1)^2 = 2^W (2^W - 1).$$

Since $c_{\mathrm{IN}}$ is, by definition, greater than or equal to zero, and since the relationship must be true for all values of $c_{\mathrm{IN}}$ (i.e., including $c_{\mathrm{IN}} = 0$), it can be concluded that $c_{\mathrm{OUT}} \le 2^W - 1$. Furthermore, the word length of the carry in signal is the same as the carry out signal. Therefore,

$$c_{\mathrm{IN}}, c_{\mathrm{OUT}} \le 2^W - 1.$$
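By way of illustration only, the bound on the carry word can be checked by exhaustive search for small word lengths. The function name `max_carry_out` is hypothetical, and the search assumes that all four inputs to the partial product generator are W-bit values.

```python
def max_carry_out(w: int) -> int:
    """Largest c_out over all W-bit a_i, b_j, t_old and c_in (exhaustive search)."""
    limit = 1 << w
    worst = 0
    for a_i in range(limit):
        for b_j in range(limit):
            for t_old in range(limit):
                for c_in in range(limit):
                    c_out = (t_old + a_i * b_j + c_in) >> w
                    worst = max(worst, c_out)
    return worst
```

For W = 4 the search covers 65,536 input combinations and returns 15, i.e. $2^W - 1$, so a W-bit carry in never produces a carry out wider than W bits.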
From the previous discussion, two statements can be made regarding the shown radix-$2^W$ approach:
1. All data words, including carry signals, are W bits wide.
2. The carry propagate chain for the radix-$2^W$ partial product generator approach is 2W bits long.
The length of the carry propagate chain sets the upper limit on the speed of a partial product generator implementation, and the size of the propagated carry sets the limit on the maximum required word length of the data path.
It is common to increase multiplication speed by using modified Booth encoding, Wallace adders to compress the number of partial products, and faster addition schemes for carry propagation summation. Booth encoding is discussed in A. D. Booth, “A signed binary multiplication technique,” Quarterly Journal of Mechanics and Applied Mathematics, vol. 4, pp. 236-240, 1951; and in L. P. Rubinfield, “A proof of the modified Booth algorithm for multiplication,” IEEE Transactions on Computers, October 1975, both of which are hereby incorporated herein by reference. Wallace adders are discussed in C. Wallace, “A suggestion for a fast multiplier,” IEEE Transactions on Electronic Computers, vol. EC-13, February 1964, which is hereby incorporated herein by reference.
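By way of illustration only, the textbook radix-4 form of modified Booth recoding may be sketched as follows. The sketch is not taken from the cited references; the names `booth_radix4_digits` and `booth_multiply` are hypothetical, and the multiplier is assumed to be non-negative so that a zero bit can serve as its sign bit.

```python
def booth_radix4_digits(y: int) -> list[int]:
    """Recode a non-negative integer into radix-4 digits from {-2, ..., 2}, least significant first."""
    if y < 0:
        raise ValueError("this sketch handles non-negative multipliers only")
    n_digits = y.bit_length() // 2 + 1      # leaves a zero bit on top to act as the sign bit
    digits = []
    prev = 0                                # the bit below bit 0 is taken as zero
    for i in range(n_digits):
        b0 = (y >> (2 * i)) & 1
        b1 = (y >> (2 * i + 1)) & 1
        digits.append(-2 * b1 + b0 + prev)  # digit from three overlapping bits
        prev = b1
    return digits


def booth_multiply(x: int, y: int) -> int:
    """Sum one multiple of x per recoded digit, weighted by powers of 4."""
    return sum(d * x * 4 ** i for i, d in enumerate(booth_radix4_digits(y)))
```

Each recoded digit selects one of the multiples 0, ±x, or ±2x, so the number of terms to be added is roughly half the number of multiplier bits. For example, the multiplier 7 is recoded as the digits −1 and 2, since −1 + 2·4 = 7.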
The choice of radix for the partial product generation implementation depends on a number of factors, mainly including constraints on clock frequency, area, available data word length, and latency. To have a fast and area-efficient partial product generation, the word length, or radix, has to be limited. A restricted word length results in a larger number of partial products, which take more time to add together when producing the full word length product. Thus, the choice of a radix for the partial product generator results in a sub-optimal solution.
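By way of illustration only, the dependence of the number of partial products on the word length can be tabulated as follows. The function name `partial_product_count` and the operand length of 1024 bits are assumptions chosen for this sketch; the count itself follows from multiplying two operands of N/W words each.

```python
def partial_product_count(n_bits: int, w: int) -> int:
    """Number of W x W partial products needed to multiply two n_bits-wide operands."""
    if n_bits % w != 0:
        raise ValueError("w must divide n_bits")
    return (n_bits // w) ** 2


# Tabulate the count for a hypothetical 1024-bit operand length.
for w in (8, 16, 32, 64):
    print(w, partial_product_count(1024, w))
```

For 1024-bit operands, increasing W from 8 to 64 reduces the count from 16,384 to 256 partial products, while the carry words and carry propagate chains grow with W as derived above.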
The use of Booth encoding, or other means to speed up partial product generation, may speed up calculation of the actual partial product, but the word length of the intermediate carry signal remains the same, thus not improving the time required for addition of the partial products.
It is therefore desirable to provide improved methods and apparatuses for multiplying large numbers together.