1. Field of the Invention
The present invention relates to high performance digital arithmetic circuitry, and more particularly to a high-radix multiplier-divider implemented in hardware for efficiently computing combined multiplication and division operations, e.g., computing ((A·B)÷D), as well as modulo multiplication, e.g., (A·B) mod D.
2. Description of the Related Art
Many recent computation-intensive applications, e.g., multimedia and public-key cryptosystems, demand ever more computational power. For example, public-key cryptosystems make heavy use of the modulo multiplication operation, which comprises a multiplication operation together with a division/reduction operation. The key size of RSA public-key cryptosystems has grown steadily, from 512 bits to 1024 bits, and most recently to 2048 bits, further increasing the demand for computational power. An extensive literature describes the theory and design of high speed multiplication and division algorithms. Division algorithms are commonly divided into five classes: (1) digit recurrence; (2) functional iteration; (3) very high radix; (4) table lookup; and (5) variable latency.
Digit recurrence is the oldest class of high speed division algorithms, and, as a result, a significant quantity of literature has been written proposing digit recurrence algorithms, implementations, and techniques. The most common implementation of digit recurrence division in modern processors has been the Sweeney, Robertson, Tocher (SRT) method.
Digit recurrence algorithms retire a fixed number of quotient bits in every iteration. Implementations of digit recurrence algorithms are typically low complexity, utilize small area, and have relatively large latencies. The fundamental choices in the design of a digit recurrence divider are the radix, the allowed quotient digits, and the representation of the partial remainder (residue). The radix determines how many bits of quotient are retired in an iteration, which fixes the division latency. Larger radices can reduce the latency, but increase the time for each iteration. Judicious choice of the allowed quotient digits can reduce the time for each iteration, but with a corresponding increase in complexity and hardware. Similarly, different representations of the partial remainder (residue) can reduce iteration time, but with corresponding increases in complexity.
Digit recurrence division algorithms use iterative methods to calculate quotients one digit per iteration. SRT is the most common digit recurrence division algorithm. The input operands are represented in a normalized floating point format with w-bit significands in sign and magnitude representation. Assuming floating point representation, the algorithm is applied only to the magnitudes of the significands of the input operands. Techniques for computing the resulting exponent and sign are straightforward. The most common format found in modern computers is the IEEE 754 standard for binary floating point arithmetic. This standard defines single and double precision formats.
In a division operation (N/D), N is a 2w-bit dividend, while D is a w-bit divisor. The division result is a w-bit quotient Q ($Q = 0.q_{-1} q_{-2} \ldots q_{-n}$) and a w-bit remainder R such that $N = QD + R$ and $|R| < |D|$. The w-bit quotient is defined to consist of n radix-r digits with $r = 2^m$ ($w = n \times m$). A division algorithm that retires m quotient bits per iteration is said to be a radix-r algorithm. Such an algorithm requires n iterations to compute the result. For no overflow, i.e., so that Q is w bits, the condition $|N| < |D|$ must be satisfied when dividing fractions. The following recurrence is used in every iteration of the SRT algorithm:

$$R_j = r R_{j-1} - q_{-j} D, \qquad j = 1, 2, 3, \ldots, n,$$

where $R_0 = N$, D is the divisor, N is the dividend, and $R_j$ is the partial residue at the jth iteration. One quotient digit (m bits) is retired each iteration using a quotient digit selection function SEL, where $q_{-j} = \mathrm{SEL}(r R_{j-1}, D)$.
After n iterations, the final value of the quotient Q and the remainder R are computed from Rn as follows:
$$\text{if } R_n > 0, \text{ then}\quad Q = \sum_{j=1}^{n} q_{-j}\, r^{-j} \quad\text{and}\quad R = R_n\, r^{-n};$$

$$\text{else}\quad Q = \left(\sum_{j=1}^{n} q_{-j}\, r^{-j}\right) - r^{-n} \quad\text{and}\quad R = (R_n + D)\, r^{-n},$$

where $r^{-n}$ is 1 in the least significant position; that is, $Q = \left(\sum_{j=1}^{n} q_{-j}\, r^{-j}\right) - \mathrm{ulp}$, where ulp designates a unit in the least significant position.
The critical path of the basic SRT digit recurrence algorithm comprises the following steps: (1) determination of the next quotient digit $q_{-j}$ using the quotient digit selection function, typically a look-up table implemented as a programmable logic array (PLA) or a read-only memory (ROM); (2) generation of the product $q_{-j}D$; and (3) subtraction of $q_{-j}D$ from the shifted partial residue $r R_{j-1}$. Each of these steps contributes to the algorithm cost and performance.
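The recurrence, digit selection, and final correction described above can be sketched in Python as a behavioral model. This is only an illustration under stated assumptions: the name srt_divide is hypothetical, exact rational arithmetic stands in for hardware registers, the selection function is modeled as "round the ratio and clamp to the digit set" rather than a PLA or ROM look-up, the input is assumed to satisfy |N| <= hD (with h = α/(r − 1)) so that the residue stays bounded, and a zero final residue is treated as needing no correction.

```python
from fractions import Fraction


def srt_divide(N, D, r=4, alpha=2, n=8):
    """Behavioral sketch of radix-r digit recurrence division.

    Returns (Q, R) with N == Q*D + R, where Q has n radix-r digits.
    Assumes D > 0 and |N| <= h*D, h = alpha/(r-1) (an assumption of
    this sketch, needed to keep every partial residue bounded).
    """
    N, D = Fraction(N), Fraction(D)
    h = Fraction(alpha, r - 1)
    assert D > 0 and abs(N) <= h * D

    R = N                      # R_0 = N
    Q = Fraction(0)
    for j in range(1, n + 1):
        P = r * R              # shifted partial residue r*R_{j-1}
        # Stand-in for SEL(r*R_{j-1}, D): nearest digit, clamped to
        # the symmetric digit set {-alpha, ..., alpha}.
        q = max(-alpha, min(alpha, round(P / D)))
        R = P - q * D          # R_j = r*R_{j-1} - q_{-j}*D
        Q += Fraction(q, r ** j)

    ulp = Fraction(1, r ** n)  # r^{-n}: unit in the last place
    if R >= 0:                 # zero residue handled here (assumption)
        return Q, R * ulp
    return Q - ulp, (R + D) * ulp
```

For example, srt_divide(Fraction(1, 3), Fraction(3, 4)) produces an eight-digit radix-4 quotient Q and a remainder R that satisfy N = QD + R exactly, with R smaller than D scaled by one ulp.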
A common method to decrease the overall latency of the algorithm (in machine cycles) is to increase the radix r of the algorithm. Assuming the same quotient precision, the number of iterations of the algorithm required to compute the quotient is reduced by a factor f when the radix is increased from $r = 2^m$ to $r = 2^{mf}$. For example, a radix-4 algorithm retires two bits of quotient in every iteration. Increasing to a radix-16 algorithm allows for retiring four bits in every iteration, halving the number of iterations.
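The effect of the radix on the iteration count can be checked with a few lines of Python. The helper name iterations and the 64-bit quotient width used in the example are assumptions chosen for illustration only.

```python
def iterations(w, m):
    """Iterations needed to retire a w-bit quotient, m bits at a time
    (radix r = 2**m). Uses ceiling division in case m does not divide w."""
    return -(-w // m)


# Hypothetical 64-bit quotient: radix 4 (m = 2) versus radix 16 (m = 4).
radix4_steps = iterations(64, 2)    # 32 iterations
radix16_steps = iterations(64, 4)   # 16 iterations
```

The count only measures the number of iterations; it says nothing about how long each iteration takes.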
This reduction does not come for free. As the radix increases, the quotient-digit selection becomes more complicated and, accordingly, slower to compute. Since the quotient selection logic is on the critical path of the basic algorithm, using higher radices causes the total cycle time of the division iteration to increase. The number of cycles, however, is reduced for higher radices. As a result, the total time required to compute a w-bit quotient may not be reduced as expected. Furthermore, the generation of divisor multiples may become impractical or infeasible for higher radices. Thus, these two factors can offset some of the performance gained by using higher radices.
Typically, for a system with radix r, a redundant signed digit set ($D_\alpha$) is used to increase the performance of the algorithm. To be redundant, the size of the digit set should be greater than r, including both negative and positive digits. Thus, $q_{-j} \in D_\alpha = \{-\alpha, -\alpha+1, \ldots, -1, 0, 1, \ldots, \beta-1, \beta\}$, where the number of allowed digits $(\alpha + \beta + 1)$ is greater than r. It is fairly common to choose a symmetric digit set where $\alpha = \beta$, in which case the size of the digit set $(2\alpha + 1) > r$, which implies that $\alpha$ must satisfy the condition $\alpha \ge \lceil r/2 \rceil$. The degree of redundancy is measured by the value of the redundancy factor h, where $h = \alpha/(r-1)$. Redundancy is maximal when $\alpha = r - 1$, in which case $h = 1$, while it is minimal when $\alpha = r/2$ (i.e., $1/2 < h \le 1$).
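As a small worked illustration of these conditions, the following Python sketch checks that a symmetric digit set is redundant and returns its redundancy factor. The function name redundancy_factor is hypothetical, and the range check simply encodes the bounds ⌈r/2⌉ <= α <= r − 1 stated above.

```python
from fractions import Fraction


def redundancy_factor(r, alpha):
    """Return h = alpha / (r - 1) for the symmetric digit set
    {-alpha, ..., alpha} of a radix-r system.

    Raises ValueError if alpha is outside ceil(r/2) <= alpha <= r - 1.
    """
    min_alpha = (r + 1) // 2           # ceil(r / 2) for integer r
    if not min_alpha <= alpha <= r - 1:
        raise ValueError("digit set is not a valid redundant set")
    return Fraction(alpha, r - 1)


h_min = redundancy_factor(4, 2)   # minimally redundant radix-4 set: 2/3
h_max = redundancy_factor(4, 3)   # maximally redundant radix-4 set: 1
```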
For the computed $R_j$ value to be bounded, the value of the quotient digit must be selected such that $|R_j| < hD$. Using larger values of h (i.e., large $\alpha$) reduces the complexity and latency of the quotient digit selection function. This, however, results in a more complex generation of the divisor multiples. Divisor multiples that are powers of two can be formed by simple shifting, while those that are not powers of two (e.g., three) require additional add/subtract steps. The complexity of the quotient digit selection function must be balanced against that of generating the divisor multiples.
To define the quotient digit selection function, a containment condition is used to determine the selection intervals. A selection interval is the region defined by the values of the shifted partial residue ($r R_{j-1}$) and the divisor (D) in which a particular quotient digit may be selected. The selection interval is defined by the upper ($U_k$) and lower ($L_k$) bounds for the shifted partial residue ($r R_{j-1}$) values in which a value of quotient digit $q_{-j} = k$ may be selected to keep the partial residue $R_j$ bounded. These are given by:

$$U_k = (h + k)D \quad\text{and}\quad L_k = (-h + k)D.$$
The P-D diagram is a useful tool in defining the quotient-digit selection function. It plots the shifted partial residue ($P = r R_{j-1}$) versus the divisor D. The $U_k$ and $L_k$ straight lines are drawn on this plot to define selection interval bounds for various values of k. FIG. 6 shows a P-D diagram for the case where r = 4 and α = 2 (h = 2/3). The shaded regions are the overlap regions where more than one quotient digit may be selected. Table I shows representative values of the upper and lower bounds $U_k$ and $L_k$ for the permissible values of the quotient digit k.
TABLE I
Upper and Lower Bounds vs. Quotient Digit

   k     Uk = (h + k)D     Lk = (−h + k)D
  −2       −4/3 D            −8/3 D
  −1       −1/3 D            −5/3 D
   0        2/3 D            −2/3 D
   1        5/3 D             1/3 D
   2        8/3 D             4/3 D
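The entries of Table I follow directly from the definitions of Uk and Lk with r = 4 and α = 2. The short Python sketch below recomputes them as exact fractions (coefficients of D); the function names selection_bounds and overlap_width are hypothetical, introduced only for this illustration.

```python
from fractions import Fraction


def selection_bounds(r, alpha):
    """Map each digit k in {-alpha, ..., alpha} to (U_k/D, L_k/D),
    where U_k = (h + k)D, L_k = (-h + k)D and h = alpha / (r - 1)."""
    h = Fraction(alpha, r - 1)
    return {k: (h + k, -h + k) for k in range(-alpha, alpha + 1)}


def overlap_width(r, alpha):
    """Width (as a multiple of D) of the region where both k-1 and k
    may be selected: U_{k-1} - L_k = 2h - 1."""
    h = Fraction(alpha, r - 1)
    return 2 * h - 1


bounds = selection_bounds(4, 2)
# bounds[1] == (Fraction(5, 3), Fraction(1, 3)), matching the row k = 1.
```

For r = 4 and α = 2, overlap_width returns 1/3, i.e., adjacent selection intervals overlap by D/3, which corresponds to the shaded regions of the P-D diagram.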
There is a need for a digital multiplier-divider unit that can efficiently compute

$$S = \left(\frac{A \cdot B}{D}\right),$$

where the multiplicand A, the multiplier B, and the divisor D are w-bit unsigned numbers. Computing S yields a w-bit quotient Q and a w-bit remainder R such that:

$$A \cdot B = Q \cdot D + R \quad\text{and}\quad |R| < |D|.$$
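As a reference point for the relation above, the following Python sketch computes Q and R by a full multiplication followed by a full division. It is a behavioral model only, not a hardware recurrence; the name muldiv_reference and the explicit check that the quotient fits in w bits are assumptions added for illustration.

```python
def muldiv_reference(A, B, D, w):
    """Return (Q, R) with A*B == Q*D + R and 0 <= R < D.

    A, B and D are w-bit unsigned integers. Raises OverflowError if
    the quotient does not fit in w bits (a check assumed here; the
    caller is expected to supply operands for which it does).
    """
    limit = 1 << w
    if not (0 <= A < limit and 0 <= B < limit and 0 < D < limit):
        raise ValueError("operands must be w-bit unsigned, D nonzero")
    Q, R = divmod(A * B, D)      # multiply first, then divide
    if Q >= limit:
        raise OverflowError("quotient exceeds w bits")
    return Q, R


Q, R = muldiv_reference(200, 100, 150, 8)   # Q = 133, R = 50
```

The remainder R returned here is also the modulo multiplication result (A·B) mod D.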
Conventionally, S would be computed using two independent operations: a multiplication operation, and a division operation. Whereas digit recurrence relations for these two operations have been proposed and are in common use by digital processors, no single recurrence relation has been proposed to simultaneously perform the multiplication and division operations as needed to efficiently compute
$$S = \left(\frac{AB}{D}\right).$$
Thus, a high-radix multiplier-divider solving the aforementioned problems is desired.