1. Field of the Invention
The present invention relates to efficient modular exponentiation, and more particularly, to an improved apparatus and method for efficient computation involved in Montgomery multiplication.
2. Description of Related Art
In the field of applied cryptography, one of the most important arithmetic operations for public-key cryptography is exponentiation. It is required in the widely used RSA algorithm, as well as in the ElGamal encryption scheme and the Diffie-Hellman key agreement. In these applications, the form of exponentiation used is modular exponentiation, i.e., a^x mod m is computed for some integers a, x, and m. This computation can be accomplished by forming powers of a and multiplying some of them together to obtain a^x. The desired result is then obtained by finding the remainder when a^x is divided by m. Multiplication is the operation used both in forming the powers of a and in multiplying them together. However, since only the remainder when divided by m is desired, a full multiplication is not required. Only enough information needs to be maintained during multiplication to obtain the final remainder. To that end, the scheme known as Montgomery multiplication is often used.
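As an illustration of forming powers of a and multiplying some of them together, the following is a minimal Python sketch of right-to-left binary exponentiation with a reduction modulo m after every multiplication. The function name mod_exp and its argument checks are assumptions made for this example; it uses ordinary modular reduction rather than Montgomery multiplication.

```python
def mod_exp(a, x, m):
    """Compute a**x mod m by right-to-left binary exponentiation.

    Illustrative helper (name and checks are assumptions, not from the text).
    """
    if m <= 0 or x < 0:
        raise ValueError("expects m > 0 and x >= 0")
    result = 1 % m      # 1 % m also covers the case m == 1
    power = a % m       # holds a^(2^j) mod m on the j-th pass
    while x > 0:
        if x & 1:       # this bit of x selects the current power of a
            result = (result * power) % m
        power = (power * power) % m   # form the next power by squaring
        x >>= 1
    return result
```

Each pass performs at most two modular multiplications, so the number of multiplications grows with the bit length of x rather than with x itself.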
The algorithm for Montgomery multiplication, as given in the Handbook of Applied Cryptography, algorithm 14.36, is:
INPUT: integers m=(m_{n−1} . . . m_1 m_0)_b, x=(x_{n−1} . . . x_1 x_0)_b, y=(y_{n−1} . . . y_1 y_0)_b, with 0≤x, y<m, R=b^n with gcd(m, b)=1, and m′=−m^{−1} mod b.
OUTPUT: x y R^{−1} mod m.
1. A←0. (Notation: A=(a_n a_{n−1} . . . a_1 a_0)_b.)
2. For i from 0 to (n−1) do the following:
    2.1 u_i←(a_0+x_i y_0)m′ mod b.
    2.2 A←(A+x_i y+u_i m)/b.
3. If A≥m then A←A−m.
4. Return A.
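The steps above can be sketched in Python as follows, assuming b=2^k so that the division by b is a right shift. The function name montgomery_multiply, the parameters k and n, and the input checks are assumptions made for this example; the three-argument pow with a negative exponent requires Python 3.8 or later.

```python
def montgomery_multiply(x, y, m, k, n):
    """Return x*y*R^(-1) mod m, with b = 2**k and R = b**n.

    Illustrative sketch of the listed steps; names are assumptions.
    """
    b = 1 << k
    if m % 2 == 0:
        raise ValueError("gcd(m, b) = 1 requires m to be odd when b = 2**k")
    if not (0 <= x < m and 0 <= y < m) or m >= b ** n:
        raise ValueError("expects 0 <= x, y < m < b**n")
    m_prime = (-pow(m, -1, b)) % b      # m' = -m^(-1) mod b
    y0 = y % b                          # least significant digit of y
    A = 0                               # step 1
    for i in range(n):                  # step 2
        x_i = (x >> (k * i)) & (b - 1)  # i-th base-b digit of x
        u_i = ((A % b + x_i * y0) * m_prime) % b   # step 2.1
        A = (A + x_i * y + u_i * m) >> k           # step 2.2: exact division by b
    if A >= m:                          # step 3
        A -= m
    return A                            # step 4
```

The choice of u_i makes the low digit of A+x_i y+u_i m equal to zero, which is why the shift in step 2.2 discards no information.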
In computer implementations, the value of b is always a power of 2. That is, b=2^k for some integer k. Then, step 2.1 requires two k-bit multiplies, one to compute x_i y_0 and the other to multiply by m′. Step 2.2 requires 2n multiplies, n to multiply y by x_i and another n to multiply m by u_i. Thus, the bulk of the work is performed in step 2.2. To carry out a multiply, the bits are combined with the “AND” operation, thereby forming partial sums, and these partial sums are then added together. The multiplication of y by x_i creates k partial sums and the multiplication of m by u_i creates another k partial sums. These 2k partial sums (that is, 2×k of them) are added together with the previous value of A to obtain the new value of A. It should be noted that the value of A has one more k-bit digit than m or y: A has n+1 digits, while m and y each have n digits. Although the above notation indicates that A is a base b value, for the purposes of addition, each digit a_i is k binary bits. So A is just one more term to be included in the addition. The addition of these 2k+1 terms can be done with a Wallace tree, resulting in two terms which are then added with a carry look-ahead adder to obtain the new value of A. It should be noted that the division by b is just a shift of k bits.
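To illustrate how many terms can be reduced to two before a final carry-propagating addition, the following Python sketch models the reduction with layers of 3:2 compressors (full adders applied bitwise), which is the building block of a Wallace tree. It is a software model only, assuming non-negative integer terms; the function name carry_save_reduce is an assumption for this example, and the sketch says nothing about hardware timing.

```python
def carry_save_reduce(terms):
    """Reduce non-negative integers to two integers with the same total.

    Software model of layered 3:2 compression (illustrative assumption).
    """
    terms = list(terms)
    while len(terms) > 2:
        full = len(terms) - len(terms) % 3   # number of terms in complete triples
        reduced = []
        for j in range(0, full, 3):
            a, b, c = terms[j:j + 3]
            s = a ^ b ^ c                                # per-bit sum, no carry propagation
            carry = ((a & b) | (a & c) | (b & c)) << 1   # per-bit carries, moved up one place
            reduced += [s, carry]
        reduced += terms[full:]              # leftover terms pass to the next layer
        terms = reduced
    while len(terms) < 2:
        terms.append(0)
    return terms[0], terms[1]
```

Only the final addition of the two returned values needs to propagate carries across the full word, which corresponds to the carry look-ahead step described above.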
In the case where b=2 (that is, k=1), the algorithm becomes:
INPUT: integers m=(m_{n−1} . . . m_1 m_0)_2, x=(x_{n−1} . . . x_1 x_0)_2, y=(y_{n−1} . . . y_1 y_0)_2, with 0≤x, y<m, R=2^n with m odd, and m′=1.
OUTPUT: x y R^{−1} mod m.
1. A←0.
2. For i from 0 to (n−1) do the following:
    2.1 u_i←(a_0+x_i y_0) mod 2.
    2.2 A←(A+x_i y+u_i m)/2.
3. If A≥m then A←A−m.
4. Return A.
Step 2.1 is very easy, as the computation of x_i y_0 is only an “AND” operation. This is followed by the addition of a_0 mod 2, which is just an “XOR” operation. Step 2.2 is the “ADDITION” of three values, the particular “ADD” operation chosen from one of the following, depending on the x_i and u_i values:
“ADD” A+y+m if x_i=1 and u_i=1.
“ADD” A+y+0 if x_i=1 and u_i=0.
“ADD” A+0+m if x_i=0 and u_i=1.
“ADD” A+0+0 if x_i=0 and u_i=0.
Since neither y nor m changes during the execution of this algorithm, y+m may be computed once before starting, and so step 2.2 becomes just selecting which of y+m, y, m, or 0 is to be added to A, adding the two values, then shifting this result by one place. Thus, when b=2, the bulk of the computation time in this algorithm is the carry look-ahead addition of the two values in step 2.2. As is well known in the prior art, complete addition incurs a delay because each digit of the sum can be determined only after the carry from the previous digits has been computed. Although carry look-ahead methods obtain all of the carries with a parallel technique, the time required is order log n, where n is the number of bits to be added. Even though order log n time is much faster than linear time (order n), it is slower than constant time.
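The b=2 case with the precomputed value y+m can be sketched in Python as follows. The function name montgomery_multiply_radix2, the tuple addends, and the input checks are assumptions made for this example. Python's built-in integer addition stands in for the carry look-ahead adder, so the sketch shows the selection logic only, not the timing behavior.

```python
def montgomery_multiply_radix2(x, y, m, n):
    """Return x*y*2^(-n) mod m for odd m, processing x one bit at a time.

    Illustrative sketch; names and checks are assumptions.
    """
    if m % 2 == 0:
        raise ValueError("m must be odd")
    if not (0 <= x < m and 0 <= y < m) or m >= (1 << n):
        raise ValueError("expects 0 <= x, y < m < 2**n")
    # Computed once before the loop; indexed by 2*x_i + u_i.
    addends = (0, m, y, y + m)
    y0 = y & 1
    A = 0
    for i in range(n):
        x_i = (x >> i) & 1
        u_i = (A & 1) ^ (x_i & y0)     # step 2.1: one AND followed by one XOR
        A = (A + addends[2 * x_i + u_i]) >> 1   # step 2.2: select, add, shift
    if A >= m:
        A -= m
    return A
```

For example, with m=97 and n=7, the call montgomery_multiply_radix2(45, 76, 97, 7) agrees with (45 * 76 * pow(128, -1, 97)) % 97 on Python 3.8 or later.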
Therefore, conventional computational methods that rely on carry look-ahead addition preclude the efficient utilization of current computer processor power. Although current computers are able to carry out many complex computations simultaneously, the carry look-ahead addition described above considerably slows each iteration of Montgomery multiplication. It would thus be highly desirable if the addition of step 2.2 could be carried out in constant time rather than in order log n time. Constant time processing would enable much quicker computation and provide an efficient method of Montgomery multiplication.