Modular multiplication and modular exponentiation are important operations in many cryptographic systems. Modular multiplication involves finding a product $c = ab$ and then dividing the product c by a modulus M to find a remainder that is referred to as a modular product. The result of modular multiplication of a and b performed modulo M is generally written as $c \equiv ab \bmod M$. The modular multiplication operation is also used to perform modular exponentiation, which is computed as a sequence of modular multiplications.
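As a baseline, the following Python sketch computes a modular product the way the preceding paragraph describes. It forms the full product and then divides by M for the remainder using Python's % operator. It then builds modular exponentiation from that product with left-to-right square-and-multiply. The function names mod_mul and mod_exp are hypothetical. Python's built-in pow(a, e, M) computes the same exponentiation.

```python
def mod_mul(a: int, b: int, m: int) -> int:
    """Modular product c = ab mod M: form the full product, then divide by M."""
    c = a * b          # the product itself is cheap
    return c % m       # the remainder needs a division by M, the costly step


def mod_exp(a: int, e: int, m: int) -> int:
    """a**e mod M (e >= 0) by left-to-right square-and-multiply, using only mod_mul."""
    result = 1 % m
    for bit in bin(e)[2:]:                    # exponent bits, most significant first
        result = mod_mul(result, result, m)   # square
        if bit == "1":
            result = mod_mul(result, a, m)    # multiply by the base
    return result


if __name__ == "__main__":
    print(mod_mul(17, 23, 29))   # 14
    print(mod_exp(4, 13, 497))   # 445
```

For example, mod_mul(17, 23, 29) returns 14 (391 = 13·29 + 14). Each step of mod_exp pays for one or two divisions by M.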
Modular multiplication and exponentiation are used in the Diffie-Hellman and RSA public-key cryptosystems, described in, for example, W. Diffie and M. E. Hellman, “New Directions in Cryptography,” IEEE Trans. on Information Theory, vol. 22, pp. 644–654 (1976), and R. L. Rivest, A. Shamir, and L. Adleman, “A Method for Obtaining Digital Signatures and Public-key Cryptosystems,” Communications of the ACM, vol. 21, pp. 120–126 (1978). Modular multiplication is also used in elliptic curve cryptography over the finite field $\mathrm{GF}(2^k)$ and in discrete exponentiation over $\mathrm{GF}(2^k)$. These applications are described in Ç. K. Koç and T. Acar, “Fast Software Exponentiation in $\mathrm{GF}(2^k)$,” in T. Lang, J.-M. Muller, and N. Takagi, eds., Proceedings, 13th Symposium on Computer Arithmetic, pp. 225–231 (Asilomar, Calif., Jul. 6–9, 1997).
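In $\mathrm{GF}(2^k)$, modular multiplication means multiplying polynomials over GF(2) and then reducing modulo an irreducible polynomial $P(x)$ of degree k. The following minimal Python sketch assumes a polynomial basis, in which bit i of an integer holds the coefficient of $x^i$. The function name gf2k_mul is hypothetical. Addition becomes XOR, and the reduction is folded into a shift-and-add loop. The example uses $P(x) = x^8 + x^4 + x^3 + x + 1$, the field polynomial of the AES specification (FIPS 197). It was chosen only because its worked values are easy to check.

```python
def gf2k_mul(a: int, b: int, poly: int, k: int) -> int:
    """Return a(x)·b(x) mod poly(x) in GF(2^k).

    Bit i of each integer is the coefficient of x^i. a and b must have degree < k,
    and poly must include its x^k term (0x11B encodes x^8 + x^4 + x^3 + x + 1).
    """
    result = 0
    for i in range(k):               # scan the coefficients of b, from x^0 upward
        if (b >> i) & 1:
            result ^= a              # polynomial addition over GF(2) is XOR
        a <<= 1                      # a := a·x
        if (a >> k) & 1:
            a ^= poly                # cancel the x^k term: reduce modulo poly
    return result


if __name__ == "__main__":
    print(hex(gf2k_mul(0x57, 0x83, 0x11B, 8)))   # 0xc1
```

The value 0xC1 for 0x57·0x83 matches the multiplication example given in FIPS 197.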
While obtaining a product $c = ab$ can be fast and efficient, the division by M used to obtain a modular product is slow and inefficient. One method of improving the speed and efficiency of modular multiplication (and exponentiation) is known as Montgomery multiplication. It was first described in P. L. Montgomery, “Modular Multiplication Without Trial Division,” Math. of Computation, vol. 44, pp. 519–521 (1985). Montgomery multiplication is particularly suitable for implementation on general-purpose computers, dedicated signal processors, or microprocessors. The method is based on a particular representation of the residue class modulo M, and it replaces division by M with division by a power of 2. This latter operation is easily accomplished on a computer. Numbers are typically represented in binary form, so division by a power of 2 is a simple bit-shift operation. Applications of Montgomery multiplication are described in, for example, T. Hamano, N. Takagi, S. Yajima, and F. P. Preparata, “O(n)-Depth Circuit Algorithm for Modular Exponentiation,” in S. Knowles and W. H. McAllister, eds., Proceedings, 12th Symposium on Computer Arithmetic, pp. 188–192 (Bath, England, Jul. 19–21, 1995), and Ç. K. Koç and T. Acar, “Fast Software Exponentiation in $\mathrm{GF}(2^k)$,” cited above.
In Montgomery multiplication, integers x and y that are elements of a complete residue set modulo M are transformed to their respective Montgomery images X and Y. The transformation is $A \equiv ar \bmod M$, wherein uppercase A denotes the Montgomery image of lowercase a and r is an integer such that $\gcd(r, M) = 1$. The Montgomery images X and Y are multiplied according to the Montgomery method to obtain a Montgomery product $Z \equiv XYr^{-1} \bmod M$. Because $Z \equiv (xr)(yr)r^{-1} \equiv xyr \pmod M$, Z is itself the Montgomery image of the product xy. The result $z \equiv xy \bmod M$ is obtained by transforming Z back into the complete residue set, i.e., $z \equiv Zr^{-1} \bmod M$.
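The transformation and product described above can be sketched in Python. The sketch assumes $r = 2^n$, where n is the bit length of an odd M; this is one common choice that satisfies $\gcd(r, M) = 1$. It computes the Montgomery product with Montgomery's word-level reduction, often called REDC, rather than with a bit-serial loop. The class and method names are hypothetical.

```python
class Montgomery:
    """Montgomery arithmetic modulo an odd M, assuming r = 2**n with n = bit length of M."""

    def __init__(self, m: int) -> None:
        if m <= 1 or m % 2 == 0:
            raise ValueError("M must be odd and greater than 1 so that gcd(r, M) = 1")
        self.m = m
        self.n = m.bit_length()                       # r = 2**n > M
        self.r = 1 << self.n
        self.mask = self.r - 1                        # t & mask == t mod r
        self.m_neg_inv = (-pow(m, -1, self.r)) & self.mask   # -M^(-1) mod r
        self.r2 = (self.r * self.r) % m               # r^2 mod M, for entering the domain

    def redc(self, t: int) -> int:
        """Return t·r^(-1) mod M for 0 <= t < r·M, using masks, shifts, and products only."""
        q = ((t & self.mask) * self.m_neg_inv) & self.mask   # makes t + q·M divisible by r
        u = (t + q * self.m) >> self.n                        # exact division by r
        return u - self.m if u >= self.m else u               # u < 2M, one subtraction

    def product(self, x_img: int, y_img: int) -> int:
        """Montgomery product Z = X·Y·r^(-1) mod M for images 0 <= X, Y < M."""
        return self.redc(x_img * y_img)

    def to_image(self, a: int) -> int:
        """A = a·r mod M, computed as the Montgomery product of a and r^2 mod M."""
        return self.product(a % self.m, self.r2)

    def from_image(self, a_img: int) -> int:
        """a = A·r^(-1) mod M, i.e. the Montgomery product of A and 1."""
        return self.redc(a_img)
```

With M = 13, so that r = 16, to_image(5) gives 2 and product(5, 7) gives 3. Converting 5 and 7 to their images, multiplying them, and converting back yields 35 mod 13 = 9. The only division by M is hidden in precomputing r² mod M, which is done once per modulus.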
Montgomery multiplication is typically performed using a radix-2 algorithm in which r is a power of 2. The algorithm takes m-bit operands $X = (x_{m-1}, \ldots, x_1, x_0)$ and $Y$ and an odd modulus M, and computes $S_m \equiv XY2^{-m} \bmod M$ (i.e., $r = 2^m$). A pseudocode representation of the radix-2 algorithm is:

    S_0 := 0
    for i = 0 to m−1
        if (S_i + x_i·Y) is even
            then S_{i+1} := (S_i + x_i·Y)/2
            else S_{i+1} := (S_i + x_i·Y + M)/2
    if S_m ≥ M then S_m := S_m − M

wherein the operands X and Y are Montgomery images of integers x and y. (Note that the subscripted variables $x_i$ refer to bits of the Montgomery image X, while the unsubscripted variable x denotes an element of the complete residue set.) This algorithm is well suited to hardware implementation because it is composed of simple operations: word-by-bit multiplication, bit shifting (division by 2), and addition. The test of the even condition is also simple. Checking the least significant bit of the partial sum $S_i + x_iY$ determines whether M must be added, and because M is odd, adding it makes the sum even. However, every operation is performed at the full precision of the operands, so hardware designed for m-bit operands does not work for operands having more bits.
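The pseudocode above can be sketched in Python as follows. This is an illustration, not a hardware description: Python's arbitrary-precision integers stand in for the m-bit datapath. The function name mont_mul_radix2 and its bits parameter, which plays the role of m, are hypothetical.

```python
def mont_mul_radix2(x: int, y: int, m: int, bits: int) -> int:
    """Bit-serial radix-2 Montgomery product S_m = X·Y·2^(-bits) mod M.

    Follows the pseudocode step by step; assumes M odd and 0 <= X, Y < M < 2**bits.
    """
    s = 0                              # S_0 := 0
    for i in range(bits):              # for i = 0 to m-1
        xi = (x >> i) & 1              # bit x_i of X
        s += xi * y                    # word-by-bit product x_i·Y added to S_i
        if s & 1:                      # odd partial sum: add M (odd), making it even
            s += m
        s >>= 1                        # division by 2 as a right shift
    if s >= m:                         # final conditional subtraction
        s -= m
    return s


if __name__ == "__main__":
    print(mont_mul_radix2(5, 7, 13, 4))   # 3, i.e. 5·7·2^(-4) mod 13
```

Because X, Y < M, every partial sum stays below 2M. That is why a single conditional subtraction at the end is enough.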
Given two integers X and Y (the Montgomery images of x and y, respectively), the radix-2 Montgomery multiplication (MM) algorithm, applied with the parameters for n bits, produces the Montgomery product Z:

$$Z = \mathrm{MM}(X, Y) = XYr^{-1} \bmod M, \qquad (1)$$

wherein $r = 2^n$ and M is an integer in the range $2^{n-1} < M < 2^n$. For cryptographic applications, M is usually a large odd prime or the product of two large odd primes. Such an M is odd, and r is a power of 2, so the condition that r and M be relatively prime, i.e., $\gcd(r, M) = 1$, is always satisfied.
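The following is a small worked instance of equation (1), with values chosen purely for illustration. It takes $n = 4$ and $M = 13$, so that $2^3 < M < 2^4$ and $r = 16$, and uses $X = 5 = 0101_2$ and $Y = 7$. The product is computed both directly and with the radix-2 loop:

```latex
\begin{aligned}
r &= 2^4 = 16 \equiv 3 \pmod{13}, \qquad r^{-1} \equiv 9 \pmod{13} \quad (3 \cdot 9 = 27 \equiv 1),\\
Z &= \mathrm{MM}(5, 7) \equiv 5 \cdot 7 \cdot 9 = 315 \equiv 3 \pmod{13},\\
S_1 &= (0 + 7 + 13)/2 = 10 \quad (x_0 = 1,\ \text{sum } 7 \text{ odd}),\\
S_2 &= 10/2 = 5 \quad (x_1 = 0),\\
S_3 &= (5 + 7)/2 = 6 \quad (x_2 = 1,\ \text{sum } 12 \text{ even}),\\
S_4 &= 6/2 = 3 < M \quad (x_3 = 0).
\end{aligned}
```

Both routes give $Z = 3$, and no final subtraction is needed in this case.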
Because the Montgomery method does not require division by M, it can be efficient, especially for computations in which repeated transformations between the complete residue set and the Montgomery images are not needed. Montgomery multiplication is therefore especially attractive for exponentiation. There, many Montgomery products are computed before a result is transformed back to the complete residue set.
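The following minimal Python sketch shows this pattern. It assumes $r = 2^n$, where n is the bit length of an odd modulus M, and it uses a REDC-style Montgomery product. The base is converted to its Montgomery image once. Every squaring and multiplication then stays in the Montgomery domain. A single Montgomery product with 1 converts the result back at the end. The name mont_exp is hypothetical.

```python
def mont_exp(a: int, e: int, m: int) -> int:
    """Compute a**e mod m (e >= 0) with all intermediate products in Montgomery form.

    Assumes m odd and m > 1; r = 2**n with n = m.bit_length() is an assumed choice.
    """
    n = m.bit_length()
    r = 1 << n
    mask = r - 1
    m_neg_inv = (-pow(m, -1, r)) & mask        # -M^(-1) mod r, computed once

    def mont_product(x_img: int, y_img: int) -> int:
        t = x_img * y_img
        q = ((t & mask) * m_neg_inv) & mask     # makes t + q·M divisible by r
        u = (t + q * m) >> n                    # division by r is a shift
        return u - m if u >= m else u

    a_img = mont_product(a % m, (r * r) % m)    # enter the domain: a·r mod M
    acc = r % m                                 # Montgomery image of 1
    for bit in bin(e)[2:]:                      # square-and-multiply on images only
        acc = mont_product(acc, acc)
        if bit == "1":
            acc = mont_product(acc, a_img)
    return mont_product(acc, 1)                 # leave the domain once, at the end


if __name__ == "__main__":
    print(mont_exp(4, 13, 497))   # 445, the same as pow(4, 13, 497)
```

The conversion costs are paid twice per exponentiation rather than once per multiplication. This is the reason the paragraph above gives for Montgomery multiplication suiting exponentiation.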
Various improvements to Montgomery multiplication have been suggested that offer increased efficiency in either hardware or software implementations. Some examples are described in H. Orup, “Simplifying Quotient Determination in High-radix Modular Multiplication,” in S. Knowles and W. H. McAllister, eds., Proceedings, 12th Symposium on Computer Arithmetic, pp. 193–199 (Bath, England, Jul. 19–21, 1995); Ç. K. Koç, T. Acar, and B. S. Kaliski Jr., “Analyzing and Comparing Montgomery Multiplication Algorithms,” IEEE Micro, vol. 16, pp. 26–33 (1996); A. Bernal and A. Guyot, “Design of a Modular Multiplier Based on Montgomery's Algorithm,” in 13th Conference on Design of Circuits and Integrated Systems, pp. 680–685 (Madrid, Spain, Nov. 17–20, 1998); S. E. Eldridge and C. D. Walter, “Hardware Implementation of Montgomery's Modular Multiplication Algorithm,” IEEE Trans. Computers, vol. 42, pp. 693–699 (June 1993); and P. Kornerup, “High-radix Modular Multiplication for Cryptosystems,” in E. Swartzlander, Jr., M. J. Irwin, and G. Jullien, eds., Proceedings, 11th Symposium on Computer Arithmetic, pp. 277–283 (Windsor, Ontario, June 29–Jul. 2, 1993).
The improvements to Montgomery multiplication described in the above references generally require that the associated algorithms and hardware use limited-precision operands or a high radix. High-radix algorithms are usually complex and consume significant chip area, and circuits based on them do not necessarily run faster. A theoretical investigation of the design trade-offs for high-radix modular multipliers is given in C. D. Walter, “Space/Time Trade-offs for Higher Radix Modular Multiplication Using Repeated Addition,” IEEE Trans. Computers, vol. 46, pp. 139–141 (1997). An example of a radix-4 design is described in A. Royo et al., cited below. Increasing the radix forces the use of digit multipliers, and therefore more complex designs and longer clock cycle times.
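The following Python sketch illustrates what these digit multipliers do. It generalizes the radix-2 loop to radix $2^k$, where $k = 2$ corresponds to a radix-4 design. It is a generic formulation, not the design of any cited reference, and the function name and parameters are hypothetical. The radix-2 algorithm needs only a single bit test per step. Each step here instead needs a k-bit digit times operand product $x_iY$ and a quotient digit $q_i$ that makes the partial sum divisible by $2^k$. A second digit product $q_iM$ follows.

```python
def mont_mul_high_radix(x: int, y: int, m: int, digit_bits: int, num_digits: int) -> int:
    """Radix-2^k Montgomery product X·Y·2^(-k·d) mod M, with k = digit_bits, d = num_digits.

    Assumes M odd and 0 <= X, Y < M < 2**(k·d).
    """
    base = 1 << digit_bits
    mask = base - 1
    m_prime = (-pow(m, -1, base)) & mask      # -M^(-1) mod 2^k, precomputed once
    s = 0
    for i in range(num_digits):
        xi = (x >> (digit_bits * i)) & mask    # i-th radix-2^k digit of X
        s += xi * y                            # digit multiplier: x_i·Y
        qi = ((s & mask) * m_prime) & mask     # quotient digit: s + q_i·M divisible by 2^k
        s = (s + qi * m) >> digit_bits         # second digit product, then exact shift
    return s - m if s >= m else s              # partial sums stay below 2M


if __name__ == "__main__":
    print(mont_mul_high_radix(5, 7, 13, 2, 2))   # 3, two radix-4 steps
```

With digit_bits = 1 the loop reduces to the radix-2 algorithm. With digit_bits = 2 it computes the same 5·7·2^(-4) mod 13 = 3 in two steps instead of four, at the cost of wider multipliers in each step.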
Multipliers designed for operands of a specific bit length generally cannot be used for operands of a different length. For example, a 768-bit multiplier is described by A. Royo, J. Moran, and J. C. Lopez, “Design and Implementation of a Coprocessor for Cryptography Applications,” in European Design and Test Conference, pp. 213–217 (Paris, France, Mar. 17–20, 1997). It cannot readily be used with 1024-bit operands. Typically, the functions performed in lower-precision designs do not match the functions needed for higher-precision multiplication, so a new multiplier must be designed.
To obtain scalable hardware, a conventional solution combines software with standard digit multipliers. Algorithms for computing Montgomery multiplication in software are presented in Ç. K. Koç and T. Acar, “Montgomery Multiplication in $\mathrm{GF}(2^k)$,” Designs, Codes and Cryptography, vol. 14, pp. 57–69 (1998), and Ç. K. Koç, T. Acar, and B. S. Kaliski Jr., “Analyzing and Comparing Montgomery Multiplication Algorithms,” IEEE Micro, vol. 16, pp. 26–33 (1996). These software-oriented algorithms are much more complex than the radix-2 hardware algorithm, which makes direct hardware implementations of them unattractive.
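For comparison, the following Python sketch is modeled on the Coarsely Integrated Operand Scanning (CIOS) method, one of the word-level algorithms analyzed by Koç, Acar, and Kaliski. It is not a transcription of that paper's pseudocode. The 32-bit word size W, the little-endian word arrays, and all function names are assumptions made for illustration. Even in this compact form, the nested word loops, the carry propagation, and the per-iteration quotient word show where the extra complexity comes from.

```python
from typing import List

W = 32                       # assumed word size in bits
BASE = 1 << W
MASK = BASE - 1


def to_words(v: int, s: int) -> List[int]:
    """Split v into s little-endian W-bit words."""
    return [(v >> (W * j)) & MASK for j in range(s)]


def from_words(words: List[int]) -> int:
    """Reassemble little-endian W-bit words into an integer."""
    return sum(w << (W * j) for j, w in enumerate(words))


def mont_mul_cios(a: List[int], b: List[int], n: List[int], n0_prime: int) -> List[int]:
    """CIOS-style Montgomery product a·b·2^(-W·s) mod n on s-word arrays.

    Assumes n odd, a and b < n, and n0_prime = -n^(-1) mod 2^W.
    """
    s = len(n)
    t = [0] * (s + 2)
    for i in range(s):
        # Multiplication: t := t + a·b[i], word by word with carry propagation.
        carry = 0
        for j in range(s):
            acc = t[j] + a[j] * b[i] + carry
            t[j], carry = acc & MASK, acc >> W
        acc = t[s] + carry
        t[s], t[s + 1] = acc & MASK, acc >> W
        # Reduction: add q·n so the lowest word becomes zero, then shift down one word.
        q = (t[0] * n0_prime) & MASK
        carry = (t[0] + q * n[0]) >> W          # low word is zero by construction
        for j in range(1, s):
            acc = t[j] + q * n[j] + carry
            t[j - 1], carry = acc & MASK, acc >> W
        acc = t[s] + carry
        t[s - 1], carry = acc & MASK, acc >> W
        t[s] = t[s + 1] + carry
    # Final conditional subtraction (on Python integers for brevity); t < 2n here.
    u = from_words(t[: s + 1])
    n_int = from_words(n)
    return to_words(u - n_int if u >= n_int else u, s)


def mont_mul_int(a: int, b: int, n: int, s: int) -> int:
    """Convenience wrapper: integer operands in, integer Montgomery product out."""
    n0_prime = (-pow(n, -1, BASE)) & MASK
    return from_words(mont_mul_cios(to_words(a, s), to_words(b, s), to_words(n, s), n0_prime))
```

Changing s, the number of words, changes the operand size without changing the code. Software gains this scalability at the cost of the word-level bookkeeping shown above.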
For these reasons, improved Montgomery multiplication methods and apparatus are needed that provide scalability without increasing multiplier cost, execution time, or complexity.