Modular arithmetic for very large moduli is a staple of public-key cryptographic systems including Rivest, Shamir, and Adleman (RSA: U.S. Pat. No. 4,405,829), Diffie-Hellman (DH: U.S. Pat. No. 4,200,770), the Digital Signature Algorithm (DSA: U.S. Pat. No. 5,231,668), Elliptic Curve Cryptosystems (ECC), and the like. Because these systems perform modular exponentiation, typically via a square and multiply method, modular multiplication is a critical bottleneck that has been addressed through dedicated circuits in prior art with limited success.
Residual modular multiplication (Montgomery, P. L., “Modular multiplication without trial division,” Math. Comput., vol. 44, no. 170, pp. 519-521, April 1985.) depends on a predetermined parameter R that is relatively prime to the modulus n. We say that a′ is a residual representation of a (with respect to R and n) if a·R is congruent to a′ modulo n. Residual modular multiplication concerns computing a residual representation of a·b given residual representations of a and b. In other words, given a′ and b′, residual modular multiplication produces a value u for which u·R is congruent to a′·b′ modulo n. The algorithm of Montgomery accomplishes this by eliminating the calculation of a remainder modulo n in favor of a division by R and a remainder with respect to R. By choosing R to be a suitable power of 2, the division becomes a shift operation and the remainder a mask operation. Furthermore, the basic algorithm scales up with numbers represented in a predetermined radix r and predetermined length L, so that R=rL.
Previous implementations of Montgomery multiplication have been developed for hardware, e.g., Takahashi (U.S. Pat. No. 6,973,470), Chen (U.S. Pat. Nos. 6,763,365 and 6,963,977), Monier (U.S. Pat. No. 6,035,317, European Patent 2,679,054), McGregor (U.S. Pat. No. 6,598,061) and Higashi (U.S. Pat. No. 6,151,393), Son (U.S. Patent Application 2004/0167952), Rarick (U.S. Patent Application 2004/0125948), Foster (U.S. Pat. No. 6,182,104). In addition, the recent work of McIvor et al, (C. McIvor, M. McLoone, and J. V. McCanny. “Modified Montgomery modular multiplication and RSA exponentiation techniques,” IEE Proc.—Comput. Digit. Tech., vol. 151, no. 6, pp. 402-408, November 2004) specifically targets FPGA hardware. Additional implementations targeted towards specific platforms have also been created (A. Daly and W. Marnane. “Efficient architectures for implementing Montgomery modular multiplication and RSA modular exponentiation on reconfigurable logic,” Proceedings of the 2002ACM/SIGDA tenth international symposium on Field-programmable gate arrays, Monterey, Calif., Feb. 24-26, 2002; T. Blum and C. Paar. “High-Radix Montgomery modular exponentiation on reconfigurable hardware,” IEEE Transactions on Computers, vol. 50 no. 7, pp. 759-764, July 2001).
In prior art, these solutions can be classified into three sorts, depending on the assumed radix of basic operations. First, several solutions take the radix to be 2 (that is, bitwise operations dominate). In these solutions, the number of iterations needed to compute a result is the bit-width of the Montgomery multiplier itself. Though these solutions typically avoid carry propagation during iteration by some form of carry-redundant representation of partial results, in an application to cryptography, K(the input bit length) can be 1024 or much more. So the timing overhead of requiring K iterations is substantial. Second, several solutions assume an underlying word size (w) for which the hardware supports w×w bit multiplication. These solutions cut the number of outer iterations down to K/w. And yet, each iteration requires at least 2K/w separate multiplications because two K×w bit sub-products are computed. In some solutions, these sub-products are also handled with carry-redundant representations, but computation of the sub-products themselves requires carry propagations. Third, some solutions take the radix to be 4, using either a form of Booth multiplication to eliminate redundant additions or pre-computed sub-products. While these halve the number of iterations, they incur significant overhead in preparatory calculations.
In the solutions known in prior art, it appears that complete elimination of carry propagation during the main loop is incompatible with exploiting hardware multipliers to reduce the number of iterations needed. If the hardware multipliers are used to produce full sub-products, then carry propagation will be incurred at least during the sub-product computation. Many mid-level to upper-level Field Programmable Gate Arrays (FPGAs) support a plurality of at least eight 64×64 bit multipliers operating in parallel. Similarly, with standard design tools, Application Specific Integrated Circuits (ASICs) and custom circuit implementations can easily support parallel hardware multipliers. None of the prior art solutions are able to exploit these resources.
The standard definition of the Montgomery algorithm requires that the input multiplier and multiplicand both be in reduced form with respect to the modulus. That is, both must be less than the modulus. Likewise, in most implementations the algorithm guarantees the result is also in reduced form. This guarantee is kept by comparing n to a result that is known to be less than 2n, and subtracting n if necessary. In the context of most applications including Montgomery exponentiation, these requirements are unnecessary. Except for the final step, it suffices (for correctness) to have all intermediate values as residual representations without regard for their size. If an implementation can guarantee a bound on the bit width of intermediate values, then an expensive comparison to the modulus can be avoided. In this invention, the comparison to n is eliminated, yielding a significant area savings and speed up.