The security of many cryptographic algorithms rests on the mathematical difficulty of factoring large integer values (whole numbers with hundreds of decimal digits or more). Factoring a particular integer means determining the unique set of prime numbers that, multiplied together, form the particular integer. A prime number is an integer greater than one that has as factors only the number itself and the number one.
Many cryptographic algorithms also employ modular arithmetic, in which intermediate and final results are expressed as integers in the range from 0 to m−1 for a number m called a modulus. The modular reduction operation is here represented by the term “mod.” The modular reduction operation has two parameters, the modulus m and an integer a, and one result, the integer b in the range from 0 to m−1 such that a=b+k*m for some integer k. Effectively, the output b of the modular reduction operation is the remainder, or residue, of dividing the input integer a by the modulus m. If a is non-negative and less than m, then b is the same as a. The modular reduction operation is herein expressed as “a modulo m equals b” and written as
a mod m = b
Alternatively, this is expressed as “a is equivalent to b modulo m” and written as
a ≡ b [mod m]
where [mod m] in square brackets indicates the immediately preceding number or variable is the output of the modulo operation. That is, the integer b, the residue of the modular reduction operation on the integer a and the modulus m, always lies between 0 and m−1, whereas the integer a need not. Other modular arithmetic operations commonly employed in cryptographic processing include modular addition (the modular reduction of a sum of two integers), modular subtraction (the modular reduction of a difference between two integers), modular multiplication (the modular reduction of a product of two integers), modular division (the modular reduction of a quotient of a first integer divided by a second integer) and modular exponentiation (the modular reduction of a first integer raised to the power of a second integer).
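The operations listed above can be illustrated with a minimal Python sketch. The function names are hypothetical and chosen only for this illustration. Modular division is interpreted here as multiplication by a modular inverse, which is an assumption on top of the text and requires the divisor to be coprime to the modulus; the three-argument pow with a negative exponent requires Python 3.8 or later.

```python
def mod_reduce(a: int, m: int) -> int:
    """Return the residue b with 0 <= b < m such that a = b + k*m."""
    return a % m  # Python's % is non-negative for a positive modulus


def mod_add(a: int, b: int, m: int) -> int:
    return (a + b) % m


def mod_sub(a: int, b: int, m: int) -> int:
    return (a - b) % m


def mod_mul(a: int, b: int, m: int) -> int:
    return (a * b) % m


def mod_div(a: int, b: int, m: int) -> int:
    # Assumption: "division" means multiplying by the inverse of b modulo m.
    # pow(b, -1, m) raises ValueError when b and m are not coprime.
    return (a * pow(b, -1, m)) % m


def mod_exp(a: int, e: int, m: int) -> int:
    return pow(a, e, m)


if __name__ == "__main__":
    m = 7
    print(mod_reduce(23, m))   # 23 = 2 + 3*7
    print(mod_sub(3, 5, m))    # negative difference wraps into 0..m-1
    print(mod_div(6, 4, m))    # x such that 4*x = 6 [mod 7]
```

Note that the residue of a negative difference still falls in the range from 0 to m−1, consistent with the definition of b above.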
Modular multiplication and exponentiation are often performed based on Montgomery's algorithm, well known in the art, and described in the article “Modular Multiplication without Trial Division,” by P. L. Montgomery, in Mathematics of Computation, v 44, n. 170, 1985, pp. 519-521.
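For illustration only, the following Python sketch shows one common textbook formulation of Montgomery multiplication for an odd modulus. It is not taken from the cited article verbatim; the function names, the choice of R as the smallest power of two exceeding m, and the use of R^2 mod m to convert operands into Montgomery form are assumptions made for this example.

```python
def montgomery_setup(m: int):
    """Precompute per-modulus values for an odd modulus m (illustrative names)."""
    if m % 2 == 0:
        raise ValueError("modulus must be odd")
    k = m.bit_length()
    r = 1 << k                       # R = 2^k > m, coprime to odd m
    m_prime = (-pow(m, -1, r)) % r   # m * m_prime = -1 [mod R]
    r2 = (r * r) % m                 # R^2 mod m, used to enter Montgomery form
    return k, m_prime, r2


def redc(t: int, m: int, k: int, m_prime: int) -> int:
    """Montgomery reduction: return t * R^-1 mod m for 0 <= t < m * R."""
    mask = (1 << k) - 1
    u = ((t & mask) * m_prime) & mask   # chosen so that t + u*m is divisible by R
    s = (t + u * m) >> k                # exact division by R is a shift
    return s - m if s >= m else s       # one conditional subtraction, no trial division


def mont_mul(a: int, b: int, m: int) -> int:
    """Return a*b mod m for 0 <= a, b < m using Montgomery reduction."""
    k, m_prime, r2 = montgomery_setup(m)
    a_bar = redc(a * r2, m, k, m_prime)         # a*R mod m
    b_bar = redc(b * r2, m, k, m_prime)         # b*R mod m
    p_bar = redc(a_bar * b_bar, m, k, m_prime)  # a*b*R mod m
    return redc(p_bar, m, k, m_prime)           # leave Montgomery form


if __name__ == "__main__":
    print(mont_mul(45, 76, 97) == (45 * 76) % 97)
```

The per-modulus values computed in montgomery_setup are the kind of quantity that has to be available before any multiplication modulo m can start, which is why the cost of obtaining them matters when moduli change often.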
Cryptographic processing systems can be implemented in software, but speed is often significantly increased by implementing some of the steps in special purpose hardware such as electronic circuits. Such hardware typically takes the form of an application specific integrated circuit (ASIC), a “chip,” which is composed of separate blocks of circuitry that each perform a certain combination of one or more steps of the computation. The blocks of circuitry are connected so that the output of one block is fed as input to another block. At many steps, a set of parallel connections between blocks is devoted to passing every binary digit (bit) of input and output during each processing cycle. Efficient, thoroughly tested, small footprint blocks have been developed for several modulo computations. Common circuit blocks employed in cryptographic processing systems include modular reduction (MR) blocks, modular addition (MA) blocks, modular subtraction (MS) blocks, modular multiplication (MM) blocks, modular division (MD) blocks and modular exponentiation (ME) blocks.
In designing and building circuits to perform cryptographic processing one often has to trade the size of the circuitry for latency. The size of the circuitry is often measured in number of fundamental components called gates. The latency is often measured in the number of processing cycles. A gate transforms an input set of one or more bits to an output set of one or more bits during each processing cycle. Chips with fewer gates that are reused in subsequent processing cycles require more processing cycles to complete processing and increase latency. Chips with more gates that can complete processing in fewer processing cycles are larger, cost more and consume more power than chips with fewer gates. As a consequence, there are many alternatives for the architecture of the individual blocks and the arrangement of multiple blocks in processing systems.
The number of gates on a block is also related to the maximum number of bits of the input to and output from the block during one processing cycle; the more bits the more gates. The blocks are usually designed for integers up to a certain maximum number of bits. For example, existing MR blocks use precision division or successive subtractions for a limited number of bits, typically 128 bits or fewer. The use of precision division or successive subtraction becomes unwieldy at larger input and modulus sizes, such as at 1024 bits and 2048 bits. The number of processing cycles used for successive subtractions increases with the difference between the number of bits for the large integer and the number of bits for the modulus. This difference can sometimes be quite large, on the order of 1000 bits.
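The dependence of the cycle count on the bit-length difference can be sketched in software. The following Python function is a hypothetical model, not a description of any existing MR block: it assumes a shift-and-subtract scheme in which each step compares against one shifted copy of the modulus, and it counts one step per bit position.

```python
def reduce_by_shift_subtract(a: int, m: int):
    """Reduce a modulo m by shift-and-subtract; return (residue, step count).

    Each loop iteration models one compare-and-conditionally-subtract step.
    """
    if a < 0 or m <= 0:
        raise ValueError("expected a >= 0 and m > 0")
    shift = max(a.bit_length() - m.bit_length(), 0)
    steps = 0
    for s in range(shift, -1, -1):
        steps += 1
        shifted = m << s
        if a >= shifted:
            a -= shifted
    return a, steps


if __name__ == "__main__":
    # A 2048-bit integer reduced by a 1024-bit modulus (values are arbitrary).
    big = (1 << 2047) + 12345
    mod = (1 << 1023) + 1155
    residue, steps = reduce_by_shift_subtract(big, mod)
    print(steps, residue == big % mod)
```

Under this model the step count is the bit-length difference plus one, so a difference on the order of 1000 bits implies on the order of 1000 steps, each operating on full-width operands.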
For some cryptographic processing, modular reduction is performed a few times on a very large integer, with a number of bits much greater than existing MR blocks support, and more frequently on integers having a number of bits less than the maximum for existing MR blocks. A widely deployed example is the RSA algorithm, invented by Rivest, Shamir and Adleman and described in the reference Applied Cryptography: Protocols, Algorithms, and Source Code in C, by Bruce Schneier, 1996, John Wiley & Sons, New York (hereinafter referenced as Schneier). In this algorithm, the Chinese Remainder Theorem, well known in the art, is employed to break down a larger problem with a large modulus M, where M is equal to the product of two primes P1 and P2, into two smaller problems with the smaller moduli P1 and P2. The residue of a large text T modulo P1 and the residue of T modulo P2 are needed (where T is the cipher text during decryption).
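The role of the two residues can be seen in a minimal Python sketch of RSA decryption using the Chinese Remainder Theorem. The toy key (P1 = 61, P2 = 53, public exponent 17, private exponent 2753) is a small textbook example and far too small for real use; the function name and the particular recombination step (Garner's formula) are assumptions made for this illustration and are not drawn from the source.

```python
def rsa_crt_decrypt(t: int, d: int, p1: int, p2: int) -> int:
    """Decrypt cipher text t with private exponent d, where M = p1 * p2."""
    # The two reductions of the full-size integer T by the half-size moduli.
    t1 = t % p1
    t2 = t % p2
    # Two half-size exponentiations replace one full-size exponentiation.
    x1 = pow(t1, d % (p1 - 1), p1)
    x2 = pow(t2, d % (p2 - 1), p2)
    # Recombine the half-size results into the unique answer modulo p1 * p2.
    p2_inv = pow(p2, -1, p1)          # requires Python 3.8 or later
    h = (p2_inv * (x1 - x2)) % p1
    return x2 + h * p2


if __name__ == "__main__":
    p1, p2, e, d = 61, 53, 17, 2753
    message = 65
    cipher = pow(message, e, p1 * p2)
    print(rsa_crt_decrypt(cipher, d, p1, p2) == message)
```

Only the first two lines of the function touch an operand of the size of M; every later step works on operands of the size of P1 or P2.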
In current implementations, the smaller residues, e.g., T mod P1 and T mod P2, are used in subsequent processing steps that employ hardware designed to handle integers of the size of the residues, e.g., of the sizes of P1 or P2, but not of the size of the large integer, e.g., the size of T (also the size of M). Therefore the residues of the large texts are often computed in software and then passed as input to the hardware to continue the processing. The software computation of the residue is a performance hindrance.
Based on the foregoing, there is a clear need for an MR block that provides a smaller residue of a very large integer, which is not too costly in chip size and latency.
Furthermore, Montgomery multiplication modulo a modulus m involves a factor called a Montgomery Constant that depends on m. In a past approach, the Montgomery Constant is computed in software for each modulus involved in the cryptographic processing and stored in one or more registers on the cryptographic processing chip. In the RSA algorithm, three moduli (M, P1, P2) are used for each private-key-public-key pair, so that three Montgomery Constants have to be determined and stored in three registers on the chip for each key pair; supporting a large number of key pairs therefore consumes valuable chip area. Assuming use of 4,000 key pairs, which is reasonable for a practical implementation, the memory required to store the three Montgomery Constants (one each for M, P1 and P2) for every key pair is approximately 12 megabits, excluding other pre-calculated constants.
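The storage estimate and the cost of recomputation can both be sketched in Python. Two assumptions are made that the text does not state: the 12 megabit figure is reproduced here by taking each stored constant to be about 1024 bits wide, and the Montgomery Constant is taken to be R^2 mod m with R equal to 2 raised to the bit length of m, which is one common definition. The function names are hypothetical.

```python
def storage_bits(key_pairs: int, constants_per_pair: int, bits_per_constant: int) -> int:
    """Total register storage needed to hold precomputed constants."""
    return key_pairs * constants_per_pair * bits_per_constant


def montgomery_constant(m: int) -> int:
    """Compute R^2 mod m (R = 2^k, k = bit length of m) by repeated doubling.

    Assumed definition of the constant; uses only shifts, compares and
    subtractions, with one conditional subtraction per doubling.
    """
    if m < 1:
        raise ValueError("modulus must be positive")
    k = m.bit_length()
    x = 1 % m
    for _ in range(2 * k):   # 2k doublings yield 2^(2k) mod m
        x <<= 1
        if x >= m:
            x -= m
    return x


if __name__ == "__main__":
    # 4,000 key pairs, three constants each, 1024 bits assumed per constant.
    print(storage_bits(4000, 3, 1024))
    print(montgomery_constant(97) == pow(2, 14, 97))
```

Under these assumptions the stored constants occupy 12,288,000 bits, while recomputing one constant costs a number of doubling steps equal to twice the bit length of the modulus.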
Other cryptographic processing algorithms that compute Montgomery Constants include Diffie-Hellman key generation and the Digital Signature Algorithm (DSA), both well known in the art and described in Schneier. To support multiple key pairs, multiple sets of three registers can sometimes be involved, consuming even more valuable area on the chip. For example, in the Ephemeral Diffie-Hellman key pair generation algorithm, well known in the art, the moduli can change for each secret key generation. In this algorithm, the constants cannot be pre-computed at all, but are necessarily computed after initiation of each exchange sequence.
Based on the foregoing there is a clear need for computing Montgomery Constants as needed for Montgomery multiplication in MM and ME blocks, so that the number of registers on the chips to store Montgomery Constants can be reduced without excessively increasing latency.
Based on the foregoing, there is also a clear need for a cryptographic processing system that both computes Montgomery Constants as needed and provides hardware components for modular reduction of very large integers without excessively increasing latency.
The past approaches described in this section could be pursued, but are not necessarily approaches that have been previously conceived or pursued. Therefore, unless otherwise indicated herein, the approaches described in this section are not prior art to the claims in this application and are not admitted to be prior art by inclusion in this section.