The background description provided herein is for the purpose of generally presenting the context of the disclosure. Work of the presently named inventors, to the extent the work is described in this background section, as well as aspects of the description that may not otherwise qualify as prior art at the time of filing, are neither expressly nor impliedly admitted as prior art against the present disclosure.
Cryptographic systems for data communication and storage may use public key cryptography. In public key cryptography, data may be encrypted and decrypted with a pair of keys. For example, a public key may be used to encrypt the data and a private key may be used to decrypt the data. Deriving the private key from the public key is considered computationally infeasible for suitably large keys. Cryptographic systems that use public key cryptography include, for example, Rivest, Shamir, and Adleman (RSA) cryptography and elliptic curve cryptography (ECC) over prime fields.
In public key cryptography, plain text data is encrypted into cipher text using the public key at a first node. The cipher text is transmitted to a second node. At the second node, the cipher text is decrypted into the plain text data using the private key.
For example, to generate the public key (n, e) in RSA cryptography, two distinct prime numbers p and q of similar bit length are selected. A modulus n=pq is calculated and used for both the public and private keys. The integer e is selected such that 1<e<φ(pq), where e and φ(pq) share no divisors other than 1 (i.e., e and φ(pq) are co-prime), and φ corresponds to Euler's totient function. The private key (n, d) includes an integer d that satisfies de≡1 (mod φ(pq)).
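The key generation steps above can be illustrated with a minimal Python sketch. The function name rsa_keygen and the toy values p=61, q=53, and e=17 are assumptions chosen for readability and are not taken from the disclosure; real keys use primes that are hundreds of digits long.

```python
from math import gcd


def rsa_keygen(p: int, q: int, e: int):
    """Return ((n, e), (n, d)) for toy primes p and q (illustration only)."""
    if p == q:
        raise ValueError("p and q must be distinct primes")
    n = p * q                 # modulus shared by both keys
    phi = (p - 1) * (q - 1)   # Euler's totient of n for distinct primes p, q
    if not (1 < e < phi) or gcd(e, phi) != 1:
        raise ValueError("e must satisfy 1 < e < phi and be co-prime to phi")
    d = pow(e, -1, phi)       # modular inverse of e (requires Python 3.8+)
    return (n, e), (n, d)


# Hypothetical toy parameters, far too small for real use.
public_key, private_key = rsa_keygen(61, 53, 17)
```

With these toy values, n=3233 and φ(pq)=3120, and the computed d satisfies de≡1 (mod 3120).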
To encrypt a message M into cipher text at the first node, the message M is converted to an integer m, where 0≦m<n, using a padding scheme. The cipher text c is calculated according to c=m^e mod n and transmitted to the second node. At the second node, m is recovered according to m=c^d mod n, and M is recovered from m according to the padding scheme.
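As a rough illustration of the encryption and decryption steps, the following Python sketch applies the two modular exponentiations to a small integer. The constants N, E, and D are a hypothetical toy key (n=3233, e=17, d=2753), the function names are assumptions, and the padding scheme is deliberately omitted, so this is not a secure implementation.

```python
# Hypothetical toy key: n = 61 * 53, e = 17, d = 2753 (illustration only).
N, E, D = 3233, 17, 2753


def encrypt(m: int, n: int = N, e: int = E) -> int:
    """Return c = m^e mod n for an already-encoded integer 0 <= m < n."""
    if not 0 <= m < n:
        raise ValueError("m must satisfy 0 <= m < n")
    return pow(m, e, n)


def decrypt(c: int, n: int = N, d: int = D) -> int:
    """Return m = c^d mod n."""
    return pow(c, d, n)


c = encrypt(65)   # cipher text computed at the first node
m = decrypt(c)    # integer recovered at the second node
```

Here the built-in three-argument pow performs the modular exponentiation, and decrypting c returns the original integer 65.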
Accordingly, ECC and RSA cryptography each involve multiplication and division of large operands. For example, c=m^e mod n may be computed by repeatedly multiplying by m and reducing modulo n after each multiplication. Because each reduction modulo n would otherwise require a division, Montgomery multiplication is typically used when the modulus is large (e.g., 128 to 1024 bits).
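The interleaving of multiplication and reduction can be sketched in Python as follows. Note that this sketch uses the common square-and-multiply method rather than literal repeated multiplication by m, which is a substitution made here for brevity; the function name mod_exp is an assumption. Each "% modulus" operation marks a reduction step of the kind that Montgomery multiplication is meant to make cheaper.

```python
def mod_exp(base: int, exponent: int, modulus: int) -> int:
    """Compute base^exponent mod modulus, reducing after every multiplication."""
    result = 1 % modulus
    base %= modulus
    while exponent > 0:
        if exponent & 1:
            # Multiply step, followed immediately by a modular reduction.
            result = (result * base) % modulus
        # Square step, also followed by a modular reduction.
        base = (base * base) % modulus
        exponent >>= 1
    return result
```

Because every intermediate value is reduced, no operand grows beyond roughly twice the bit length of the modulus.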
For example, Montgomery multiplication of two integers X and Y with a modulus M (e.g., MM(X, Y, M)) and n bits of precision results in a number Z according to Z=MM(X, Y, M)=XY·2^(−n) mod M, where M is an integer between 2^(n−1) and 2^n and n is an integer (e.g., typically n=⌊log2 M⌋+1). In RSA cryptographic systems, M is the product of the prime numbers p and q as described above. In elliptic curve cryptographic systems over prime fields, M is a large prime number. Accordingly, M is an odd number.
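The definition of the Montgomery product can be checked numerically with a short reference sketch in Python. This version computes the inverse of 2^n directly, so it illustrates only what MM(X, Y, M) evaluates to, not how hardware computes it; the function names and the toy modulus 3233 are assumptions.

```python
def montgomery_reference(x: int, y: int, m: int) -> int:
    """Return x * y * 2^(-n) mod m, with n = floor(log2(m)) + 1."""
    if m < 3 or m % 2 == 0:
        raise ValueError("m must be an odd integer greater than 1")
    n = m.bit_length()        # so that 2^(n-1) <= m < 2^n
    r_inv = pow(2, -n, m)     # inverse of 2^n modulo m (Python 3.8+)
    return (x * y * r_inv) % m


def to_montgomery(a: int, m: int) -> int:
    """Map a to its Montgomery form a * 2^n mod m."""
    return (a << m.bit_length()) % m


# Multiply 65 by 99 modulo the toy modulus 3233 via the Montgomery domain.
M_TOY = 3233
a_bar = to_montgomery(65, M_TOY)
b_bar = to_montgomery(99, M_TOY)
z_bar = montgomery_reference(a_bar, b_bar, M_TOY)   # Montgomery form of the product
z = montgomery_reference(z_bar, 1, M_TOY)           # multiplying by 1 converts back
```

The factor 2^(−n) introduced by each product is cancelled by the 2^n carried in the Montgomery form of the operands, which is why operands are typically converted once before a long chain of multiplications.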
Referring now to FIG. 1, Montgomery multiplication may be performed according to an example radix-2 Montgomery multiplication method 10. In the method 10, S[i] is a partial sum for loop i of n loops, and q (0 or 1) is selected such that S[i]+xi·Y+qM is divisible by 2. In each loop i, if S[i]+xi·Y is odd, then M is added to the result. Further, a shift register storing X (as a plurality of words xi) is shifted to the right by 1. Accordingly, the partial sum stored as S[n] after n loops is less than or equal to 2M.
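A bit-serial loop of this kind can be sketched in Python as shown below. The sketch follows the commonly published radix-2 Montgomery algorithm and is only assumed to correspond to the method 10 of FIG. 1; the function name and the step ordering are illustrative, and each xi is treated as a single bit of X.

```python
def montgomery_multiply_radix2(x: int, y: int, m: int) -> int:
    """Bit-serial radix-2 Montgomery multiplication (illustrative sketch).

    Returns a value S with S < 2*m and S congruent to x*y*2^(-n) mod m,
    assuming m is odd, 0 <= x < m, 0 <= y < m, and n = m.bit_length().
    """
    n = m.bit_length()
    s = 0                      # partial sum S[0]
    for i in range(n):
        x_i = (x >> i) & 1     # current bit of X, as if X were shifted right
        s += x_i * y           # S[i] + x_i * Y
        if s & 1:              # q = 1 when the sum is odd
            s += m             # adding odd M makes the sum even
        s >>= 1                # exact division by 2 gives S[i+1]
    return s


s_n = montgomery_multiply_radix2(65, 99, 3233)
z = s_n - 3233 if s_n >= 3233 else s_n   # optional final subtraction into [0, M)
```

The loop uses only additions and one-bit shifts, and the single conditional subtraction at the end brings the result into the range 0 to M−1 when a fully reduced value is needed.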
The calculation in step 4 of the method 10 may be performed using a carry-save adder, which includes double registers to store partial sums and carries. Alternatively, large operands in step 4 may be split into smaller operands that can be processed using a pipelined systolic array. Adjacent processing elements (PEs) interchange the least significant bit (LSB) of the partial sum or the carry in two processing cycles. Accordingly, the latency of one Montgomery multiplication is approximately 2n cycles, where n is the bit size of the operands. (See A. Tenca and C. K. Koc, “A scalable architecture for Montgomery multiplication,” CHES 99, Lecture Notes in Computer Science, 1717, pp. 94-108, 1999; A. Tenca, G. Todorov, and C. K. Koc, “High-radix design of a scalable modular multiplier,” CHES 2001, Lecture Notes in Computer Science, 2162, pp. 185-201, 2001; and A. Tenca and C. K. Koc, “A scalable architecture for modular multiplication based on Montgomery's algorithm,” IEEE Trans. Computers, 52(9), pp. 1215-1221, 2003).
The architecture can be optimized by pre-computing the partial sum for each of the two possible values of the most significant bit (MSB) from the previous PE. Accordingly, the latency can be reduced to n cycles with a marginal increase in area. (See M. Huang, K. Gaj, S. Kwon, and T. El-Ghazawi, “An optimized hardware architecture for the Montgomery multiplication algorithm,” Proc. 11th International Workshop on Practice and Theory in Public Key Cryptography, PKC 2008, Barcelona, Spain, pp. 214-228, March 2008). Regardless of which architecture is used, a large number of flip-flops is required to perform the Montgomery multiplication. Consequently, a Montgomery multiplier architecture requires a large chip area to accommodate the flip-flops.