1. Field of the Invention
The present invention relates to a multiple-word multiplication-accumulation (MAC) circuit and a Montgomery modular MAC circuit. More particularly, the present invention relates to a multiple-word MAC circuit suitable for execution of modular multiplication and accumulation for the Montgomery algorithm, as well as to a Montgomery modular MAC circuit based on the same.
2. Description of the Related Art
Recent years have seen rapid growth of the online trade market, also known as electronic commerce (e-commerce), where commercial transactions involving money transfers take place over a network. People are exchanging personal information such as their credit card numbers over a network more frequently than ever before. Such important personal information has to be protected from eavesdropping and tampering attacks by a malicious third party. The use of cryptographic techniques is thus mandatory to ensure the security of information transfer in e-commerce.
Public-key cryptosystems, one class of modern cryptographic algorithms, use a pair of cryptographic keys called public and private keys. The sender encrypts his/her message with the receiver's public key, and the receiver decrypts the received message with his/her own secret private key. Suppose, for example, that a person is purchasing a product at an online store. The online store's server sends its public key to the purchaser, allowing him/her to send his/her credit card number and other information in encrypted form. The store can decode the received information by using its private key. The advantage of this system is that the public key is made available literally to the public. That is, public-key cryptosystems permit us to achieve secure communication with anyone who has publicized an encryption key.
One example of a public-key algorithm is RSA, named after its three creators: Ron Rivest, Adi Shamir, and Leonard Adleman. The RSA cryptosystem uses a modular multiplication process to ensure the secrecy of ciphertext, relying on the difficulty of prime factorization of a large integer. That is, when a certain number x and an integer n are given, it is relatively easy for a computer to calculate a power of x modulo n, where “modulo” is the operator that calculates the remainder of division by n. But, because of the difficulty of prime factorization, it is very hard to accomplish the reverse process when n is very large, meaning that the original number x cannot be reproduced easily. RSA is grounded in this property of modular arithmetic.
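As a toy illustration of the asymmetry described above, the following Python sketch runs one RSA round trip. The primes, exponent, and plaintext are hypothetical values chosen only to keep the numbers readable; a practical modulus is thousands of bits long, and the variable names are not taken from any cited reference.

```python
# Toy RSA round trip. All numeric values are illustrative assumptions.
p, q = 61, 53              # secret primes (far too small for real use)
n = p * q                  # public modulus: 3233
phi = (p - 1) * (q - 1)    # 3120; easy to compute only if p and q are known
e = 17                     # public exponent, chosen coprime to phi
d = pow(e, -1, phi)        # private exponent (modular inverse, Python 3.8+)

x = 65                     # plaintext, must satisfy 0 <= x < n
c = pow(x, e, n)           # encryption: a power of x modulo n
recovered = pow(c, d, n)   # decryption with the private exponent
print(recovered == x)
```

Computing `c` from `x` needs only modular multiplications, whereas recovering `d` without knowing `p` and `q` would require factoring `n`.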
RSA, however, requires a larger amount of computation for modular multiplication than symmetric cryptosystems such as the Data Encryption Standard (DES), and this fact leads to demands for a faster algorithm. The Montgomery modular multiplication method is one of the solutions for reducing the computational burden. When a radix R coprime to an integer N is selected such that R>N, the Montgomery algorithm computes T×R⁻¹ mod N (i.e., the remainder of division of T×R⁻¹ by integer N) from an input value T satisfying 0≦T≦R×N, where modulus N is represented as an irreducible polynomial of degree N. R⁻¹ denotes the multiplicative inverse element of radix R, modulo N; i.e., R and R⁻¹ satisfy R×R⁻¹ mod N=1. The algorithm achieves “modulo N” computation by repeating addition, multiplication, and shift operations, without the need for time-consuming division, which requires repetitive subtractions. This is a great advantage particularly when the integer N is very large. For details, see: P. L. Montgomery, “Modular Multiplication without Trial Division,” Mathematics of Computation, Vol. 44, No. 170, pp. 519-521, 1985.
More specifically, with input values A, B, C, R, N, and ND, the algorithm proceeds in the following way:
T=A×B+C  (1)
M=T×ND mod R  (2)
X=(M×N+T)/R  (3)
Y=X−N  (4)
where ND satisfies the condition of R×R⁻¹−N×ND=1, and T and M are intermediate variables. X and Y are output values, where X will be selected as the final outcome of the computation when Y<0, and Y will be selected when Y≧0, so that the outcome is always non-negative and smaller than N. Each input value A, B, C, and N is given in the form of multiple-word data with a length of, for example, 2048 bits and stored as a series of data words in a certain memory area. Radix R is a power of 2 (e.g., 2²⁰⁴⁸), and the memory stores its exponent only (2048 in this case), instead of the value of R per se. Essentially, the input value ND is multiple-word data like other input values. However, only a limited number of bits of ND are required in actual computation, depending on the bit width of the multiplier used. The reason for this will be discussed later.
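Equations (1) to (4) can be checked numerically with a short Python sketch that relies on arbitrary-precision integers. The function name `montgomery_mac`, the parameter `k` (the exponent of R), and the sample values are assumptions made for illustration. The sketch assumes N is odd and A, B, C are smaller than N, and it derives ND from N rather than taking it as an input.

```python
def montgomery_mac(A, B, C, N, k):
    """Return (A*B + C) * R^-1 mod N for R = 2**k, following equations (1)-(4).

    Assumes N is odd, N < R, and 0 <= A, B, C < N.
    """
    R = 1 << k
    # R*R^-1 - N*ND = 1 implies N*ND = -1 (mod R), so ND = -N^-1 mod R.
    ND = (-pow(N, -1, R)) % R        # modular inverse needs Python 3.8+
    T = A * B + C                    # equation (1)
    M = (T * ND) & (R - 1)           # equation (2): "mod R" is a bit mask
    X = (M * N + T) >> k             # equation (3): exact division by R is a shift
    Y = X - N                        # equation (4)
    # X lies in [0, 2N); pick whichever of X and Y falls in [0, N).
    return Y if Y >= 0 else X


# Hypothetical sample values: a 61-bit odd modulus with R = 2**64.
N = (1 << 61) - 1
result = montgomery_mac(123456789, 987654321, 42, N, 64)
print(0 <= result < N)
```

The sum M×N+T is always a multiple of R because M×N is congruent to −T modulo R, which is why the right shift in equation (3) loses no information.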
The first equation (1) is a sum-of-product operation, which is usually implemented as iterations of (d,e)=a×b+c+d by using a multiplier-accumulator (MAC operator) with a fixed operation data width (e.g., 32 bits), where d and e represent upper bits and lower bits of the result obtained at each iteration, and where d is fed back to the input side for use in the subsequent iteration. More specifically, the following shows the algorithm for executing the first equation (1):
(1.1)  for (i=0 to s−1) {
(1.2)    a:=A[i]
(1.3)    d:=0
(1.4)    for (j=0 to s−1) {
(1.5)      b:=B[j]
(1.6)      if (i==0) c:=C[j]
(1.7)      else c:=T[i+j]
(1.8)      (d,e):=a×b+c+d
(1.9)      T[i+j]:=e
(1.10)   }
(1.11)   T[i+s]:=d
(1.12) }
This process includes a double loop structure with loop variables i and j, in which an inner loop of lines 1.4 to 1.10 is nested within an outer loop of lines 1.1 to 1.12, so that the iterative calculation will proceed from lower bits to upper bits. The symbol s in lines 1.1, 1.4, and 1.11 represents a constant that is obtained by dividing data length by word length. The brackets [ ] are used to point at a particular data word of a variable, counted from its least significant word. For example, A[0] represents the bottom 32 bits of variable A in the case the word length is 32 bits.
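The double loop above can be exercised with the following Python sketch, which mirrors lines 1.1 to 1.12 on little-endian word lists. The helper names `to_words` and `from_words`, the function name `multiword_mac`, and the default word width are assumptions introduced for illustration only.

```python
def to_words(x, s, w=32):
    """Split integer x into s words of w bits, least significant word first."""
    mask = (1 << w) - 1
    return [(x >> (w * i)) & mask for i in range(s)]


def from_words(words, w=32):
    """Reassemble an integer from little-endian w-bit words."""
    return sum(word << (w * i) for i, word in enumerate(words))


def multiword_mac(A, B, C, w=32):
    """Compute T = A*B + C word by word; A, B, C are lists of s words each."""
    s = len(A)
    mask = (1 << w) - 1
    T = [0] * (2 * s)
    for i in range(s):                           # line 1.1
        a = A[i]                                 # line 1.2
        d = 0                                    # line 1.3
        for j in range(s):                       # line 1.4
            b = B[j]                             # line 1.5
            c = C[j] if i == 0 else T[i + j]     # lines 1.6-1.7
            acc = a * b + c + d                  # line 1.8, fits in 2*w bits
            d, e = acc >> w, acc & mask          # upper word d, lower word e
            T[i + j] = e                         # line 1.9
        T[i + s] = d                             # line 1.11
    return T


# Hypothetical 128-bit operands split into four 32-bit words.
a_val, b_val, c_val = 2**127 - 1, 2**128 - 159, 2**100 + 7
t_words = multiword_mac(to_words(a_val, 4), to_words(b_val, 4), to_words(c_val, 4))
print(from_words(t_words) == a_val * b_val + c_val)
```

The accumulator never overflows 2×w bits, since the largest possible value of a×b+c+d is (2^w−1)² + 2×(2^w−1) = 2^(2w)−1.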
FIG. 11 is a block diagram of a conventional multiple-word MAC circuit. This multiple-word MAC circuit 900 is composed of a memory 901, a MAC operator 902, and registers 903, 904, 905, 906, and 907. The memory 901 stores given multiple-word input values A, B, and C for computation. Those input values A, B, and C are read out of the memory 901 and supplied to the MAC operator 902 word by word through registers 903, 904, and 905. The MAC operator 902 calculates A×B+C, the result T being written back to the memory 901 via a register 907.
The conventional multiple-word MAC circuit 900 of FIG. 11 executes the foregoing algorithm as follows. Suppose that there is a control circuit (not shown) that produces loop variables i and j and supplies them to the memory 901 as its address input. Input values A, B, and C are arrays of words, each word being W bits in width. The multiple-word MAC circuit 900 reads one word of each input value, A[i], B[j], and C[j], out of the memory 901, where T[i+j] may be read instead of C[j]. Those words are set to the registers 903, 904, and 905 as new data for the MAC operator inputs a, b, and c, respectively. The MAC operator 902 calculates (d,e):=a×b+c+d (i.e., executes line 1.8) and sends the results d and e to their corresponding registers 906 and 907. The former register 906 allows d to be fed back to the MAC operator 902 as an input for the next cycle, while the latter register 907 supplies e to the memory 901 to substitute for T[i+j].
The conventional multiple-word MAC circuit 900 executes the foregoing algorithm in the way described above. This process involves reading input values b and c from the memory 901 and writing an output value e to the same in each iteration of the inner loop, after reading the multiplicand a at the beginning (line 1.2) of the outer loop. In order to run the MAC operator 902 at the full clock speed without stopping the pipeline, the memory 901 is required to be able to transfer three words per clock cycle (two reads and one write). The use of a multiport memory device or a plurality of separate memory devices is recommended as a solution for giving sufficient bandwidth to the memory 901. See, for example, the Japanese Patent Application Publication No. 2002-207589, FIG. 1.
Referring back to the algorithm of (1) to (4), the third equation (3) is executed basically in the same way as (1). The division by R can be accomplished by simply shifting the dividend to the right since the divisor R is a power of 2. The first three equations (1), (2), and (3) may be processed one by one, but this approach is not necessarily efficient since the multiple-word value ND has to be calculated in its full length. One improved processing technique in this respect is described in Ç. K. Koç, “High-Speed RSA Implementation,” Technical Report TR 201, RSA Laboratories, Version 2.0, November, pp. 48-49, 1994. According to the algorithm proposed in this paper, the first equation (1) calculates an intermediate variable T step by step, from the bottom bits to the top bits. Each time a new value of T comes out, the process executes equation (2) with respect to the bottom word of that value of T, thus determining the bottom word of another intermediate variable M. With those partial values of T and M, the process proceeds to equation (3) to calculate a corresponding part of output value X. While ND in equation (2) is essentially a multiple-word value, the improved algorithm requires not all words together, but only one word at a time.
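The idea of interleaving equations (1), (2), and (3) one word at a time can be sketched in Python as follows. This is a simplified illustration in the spirit of the word-serial approach, not a transcription of the cited report: the function name `word_serial_montgomery`, the variable `nd0`, and the choice to scan A word by word while keeping T as a single big integer are all assumptions. It assumes N is odd, B and C are smaller than N, and A fits in s words.

```python
def word_serial_montgomery(A, B, C, N, s, w=32):
    """Return (A*B + C) * 2**(-s*w) mod N using only the bottom word of ND.

    Assumes N is odd, 0 <= B, C < N, and 0 <= A < 2**(s*w).
    """
    base = 1 << w
    mask = base - 1
    # Bottom word of ND: ND = -N^-1 mod R, hence ND mod 2**w = -N^-1 mod 2**w.
    nd0 = (-pow(N, -1, base)) % base    # modular inverse needs Python 3.8+
    T = C
    for i in range(s):
        a_i = (A >> (w * i)) & mask     # word A[i]
        T += a_i * B                    # partial step of equation (1)
        m = ((T & mask) * nd0) & mask   # equation (2) on the bottom word of T
        T = (T + m * N) >> w            # partial step of equation (3): drop a zero word
    # T stays below 2N throughout, so one conditional subtraction suffices.
    return T - N if T >= N else T


# Hypothetical sample values: a 127-bit odd modulus, s = 4 words of 32 bits.
N = (1 << 127) - 1
out = word_serial_montgomery(3**70, 5**50, 7**40, N, s=4)
print(0 <= out < N)
```

Each pass clears the bottom word of T before shifting, so after s passes the accumulated value has been divided by R = 2^(s×w) without ever forming the full-length product T×ND.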