1. Field of the Invention
The present invention relates to a modular multiplication calculation apparatus that performs modular multiplication at high speed by using Montgomery's algorithm, for example, in RSA cryptographic processing for use in a public key cryptographic system.
2. Description of the Related Arts
The development of computer networks in recent years has brought about a rapid increase in opportunities to search databases and to transmit and receive digitized information such as electronic mail and electronic news. Furthermore, services that use such information, such as online shopping, are on the increase. Associated with such services, problems have been pointed out: eavesdropping on or tampering with digitized data available on a network, and receiving a service free of charge illegitimately by impersonating a third person. Particularly in a network utilizing a radio frequency, a countermeasure against these problems is strongly desired because eavesdropping is easy in such a network.
As a countermeasure to these problems, encrypted electronic mail and user authentication systems, to both of which a cryptographic technique is applied, have been proposed and introduced in various networks.
In this context, the encryption is deemed to be an indispensable technique in a computer network. One of such cryptographic techniques is a public key cryptographic system. What is desired for the public key cryptographic system is high speed processing because it requires a large volume of arithmetic operation processing for encryption and decryption, and various high-speed algorithms have accordingly been announced.
Encryption systems can generally be categorized into two systems, i.e., a secret key cryptographic system and a public key cryptographic system. In the secret key cryptographic system, a transmitter and a receiver perform cryptographic communication with both having the same cryptographic key. That is, the transmitter encrypts a message on the basis of a secret cryptographic key and sends it to the correspondent, and the receiver obtains the information by reverting the ciphertext back to the original message by using the same cryptographic key. In the public key cryptographic system, a transmitter encrypts a message by using the open public key of a receiver and transmits it, and the receiver decrypts the encrypted message by using her/his own secret key, thereby carrying out communications.
That is, in the public key cryptographic system, a public key is the key for encryption, while a secret key is the key for decrypting a cryptograph encrypted with a public key, and a cryptograph encrypted with the public key can be decrypted only by a secret key.
In the secret key cryptographic system, the number of keys which a person has to keep secretly is the number of correspondents and therefore the number of keys necessary for a network constituted by “n” persons is n(n−1)/2. Further, the system has a shortfall that a secret key has to be sent to the first-time correspondent by some means.
In order to eliminate the shortfall, a large scale network employs a method for equipping a key management center in the network, keeping only a secret key to be used between a person (i.e., a constituent of the network) and the center and obtaining, from the center, a secret key to be used between the constituent and correspondent when carrying out a cryptographic communication.
Meanwhile, in the public key cryptographic system, a key which a person is supposed to keep secretly is only her/his own secret key and therefore the number of keys required for a network constituted by “n” persons is n pieces.
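The two key counts stated above can be compared with a short calculation. The following is only an illustrative sketch; the function names are hypothetical and do not appear in the original description.

```python
def secret_key_count(n: int) -> int:
    """Keys needed when every pair of n persons shares its own secret key."""
    return n * (n - 1) // 2


def public_key_count(n: int) -> int:
    """Keys needed when each of n persons keeps only her/his own secret key."""
    return n


for n in (10, 100, 1000):
    print(n, secret_key_count(n), public_key_count(n))
```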
Further, only a public key needs to be sent to a first-time correspondent. The method used is to provide a key management center that registers the n public keys of the users in a public list, so that the public key of a correspondent can be obtained from the center.
In this case, the center is only required to prevent unauthorized alteration of the public keys and need not keep them secret. A key of the public key cryptographic system, however, has a larger number of bits than a key of the secret key cryptographic system, hence requiring a larger file size to keep the key. Further, when performing authentication in the secret key cryptographic system, the adopted method is, for example, to compression-convert a message to be transmitted by using a secret key, attach the result to the message and send them together, and for the reception side likewise to compression-convert the message and compare the results.
The problem is that the same key is used for transmission and reception, allowing the receiving person (“receiver”) to alter the authentication data. In contrast, the public key cryptographic system allows only the sender to encrypt with the secret key, hence preventing unauthorized alteration. The public key cryptographic system adopts the method in which the sender compression-converts a message, encrypts the result with the sender's secret key and sends it attached to the message, while the receiver decrypts the attached data with the sender's public key and compares it with data which the receiver has likewise compression-converted. In this case, the receiver cannot commit an unauthorized alteration.
As described above, the technique of the public key cryptographic system is deemed to be indispensable for an authentication system. The public key cryptographic system, however, has a shortfall of needing a large volume of calculation processing for encryption and decryption and there accordingly are many cases of combining the secret key cryptographic system, which allows fast processing, for encrypting a message with the public key cryptographic system for authentication.
Currently, the most prominent public key cryptographic system is the RSA cryptography invented by R. Rivest, A. Shamir and L. Adleman in 1977.
The basic principle of the RSA cryptography is as described in the following.
[The Basic Algorithm of RSA]
In an encryption key (e,N) and the corresponding decryption key (d,N), the e and N are public keys and d is a secret key.
Where a plain text is M and a cryptogram is C, the algorithm of an encryption E and that of a decryption D are represented by the following:
C=E(M)=M^e mod N
M=D(C)=C^d mod N;
where:
d*e=1 mod LCM{(p−1), (q−1)}
N=p*q
LCM: the least common multiple
p, q: large prime numbers
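As an illustration of the relationships above, the following sketch runs the encryption and decryption with deliberately tiny primes. The values of p, q, e and M are assumptions chosen only for readability; they are far too small for any real use, and the function names are hypothetical.

```python
import math

p, q = 61, 53                        # toy primes (assumed for illustration)
N = p * q                            # modulus N = p*q
lam = (p - 1) * (q - 1) // math.gcd(p - 1, q - 1)   # LCM{(p-1), (q-1)}
e = 17                               # public exponent, coprime to lam
d = pow(e, -1, lam)                  # secret exponent, d*e = 1 mod lam (Python 3.8+)


def encrypt(M: int) -> int:
    """C = M^e mod N"""
    return pow(M, e, N)


def decrypt(C: int) -> int:
    """M = C^d mod N"""
    return pow(C, d, N)


M = 65
C = encrypt(M)
assert decrypt(C) == M               # decryption recovers the plain text
```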
Large integers of about 1024 bits are usually used for e, d, M and N, and therefore even a high speed exponentiation method requires about 1500 multiple precision multiplications and residue calculations for one RSA arithmetic operation.
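The figure of about 1500 can be made plausible by counting the modular multiplications of a binary (square-and-multiply) exponentiation. This is a sketch under assumptions: the binary method is one common high speed method and is not necessarily the one intended above, and the exponent below is an artificial 1024-bit value in which half of the bits are 1.

```python
def modexp_count(base: int, exp: int, mod: int):
    """Left-to-right binary exponentiation for exp > 0.

    Returns (base**exp % mod, number of modular multiplications used).
    """
    bits = bin(exp)[2:]
    result = base % mod              # accounts for the leading 1 bit
    count = 0
    for bit in bits[1:]:
        result = result * result % mod       # one squaring per remaining bit
        count += 1
        if bit == "1":
            result = result * base % mod     # one extra multiply per 1 bit
            count += 1
    return result, count


exponent = int("10" * 512, 2)        # 1024 bits, 512 of them set (assumed)
modulus = (1 << 1023) + 1            # an arbitrary odd 1024-bit modulus (assumed)
value, multiplications = modexp_count(3, exponent, modulus)
print(multiplications)
```

With this exponent the loop performs 1023 squarings and 511 multiplications, which is consistent with the order of magnitude quoted above.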
Particularly for the residue calculation, a large number of high speed calculation methods have been proposed, such as an approximation method, a residue table method and Montgomery's algorithm, in order to speed up the arithmetic operation processing.
In order to process at high speed the power residue algorithms, represented by the RSA cryptography, which are frequently used for the public key cryptographic system, the residue algorithm for each cycle of calculation must be processed at high speed.
Next is a description of Montgomery's algorithm as one method for attaining high speed processing of the residue arithmetic operation. For Montgomery's algorithm, refer to: “Modular Multiplication Without Trial Division”, Peter L. Montgomery, Mathematics of Computation, Volume 44, Number 170, April 1985 pp. 519 through 528.
[Montgomery's Algorithm]
Montgomery's algorithm uses the fact that, with a modulus N and a radix R (where R>N) which is relatively prime to the modulus N, TR^−1 mod N can be calculated from a dividend T by dividing only by the radix R, thereby performing a residue calculation without using division by N.
In the following, let it be defined that N, N′, R, R^−1 and T are integers, that the dividend T satisfies 0≦T<R*N, that R^−1 is the inverse of the radix R modulo N, and that the relationship R*R^−1−N*N′=1 (where 0≦R^−1<N and 0≦N′<R) applies.
Furthermore, when a power of 2 is used for the radix R, the division by the radix R can be replaced by a shift operation, which enables high speed processing of the calculation T→TR^−1 mod N.
The algorithm REDC (T) for T→TR^−1 mod N is shown next as algorithm 1.
Here, it can be proved that T+M*N in the algorithm 1 is always divisible by R.
[Algorithm 1]
The algorithm Y=REDC (T) of T→TR^−1 mod N is represented by the following:
M=(T mod R)*N′ mod R
Y=(T+M*N)/R
if Y≧N then Y=Y−N
if Y<N then return Y
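Algorithm 1 can be sketched as follows, assuming an odd modulus N and a radix R that is a power of 2 so that "mod R" becomes a mask and "/R" becomes a shift. The helper montgomery_setup and the small values N=97, R=2^8 are assumptions for illustration only.

```python
def montgomery_setup(N: int, k_bits: int):
    """Return (R, N_prime) with R = 2**k_bits and R*R^-1 - N*N' = 1."""
    R = 1 << k_bits
    assert N % 2 == 1 and R > N          # R must be coprime to and larger than N
    R_inv = pow(R, -1, N)                # inverse of R modulo N (Python 3.8+)
    N_prime = (R * R_inv - 1) // N       # exact by the defining relationship
    return R, N_prime


def redc(T: int, N: int, k_bits: int, N_prime: int) -> int:
    """Algorithm 1: return T*R^-1 mod N for 0 <= T < R*N."""
    mask = (1 << k_bits) - 1
    M = ((T & mask) * N_prime) & mask    # M = (T mod R)*N' mod R
    Y = (T + M * N) >> k_bits            # T + M*N is divisible by R
    if Y >= N:
        Y -= N
    return Y


N, K = 97, 8                             # toy modulus and radix exponent (assumed)
R, N_PRIME = montgomery_setup(N, K)
T = 1234
print(redc(T, N, K, N_PRIME) == T * pow(R, -1, N) % N)
```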
One operation of REDC calculates only TR^−1 mod N, instead of a residue T mod N. Therefore, in order to calculate the residue T mod N, it only requires a REDC to be performed again on the product of REDC (T) and a pre-calculated R^2 mod N as shown below:
\[
\begin{aligned}
\mathrm{REDC}\bigl(\mathrm{REDC}(T) \ast (R^{2} \bmod N)\bigr)
  &= (TR^{-1} \bmod N) \ast (R^{2} \bmod N) \ast R^{-1} \bmod N \\
  &= TR^{-1} \ast R^{2} \ast R^{-1} \bmod N \\
  &= T \bmod N
\end{aligned}
\]
As such, the residue T mod N can be calculated.
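The double application of REDC described above can be checked numerically. The sketch below is self-contained and assumes a toy modulus N=97 with R=2^8; the name mod_via_redc is hypothetical, and N′ is derived here as −N^−1 mod R, which is equivalent to the relationship R*R^−1−N*N′=1.

```python
def redc(T: int, N: int, k_bits: int) -> int:
    """Return T*R^-1 mod N for R = 2**k_bits, odd N < R and 0 <= T < R*N."""
    R = 1 << k_bits
    n_prime = (-pow(N, -1, R)) % R       # N' = -N^-1 mod R (Python 3.8+)
    M = ((T % R) * n_prime) % R
    Y = (T + M * N) // R
    return Y - N if Y >= N else Y


def mod_via_redc(T: int, N: int, k_bits: int) -> int:
    """Return T mod N using two REDC operations and R^2 mod N."""
    R2 = (1 << (2 * k_bits)) % N         # pre-calculated R^2 mod N
    return redc(redc(T, N, k_bits) * R2, N, k_bits)


print(mod_via_redc(1234, 97, 8), 1234 % 97)
```

The second REDC receives a product of two values smaller than N, which stays below R*N as the algorithm requires.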
[The Extension of REDC to a Multiple Precision Calculation]
Next is an extension of the algorithm of REDC in the case of the modulo N or a radix R being a multiple length, that is, a multiple precision.
If the modulus N and the radix R are each a multiple precision number, the calculations of (T mod R)*N′ and M*N are multiple precision times multiple precision operations, which impose a very large processing volume and processing time on a general-purpose computer. Therefore, what is shown next is an algorithm 2 extended so as to enable this part to be processed in multiple precision times single precision.
[Algorithm 2]
The algorithm extending REDC to a multiple precision is as shown below. Where a dividend T, a parameter N and an output-use variable Y are each an r-adic number represented as follows:
T=(tg-1, tg-2, . . . , t0)r,
N=(ng-1, ng-2, . . . , n0)r,
Y=(yg, yg-1, . . . , y0)r,
n′0,
r=2^k,
R=r^g;
the TR^−1 mod N can be calculated by a repetition process of j=0 through g−1 shown below as a multiple precision times single precision by using the following algorithm. Note that, here, “( )r” indicates that it is a number expressed in an r-adic number. Further, here, the definitions are that the single precision is a single digit in an r-adic number and that, when using the same character, a capital letter is a multiple precision, a lowercase letter is a single precision and an affix in a lower case letter is the position of a digit in a multiple precision:
Y=T
for j=0 to g−1
 m=y0*n′0 mod r
 Y=Y+m*N
 Y=Y/r
next
if Y≧N then Y=Y−N
if Y<N then return Y
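A minimal sketch of algorithm 2 follows. Python integers stand in for the multiple precision values, and only the digit y0 and the value m are single precision. The parameter n′0 is assumed here to be −n0^−1 mod r, i.e. the lowest digit of N′; the function name and the test values are hypothetical.

```python
def redc_multiprecision(T: int, N: int, k: int, g: int) -> int:
    """Algorithm 2: return T*R^-1 mod N with r = 2**k, R = r**g, 0 <= T < R*N."""
    r = 1 << k
    n0_prime = (-pow(N % r, -1, r)) % r   # n'0 = -n0^-1 mod r (assumed definition)
    Y = T
    for _ in range(g):
        y0 = Y % r                        # lowest digit of Y
        m = (y0 * n0_prime) % r           # single x single, reduced mod r
        Y = Y + m * N                     # multiple x single; lowest digit becomes 0
        Y = Y // r                        # exact one-digit shift
    if Y >= N:
        Y -= N
    return Y


K, G = 8, 4                               # 8-bit digits, 4 digits (assumed)
N = 0xF123456B                            # odd modulus below R = 2**32 (assumed)
R = 1 << (K * G)
T = 0x123456789ABCDEF
print(redc_multiprecision(T, N, K, G) == T * pow(R, -1, N) % N)
```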
Carrying out the REDC again on the product of the TR^−1 mod N thus obtained and the R^2 mod N pre-calculated as described above makes it possible to calculate the T mod N.
[The Extension of REDC to a Multiple Precision Modular Multiplication]
Next is to extend the REDC algorithm to a modular multiplication arithmetic operation.
In the above algorithm, while the input T is a value satisfying 0≦T<R*N, there are many cases, in an actual RSA arithmetic operation, in which the input T is the multiplication result of integers A and B (where 0≦A<N and 0≦B<N).
In this case, the multiplication of the integers A and B is also a multiple precision integer arithmetic operation and therefore repeated calculations are performed similarly to the multiple precision extension of REDC.
In this case, if the multiplication and the REDC are repeatedly calculated in separate loops, the overhead of the repeated calculation control is doubled.
Therefore, shown next is an algorithm 3 extended so as to enable the multiplication and the REDC to be performed in the same repetition loop.
[Algorithm 3]
The algorithm REDC (A*B) extending the REDC to a multiple precision modular multiplication is as follows.
Where two numbers A and B to be multiplied together, a parameter N, a Montgomery multiplication parameter n′0 and an output-use variable Y are each an r-adic number, and if they are represented by:
A=(ag-1, ag-2, . . . , a0)r,
B=(bg-1, bg-2, . . . , b0)r,
N=(ng-1, ng-2, . . . , n0)r,
Y=(yg, yg-1, . . . , y0)r,
n′0,
r=2^k,
R=r^g,
it is possible to calculate ABR^−1 mod N as the calculation of a multiple precision times single precision through the repeated process of j=0 through g−1 as shown below:
Y=0
for j=0 to g−1
 Y=Y+A*bj
 m=y0*n′0 mod r
 Y=Y+m*N
 Y=Y/r
next
if Y≧N then Y=Y−N
if Y<N then return Y
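Algorithm 3 can be sketched in the same style. Again Python integers stand in for the multiple precision values A, N and Y, while bj and m are single digits; n′0 is assumed to be −n0^−1 mod r, and the names and test values are hypothetical.

```python
def mont_mul(A: int, B: int, N: int, k: int, g: int) -> int:
    """Algorithm 3: return A*B*R^-1 mod N with r = 2**k, R = r**g, 0 <= A, B < N."""
    r = 1 << k
    n0_prime = (-pow(N % r, -1, r)) % r   # n'0 = -n0^-1 mod r (assumed definition)
    Y = 0
    for j in range(g):
        b_j = (B >> (k * j)) & (r - 1)    # digit j of B
        Y = Y + A * b_j                   # multiple x single
        m = ((Y % r) * n0_prime) % r      # m = y0*n'0 mod r
        Y = Y + m * N                     # multiple x single; lowest digit becomes 0
        Y = Y // r                        # exact one-digit shift
    if Y >= N:
        Y -= N
    return Y


K, G = 8, 4                               # 8-bit digits, 4 digits (assumed)
N = 0xF123456B                            # odd modulus below R = 2**32 (assumed)
R = 1 << (K * G)
A, B = 0x12345678, 0x0FEDCBA9
ABR_INV = mont_mul(A, B, N, K, G)                 # A*B*R^-1 mod N
AB_MOD_N = mont_mul(ABR_INV, R * R % N, N, K, G)  # second pass with R^2 mod N
print(AB_MOD_N == A * B % N)
```

The second call illustrates how a further REDC with the pre-calculated R^2 mod N yields AB mod N.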
In the above algorithm 3, the A and N are multiple precision, while the bj and n′0 are single precision, and the A and bj are multiplied together, meaning that the calculation is that of a multiple precision times single precision.
It is possible to calculate AB mod N by performing an REDC again on the product of the ABR^−1 mod N thus obtained and the pre-calculated R^2 mod N described above.
[Extension of REDC to a Multiple Precision Times Single Precision Processing]
While the algorithm 3 makes it possible to attain a multiple precision Montgomery modular multiplication in multiple precision times single precision, the arithmetic operation part of the multiple precision times single precision is further extended here so that it is performed as a combination of calculations in single precision times single precision.
In this case, the calculation part of A*bj and that of m*N become repeated calculations and, if the two multiplications are repeatedly calculated in separate loops in a similar manner to the above described case, the overhead of the repeated calculation control is doubled.
Accordingly, if the two multiplications are designed to be performed in the same repetition loop, such overhead can be reduced.
Shown next is an algorithm 4 which is extended so as to enable the two multiplications in the same repetition loop.
[Algorithm 4]
An algorithm REDC (A*B) extending the REDC to a single precision times single precision is shown below.
Where two numbers A and B to be multiplied together, a parameter N, a Montgomery multiplication parameter n′0, an output-use variable Y and a carry variable C are each an r-adic number, and if they are represented by:
A=(ag-1, ag-2, . . . , a0)r,
B=(bg-1, bg-2, . . . , b0)r,
N=(ng-1, ng-2, . . . , n0)r,
Y=(yg, yg-1, . . . , y0)r,
C=(c1, c0)r,
n′0,
r=2^k,
R=r^g;
and if the temporary variables of one digit of the r-adic number are defined as tmp1, tmp2, tmp3 and tmp4, it is possible to obtain the ABR^−1 mod N by the calculation in a single precision times single precision through the repeated processing of i and j shown below:
Y=0
for j=0 to g−1
 .................................
 (tmp2,tmp1)r =y0 +a0*bj
 m=tmp1*n′0 mod r
 (tmp4,tmp1)r =tmp1+m*n0             : Pre-core processing
 (c1 ,c0 )r =tmp2+tmp4
 for i=1 to g−1
 .................................
  (tmp3,tmp2,tmp1)r =yi +(c1 ,c0 )r +ai *bj
  (tmp4,yi−1 )r =tmp1+m*ni           : Core processing
  (c1 ,c0 )r =tmp4+(tmp3,tmp2)r
 next i
 .................................
 (c1 ,c0 )r =(c1 ,c0 )r +yg
 yg−1 =c0                            : Post-core processing
 yg =c1
next j
.................................
if Y≧N then Y=Y−N                    : Correction processing
if Y<N then return Y
Here, ( )r indicates that the single-digit r-adic variables in parentheses are handled together as a multiple precision number. The tmp1 through tmp4, c0 and c1 each mean that a multiple precision number expressed in the r-adic number is divided into a plurality of bit strings, each of which is expressed by a single precision value. As an example, the tmp1 is the single precision value represented by the lower k bits of the result of the product-sum calculation y0+a0*bj, and the tmp2 is the single precision value represented by the higher-order bits of the same result. Meanwhile, the tmp3 and c1 are each represented by one digit of the r-adic number, with the content being a value of one bit.
For the output-use variable Y, the reason for storing an output in yi-1, where a value to be used for a calculation is yi, is that the function of Y=Y/r in the algorithm 3 is attained by thus storing it.
Further terminologies, for convenience, are such that the outside loop is called a j-loop, the inside loop is called an i-loop; the start from the beginning of the j-loop to the i-loop is called pre-core processing, the process within the i-loop is called core processing, and the end of the i-loop to that of the j-loop is called post-core processing.
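The pre-core, core and post-core processing of algorithm 4 can be sketched digit by digit as follows. This is an illustration under assumptions: the i-loop runs from 1 because the pre-core processing already handles digit 0, n′0 is taken as −n0^−1 mod r, the pair (c1,c0)r is held in one integer c, and all names and test values are hypothetical.

```python
def to_digits(x: int, k: int, g: int):
    """Split x into g digits of k bits, least significant first."""
    mask = (1 << k) - 1
    return [(x >> (k * i)) & mask for i in range(g)]


def mont_mul_alg4(A: int, B: int, N: int, k: int, g: int) -> int:
    """Algorithm 4 sketch: A*B*R^-1 mod N using single x single products only."""
    r = 1 << k
    mask = r - 1
    a, b, n = to_digits(A, k, g), to_digits(B, k, g), to_digits(N, k, g)
    n0_prime = (-pow(n[0], -1, r)) % r      # n'0 = -n0^-1 mod r (assumed)
    y = [0] * (g + 1)                       # Y = (yg, ..., y0)r
    for j in range(g):
        # Pre-core processing (digit 0); the low digit of the sum is zero.
        t = y[0] + a[0] * b[j]
        tmp1, tmp2 = t & mask, t >> k
        m = (tmp1 * n0_prime) & mask
        tmp4 = (tmp1 + m * n[0]) >> k
        c = tmp2 + tmp4                     # (c1,c0)r
        # Core processing (digits 1 .. g-1).
        for i in range(1, g):
            t = y[i] + c + a[i] * b[j]      # (tmp3,tmp2,tmp1)r
            tmp1, high = t & mask, t >> k   # high = (tmp3,tmp2)r
            t = tmp1 + m * n[i]
            tmp4, y[i - 1] = t >> k, t & mask   # storing in y[i-1] performs Y/r
            c = tmp4 + high
        # Post-core processing.
        c = c + y[g]
        y[g - 1] = c & mask
        y[g] = c >> k
    Y = sum(d << (k * i) for i, d in enumerate(y))
    if Y >= N:                              # Correction processing
        Y -= N
    return Y


K, G = 8, 4                                 # 8-bit digits, 4 digits (assumed)
N = 0xF123456B                              # odd modulus below R = 2**32 (assumed)
A, B = 0x12345678, 0x0FEDCBA9
R_INV = pow(1 << (K * G), -1, N)
print(mont_mul_alg4(A, B, N, K, G) == A * B * R_INV % N)
```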
FIG. 1 is the configuration diagram of a modular multiplication calculation apparatus executing the core processing of the above described algorithm 4.
Note that, in the entirety of the following descriptions, “k” of a k-bit is the number of bits representing the value of a single precision, with a specific example of k being 32 in the case of a 32-bit processor.
The modular multiplication calculation apparatus shown in FIG. 1 comprises an α product-sum calculation circuit 10 and a β product-sum calculation circuit 11 both for performing internal multiplication and addition; an A register 12 for retaining numbers A (ag-1, ag-2, . . . , a0) as multipliers (or multiplicands); a B register 13 for retaining numbers B (bg-1, bg-2, . . . , b0) as multipliers (or multiplicands); an N register 14 for retaining the modulus N (ng-1, ng-2, . . . , n0); a Y register 15 for storing the lower k bits of the output of the β product-sum calculation circuit 11; an m register 16 for retaining a Montgomery parameter m; an addition circuit 17 as a carry calculation unit for adding the higher-order (k+1) bits of the output of the α product-sum calculation circuit 10 and the higher-order k bits of the output of the β product-sum calculation circuit 11; a C register 18 for storing the addition result of the addition circuit 17; and a selection circuit 19 for selecting an output by comparing the value of j with “0”.
Meanwhile, FIGS. 2A and 2B show the respective internal configurations of the α product-sum calculation circuit 10 and the β product-sum calculation circuit 11. The α product-sum calculation circuit 10 comprises a k-bit multiplier 20 for multiplying the output of the A register 12 and that of the B register 13; a 2k-bit adder 21 for adding the output of the k-bit multiplier 20 and that of the selection circuit 19 (i.e., the Y register 15); and a (2k+1)-bit adder 22 for adding the output from the 2k-bit adder 21 and that of the C register 18.
The β product-sum calculation circuit 11 comprises a k-bit multiplier 23 for multiplying the output of the N register 14 and that of the m register 16, and a 2k-bit adder 24 for adding the output of the k-bit multiplier 23 and that of the lower k bits of the α product-sum calculation circuit 10.
FIG. 3 is a description diagram showing the content of the core processing of the algorithm 4. In the α product-sum calculation circuit 10, the output ai (k bits) of the A register 12 and the output bj (k bits) of the B register 13 are multiplied, and the multiplication result (2k bits) is added to the output (k bits; 31) of the selection circuit 19 (i.e., the Y register 15) and the output (k+1 bits; 32) of the C register 18.
Note that the selection circuit 19 compares the value of j with “0” and, if the value of j is “0”, outputs “0” to the α product-sum calculation circuit 10, while, if the value of j is not “0”, outputs the stored value yi of the Y register 15 to the α product-sum calculation circuit 10.
The α product-sum calculation circuit 10 outputs the higher-order (k+1) bits (34) of the arithmetic operation result (2k+1 bits; 33) to the addition circuit 17 and outputs the lower k bits (35) of the arithmetic operation result to the β product-sum calculation circuit 11.
Within the β product-sum calculation circuit 11, the output ni (k bits) of the N register 14 and the output (k bits) of the m register 16 are multiplied, and the multiplication result (2k bits; 36) is added to the lower k-bit output of the α product-sum calculation circuit 10.
The β product-sum calculation circuit 11 outputs the higher-order k bits (38) of the arithmetic operation result (2k bits; 37) to the addition circuit 17 and outputs the lower k bits (39) to the Y register 15, which in turn stores the k-bit data as the value yi-1.
The addition circuit 17 adds the output (k+1 bits; 34) of the α product-sum calculation circuit 10 and the output (k bits; 38) of the β product-sum calculation circuit 11, and outputs the addition result (k+1 bits) to the C register 18, which in turn stores it.
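The bit widths stated above for the core processing can be checked with a short bound. The following is a sketch which assumes that every single precision value is at most 2^k−1 and that the C register holds at most 2^(k+1)−1 at the start of a core step; the last line shows that this assumption is preserved for the next step.

```latex
\[
\begin{aligned}
y_i + (c_1,c_0)_r + a_i b_j
  &\le (2^{k}-1) + (2^{k+1}-1) + (2^{k}-1)^{2}
   = 2^{2k} + 2^{k} - 1 < 2^{2k+1}, \\
\mathit{tmp1} + m\,n_i
  &\le (2^{k}-1) + (2^{k}-1)^{2}
   = 2^{2k} - 2^{k} < 2^{2k}, \\
\mathit{tmp4} + (\mathit{tmp3},\mathit{tmp2})_r
  &\le (2^{k}-1) + 2^{k}
   = 2^{k+1} - 1 .
\end{aligned}
\]
```

The first line corresponds to the (2k+1)-bit result of the α product-sum calculation circuit 10, the second to the 2k-bit result of the β product-sum calculation circuit 11, and the third to the (k+1)-bit output of the addition circuit 17.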
Next is a description of an algorithm 5 as a modified example of the algorithm 4.
[Algorithm 5]
Where two numbers A and B to be multiplied together, a parameter N, a Montgomery multiplication parameter n′0 and an output-use variable Y are each an r-adic number, and if they are represented by:
A=(ag-1, ag-2, . . . , a0)r,
B=(bg-1, bg-2, . . . , b0)r,
N=(ng-1, ng-2, . . . , n0)r,
Y=(yg, yg-1, . . . , y0)r,
n′0,
r=2^k,
R=r^g,
and where the temporary variables of one digit of the r-adic number are tmp1, tmp2 and tmp4, and the carry variables are c1 and c2, it is possible to obtain an ABR^−1 mod N as the calculation in a single precision times single precision through the repeated processing of i and j shown below:
Y=0
for j=0 to g−1
 .................................
 (tmp2,tmp1)r =y0 +a0 *bj
 m=tmp1*n′0 mod r
 (tmp4,tmp1)r =tmp1+m*n0             : Pre-core processing
 (c2 ,c1 )r =tmp2+tmp4
 for i=1 to g−1
 .................................
  (tmp2,tmp1)r =yi +c1 +ai *bj
  (tmp4,yi−1 )r =tmp1+m*ni           : Core processing
  (c2 ,c1 )r =tmp4+tmp2+c2
 next i
 .................................
 (c2 ,c1 )r =(c2 ,c1 )r +yg
 yg−1 =c1                            : Post-core processing
 yg =c2
next j
.................................
if Y≧N then Y=Y−N                    : Correction processing
if Y<N then return Y
Here, ( )r indicates that the single-digit r-adic variables in parentheses are handled together as a multiple precision number. Further, while the carry variable c2 is represented by one digit of the r-adic number, the content is actually a value of one bit.
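Algorithm 5 can be sketched in the same digit-by-digit manner. The point illustrated is that c1 (k bits) is added in the current step while the one-bit c2 is deferred to the next carry calculation. As before, n′0 is assumed to be −n0^−1 mod r, and the names and test values are hypothetical.

```python
def to_digits(x: int, k: int, g: int):
    """Split x into g digits of k bits, least significant first."""
    mask = (1 << k) - 1
    return [(x >> (k * i)) & mask for i in range(g)]


def mont_mul_alg5(A: int, B: int, N: int, k: int, g: int) -> int:
    """Algorithm 5 sketch: A*B*R^-1 mod N with carries c1 (k bits) and c2 (1 bit)."""
    r = 1 << k
    mask = r - 1
    a, b, n = to_digits(A, k, g), to_digits(B, k, g), to_digits(N, k, g)
    n0_prime = (-pow(n[0], -1, r)) % r      # n'0 = -n0^-1 mod r (assumed)
    y = [0] * (g + 1)
    for j in range(g):
        # Pre-core processing (digit 0).
        t = y[0] + a[0] * b[j]
        tmp1, tmp2 = t & mask, t >> k
        m = (tmp1 * n0_prime) & mask
        tmp4 = (tmp1 + m * n[0]) >> k
        s = tmp2 + tmp4
        c1, c2 = s & mask, s >> k
        # Core processing (digits 1 .. g-1).
        for i in range(1, g):
            t = y[i] + c1 + a[i] * b[j]     # fits in 2k bits
            tmp1, tmp2 = t & mask, t >> k
            t = tmp1 + m * n[i]
            tmp4, y[i - 1] = t >> k, t & mask
            s = tmp4 + tmp2 + c2            # deferred one-bit carry joins here
            c1, c2 = s & mask, s >> k
            assert c2 <= 1                  # c2 holds a single bit
        # Post-core processing.
        s = (c2 << k) + c1 + y[g]           # (c2,c1)r + yg
        y[g - 1] = s & mask
        y[g] = s >> k
    Y = sum(d << (k * i) for i, d in enumerate(y))
    if Y >= N:                              # Correction processing
        Y -= N
    return Y


K, G = 8, 4                                 # 8-bit digits, 4 digits (assumed)
N = 0xF123456B                              # odd modulus below R = 2**32 (assumed)
A, B = 0x12345678, 0x0FEDCBA9
R_INV = pow(1 << (K * G), -1, N)
print(mont_mul_alg5(A, B, N, K, G) == A * B * R_INV % N)
```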
FIG. 4 is the configuration diagram of a modular multiplication calculation apparatus carrying out the core processing of the above described algorithm 5.
The modular multiplication calculation apparatus shown in FIG. 4 comprises a first product-sum calculation circuit 10a and a second product-sum calculation circuit 11a each for internally performing a multiplication and an addition; an A register 12 as a first register for retaining numbers A (ag-1, ag-2, . . . , a0) as multipliers (or multiplicands); a B register 13 as a second register for retaining numbers B (bg-1, bg-2, . . . , b0) as multipliers (or multiplicands); a Y register 15 as a third register for retaining the previous lower k-bit output of the second product-sum calculation circuit 11a and storing a posterior lower k-bit output; a c1 register 18a as a fourth register for retaining a carry variable c1; an N register 14 as a fifth register for retaining modulo N (ng-1, ng-2, . . . , n0); an m register 16 as a sixth register for retaining the parameter m of a Montgomery's algorithm; a c2 register 18b as a seventh register for retaining a carry variable c2; an addition circuit 17a as a carry calculation unit for adding the higher-order k-bit output of the second product-sum calculation circuit 11a and the output of the c2 register 18b; and a selection circuit 19 for comparing the value of j with “0” and selecting the output.
Meanwhile, FIGS. 5A and 5B show the internal configurations of the first product-sum calculation circuit 10a and second product-sum calculation circuit 11a. 
The first product-sum calculation circuit 10a comprises a k-bit multiplier 20, a 2k-bit adder 21 and a 2k-bit adder 22a. The k-bit multiplier 20 multiplies the output of the A register 12 and that of the B register 13; the 2k-bit adder 21 adds the output of the k-bit multiplier 20 and that of the selection circuit 19 (i.e., the Y register 15); and the 2k-bit adder 22a adds the output of the 2k-bit adder 21 and that of the c1 register 18a. 
Note that the example configuration shown in FIG. 5A is such as to add the output of the selection circuit 19 (i.e., the Y register 15) to the multiplication result first, and then add the output of the c1 register 18a; it may be alternatively configured to add the output of the c1 register 18a and then that of the selection circuit 19 (i.e., the Y register 15).
The second product-sum calculation circuit 11a comprises a k-bit multiplier 23 for multiplying the output of the N register 14 and that of the m register 16; and a 2k-bit adder 24 for adding the output of the k-bit multiplier 23 and the lower k-bit output of the first product-sum calculation circuit 10a. 
FIG. 6 is a description diagram showing the content of the core processing of the algorithm 5.
Within the first product-sum calculation circuit 10a, the output ai (k bits) of the A register 12 and the output bj (k bits) of the B register 13 are multiplied and the multiplication result (2k bits; 30) is added to the output (k bits; 31) of the selection circuit 19 (i.e., the Y register 15) and the output (k bits; 32a) of the c1 register 18a. Note that the selection circuit 19 compares the value of j with “0” and, if the value of j is “0”, outputs “0” to the first product-sum calculation circuit 10a, while, if the value of j is not “0”, outputs the stored value yi of the Y register 15 to the first product-sum calculation circuit 10a which in turn outputs the higher-order k bits (34) of the arithmetic operation result (2k bits; 33) to the addition circuit 17a and outputs the lower k bits (35) thereof to the second product-sum calculation circuit 11a. 
Within the second product-sum calculation circuit 11a, the output ni (k bits) of the N register 14 and the output (k bits) of the m register 16 are multiplied and the multiplication result (2k bits; 36) is added to the output lower k bits (35) of the first product-sum calculation circuit 10a. The second product-sum calculation circuit 11a outputs the higher-order k bits (38) of the arithmetic operation result (2k bits; 37) to the addition circuit 17a and outputs the lower k bits (39) thereof to the Y register 15 which in turn stores the k-bit data as a value yi-1.
The addition circuit 17a adds the output (k bits; 34) of the first product-sum calculation circuit 10a, the output (k bits; 38) of the second product-sum calculation circuit 11a and the output (1 bit; 32b) of the c2 register 18b. It then respectively outputs, for use in the next cycle of arithmetic operations, the higher-order 1 bit of the addition result (k+1 bits) to the c2 register 18b and the lower k bits thereof to the c1 register 18a. The c1 register 18a and c2 register 18b respectively store them.
Next is a description of an algorithm 6 as further modification of the algorithm 5.
[Algorithm 6]
Where two numbers A and B to be multiplied together, a parameter N, a Montgomery multiplication parameter n′0 and an output-use variable Y are each an r-adic number, and if they are represented by:
A=(ag-1, ag-2, . . . , a0)r,
B=(bg-1, bg-2, . . . , b0)r,
N=(ng-1, ng-2, . . . , n0)r,
Y=(yg, yg-1, . . . , y0)r,
n′0,
r=2^k,
R=r^g,
and the temporary variable of one digit of the r-adic number is tmp1, and the carry variables are c3 and c4, the repeated processing of i and j shown below makes it possible to obtain an ABR^−1 mod N as the calculation in a single precision times single precision:
Y=0
for j=0 to g−1
 .................................
 (c3 ,tmp1)r =y0 +a0 *bj
 m=tmp1*n′0 mod r                    : Pre-core processing
 (c4 ,tmp1)r =tmp1+m*n0
 for i=1 to g−1
 .................................
  (c3 ,tmp1)r =yi +c3 +ai *bj
  (c4 ,yi−1 )r =tmp1+m*ni +c4        : Core processing
 next i
 .................................
 (c4 ,c3 )r =c3 +c4 +yg
 yg−1 =c3                            : Post-core processing
 yg =c4
next j
.................................
if Y≧N then Y=Y−N                    : Correction processing
if Y<N then return Y
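Algorithm 6 keeps two independent k-bit carries, c3 for the a*b product-sum and c4 for the m*n product-sum, which can be sketched as follows. The sketch assumes n′0 = −n0^−1 mod r; the names and test values are hypothetical.

```python
def to_digits(x: int, k: int, g: int):
    """Split x into g digits of k bits, least significant first."""
    mask = (1 << k) - 1
    return [(x >> (k * i)) & mask for i in range(g)]


def mont_mul_alg6(A: int, B: int, N: int, k: int, g: int) -> int:
    """Algorithm 6 sketch: A*B*R^-1 mod N with separate carries c3 and c4."""
    r = 1 << k
    mask = r - 1
    a, b, n = to_digits(A, k, g), to_digits(B, k, g), to_digits(N, k, g)
    n0_prime = (-pow(n[0], -1, r)) % r      # n'0 = -n0^-1 mod r (assumed)
    y = [0] * (g + 1)
    for j in range(g):
        # Pre-core processing (digit 0).
        t = y[0] + a[0] * b[j]
        c3, tmp1 = t >> k, t & mask
        m = (tmp1 * n0_prime) & mask
        c4 = (tmp1 + m * n[0]) >> k         # low digit of this sum is zero
        # Core processing (digits 1 .. g-1).
        for i in range(1, g):
            t = y[i] + c3 + a[i] * b[j]     # first product-sum, fits in 2k bits
            c3, tmp1 = t >> k, t & mask
            t = tmp1 + m * n[i] + c4        # second product-sum, fits in 2k bits
            c4, y[i - 1] = t >> k, t & mask
        # Post-core processing.
        s = c3 + c4 + y[g]
        y[g - 1] = s & mask
        y[g] = s >> k
    Y = sum(d << (k * i) for i, d in enumerate(y))
    if Y >= N:                              # Correction processing
        Y -= N
    return Y


K, G = 8, 4                                 # 8-bit digits, 4 digits (assumed)
N = 0xF123456B                              # odd modulus below R = 2**32 (assumed)
A, B = 0x12345678, 0x0FEDCBA9
R_INV = pow(1 << (K * G), -1, N)
print(mont_mul_alg6(A, B, N, K, G) == A * B * R_INV % N)
```

Because each product-sum absorbs its own carry, no separate carry addition step is needed between the two product-sums in this sketch.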
FIG. 7 is the configuration diagram of a modular multiplication calculation apparatus carrying out the core processing of the above described algorithm 6.
The modular multiplication calculation apparatus shown in FIG. 7 comprises a third product-sum calculation circuit 10b and a fourth product-sum calculation circuit 11b both for internally performing a multiplication and an addition; and, similarly to the configuration shown in FIG. 4, an A register 12 as a first register; a B register 13 as a second register; a Y register 15 as a third register; an N register 14 as a fifth register; an m register 16 as a sixth register; and a selection circuit 19; as well as a c3 register 18c as a fourth register for retaining a carry variable c3; and a c4 register 18d as a seventh register for retaining a carry variable c4.
Note that the internal configurations of the third product-sum calculation circuit 10b and fourth product-sum calculation circuit 11b are the same as the internal configuration of the first product-sum calculation circuit 10a, and each of the product-sum calculation circuits 10b and 11b is constituted by a k-bit multiplier 20, a 2k-bit adder 21 and a 2k-bit adder 22a. 
Within the third product-sum calculation circuit 10b, the k-bit multiplier 20 multiplies the output of the A register 12 and that of the B register 13; the 2k-bit adder 21 adds the output of the k-bit multiplier 20 and that of the selection circuit 19 (i.e., the Y register 15); and the 2k-bit adder 22a adds the output of the 2k-bit adder 21 and the output c3 of the c3 register 18c. 
Note that the configuration example of FIG. 5A is such as to add the multiplication result to the output of the selection circuit 19 (i.e., the Y register 15) first and then add the output of the c3 register 18c; it may alternatively be configured to add the output of the c3 register 18c first and then add that of the selection circuit 19 (i.e., the Y register 15).
In the meantime, within the fourth product-sum calculation circuit 11b, the k-bit multiplier 20 multiplies the output of the N register 14 and that of the m register 16; the 2k-bit adder 21 adds the output of the k-bit multiplier 20 and the lower k-bit output of the third product-sum calculation circuit 10b; and the 2k-bit adder 22a adds the output of the 2k-bit adder 21 and that of the c4 register 18d. 
Note that the configuration example of FIG. 5A is such as to add the multiplication result to the lower k-bit output of the third product-sum calculation circuit 10b first and then add the output of the c4 register 18d; it may alternatively be configured to add the output of the c4 register 18d first and then add the lower k-bit output of the third product-sum calculation circuit 10b. 
FIG. 8 is a description diagram showing the content of the core processing of the algorithm 6.
Within the third product-sum calculation circuit 10b, the output ai (k bits) of the A register 12 and the output bj (k bits) of the B register 13 are multiplied and the multiplication result (2k bits; 30) is added to the output (k bits; 31) of selection circuit 19 (i.e., the Y register 15) and the output (k bits; 32c) of the c3 register 18c. 
Note that the selection circuit 19 compares the value of j with “0” and, if the value of j is “0”, outputs “0” to the third product-sum calculation circuit 10b, while, if the value of j is not “0”, outputs the stored value yi of the Y register 15 to the third product-sum calculation circuit 10b. 
The third product-sum calculation circuit 10b outputs the higher-order k bits (34) of the arithmetic operation result (2k bits) to the c3 register 18c and the lower k bits (35) thereof to the fourth product-sum calculation circuit 11b. The c3 register 18c stores these k bits as a carry variable for use in the next cycle of arithmetic operations.
Within the fourth product-sum calculation circuit 11b, the output ni (k bits) of the N register 14 and the output m (k bits) of the m register 16 are multiplied and the multiplication result (2k bits; 36) is added to the lower k-bit output (35) of the third product-sum calculation circuit 10b and the k-bit output (32d) of the c4 register 18d. 
The fourth product-sum calculation circuit 11b outputs the higher-order k bits (38) of the arithmetic operation result (2k bits; 37) to the c4 register 18d and the lower k bits (39) thereof to the Y register 15. The c4 register 18d stores these k bits as a carry variable for use in the next cycle of arithmetic operations. The Y register 15 stores the k-bit data as the value y_{i-1}.
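The core processing described above can be sketched in software as follows. This is only an illustrative model, not the circuit itself: the function name core_step and its argument names are hypothetical, and Python integers stand in for the k-bit registers.

```python
def core_step(a_i, b_j, y_i, n_i, m, c3, c4, k=32):
    """Model one core-processing cycle.

    Returns (next c3, next c4, word to be stored as y_{i-1}).
    All inputs are assumed to be k-bit unsigned values.
    """
    mask = (1 << k) - 1

    # Third product-sum calculation circuit 10b: a_i * b_j + y_i + c3.
    s = a_i * b_j + y_i + c3
    c3_next, low = s >> k, s & mask      # higher k bits -> c3, lower k bits -> 11b

    # Fourth product-sum calculation circuit 11b: n_i * m + low + c4.
    s = n_i * m + low + c4
    c4_next, y_out = s >> k, s & mask    # higher k bits -> c4, lower k bits -> Y

    return c3_next, c4_next, y_out
```

Because each product-sum result is at most (2^k - 1)^2 + 2(2^k - 1) = 2^(2k) - 1, the higher-order part always fits in k bits, which is why a k-bit carry register suffices in this model.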
FIG. 9 is the outline diagram of a flow chart exemplifying the modular multiplication processing by using the Montgomery method.
In the flow chart, the j loop corresponds to the loop processing of the algorithm 3. Inside the j loop, the partial multiplications A*bj and m*n are carried out, each as a multiple precision times single precision multiplication. The i loop is the part that carries out these multiple precision times single precision calculations of A*bj and m*n by means of partial multiplications in single precision times single precision.
The following describes a specific example of the overall process flow in accordance with the algorithm 6, including the processing other than the core processing.
In the following example, it is assumed that the bit lengths of N, A, and B are each 1024 bits and that the unit of processing is k=32 bits, so that R=2^1024 and r=2^32.
[Pre-Core Processing]
FIG. 10 is a diagram exemplifying the configuration for carrying out pre-core processing.
A register 41 retains the parameter n′0 for use in a Montgomery calculation. A multiplication circuit 40 multiplies the output of a third product-sum calculation circuit 10b and that of a register 41. The pre-core processing initializes a c3 register 18c, a c4 register 18d and an m register 16a, which are used in the core processing. The third product-sum calculation circuit 10b first multiplies the respective inputs a0 and bj from an A register 12 and a B register 13, and then adds the multiplication result to the input y0 from a Y register 15.
Note that, if the same product-sum calculation circuit as in the core processing is used, the result is further added to “0”.
Then, the third product-sum calculation circuit 10b stores the higher-order 32 bits of the result in the c3 register 18c and outputs the lower 32 bits thereof to a fourth product-sum calculation circuit 11b and the multiplication circuit 40 used for calculating a parameter m.
The multiplication circuit 40 multiplies the output of the third product-sum calculation circuit 10b and the output n′0 of the register 41 and outputs the lower 32 bits of the multiplication result to the m register 16a. The fourth product-sum calculation circuit 11b multiplies the input n0 from an N register 14 and the value of the m register 16a and adds the multiplication result to the output of the third product-sum calculation circuit 10b. 
Note that, if the same product-sum calculation circuit as in the core processing is used, the result is further added to “0”.
Then the fourth product-sum calculation circuit 11b stores the higher-order 32 bits of the result in the c4 register 18d. The lower 32 bits are not used.
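The pre-core processing can be modeled as the following sketch. It assumes the usual Montgomery convention that n′0 is the negated inverse of n0 modulo 2^32, which the passage above does not restate. The function name pre_core and its argument names are hypothetical. Under that assumption the lower 32 bits produced by the fourth product-sum calculation circuit are always zero, which is consistent with their not being used.

```python
def pre_core(a0, bj, y0, n0, n0_inv, k=32):
    """Model the pre-core processing for one value of j.

    n0_inv stands for n'0 and is ASSUMED to equal -n0^(-1) mod 2^k.
    Returns (c3, c4, m, discarded_low_word).
    """
    mask = (1 << k) - 1

    # Third product-sum circuit 10b: a0 * bj + y0 (plus 0 if the core circuit is reused).
    s = a0 * bj + y0
    c3, low = s >> k, s & mask

    # Multiplication circuit 40: keep the lower k bits of low * n'0 as m.
    m = (low * n0_inv) & mask

    # Fourth product-sum circuit 11b: n0 * m + low (plus 0 if the core circuit is reused).
    s = n0 * m + low
    c4, discarded = s >> k, s & mask

    return c3, c4, m, discarded
```

For example, with an odd n0 one could obtain the assumed parameter as n0_inv = (-pow(n0, -1, 1 << 32)) % (1 << 32) (Python 3.8 or later) and check that the fourth returned value is 0.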
[Core Processing]
FIG. 11 is a diagram exemplifying the configuration for carrying out the core processing that is i loop internal processing.
A Y register 15 is a register used for retaining the previous processing result and outputting the present processing result. A selection circuit 19 is equivalent to the processing of Y=0 in the algorithm 3. A third product-sum calculation circuit 10b first multiplies the respective inputs ai and bj from an A register 12 and a B register 13, adds the multiplication result to the input yi from the Y register 15 and further adds the addition result to the value of a c3 register 18c. Then the third product-sum calculation circuit 10b stores the higher-order 32 bits of the addition result in the c3 register 18c and outputs the lower 32 bits thereof to a fourth product-sum calculation circuit 11b. 
The fourth product-sum calculation circuit 11b first multiplies the input ni from an N register 14 and the value of an m register 16a, adds the multiplication result to the output of the third product-sum calculation circuit 10b and further adds the addition result to the value of a c4 register 18d. Then, the fourth product-sum calculation circuit 11b stores the higher-order 32 bits of the addition result in the c4 register 18d and stores the lower 32 bits thereof in y_{i-1} of the Y register 15.
The processing of Y=Y/r is attained by storing the calculation result of the i-th iteration in y_{i-1}, that is, one word position lower.
[Post-Core Processing]
FIG. 12 is a diagram exemplifying the configuration for carrying out post-core processing.
An addition circuit 45 adds the output of a c3 register 18c, that of a c4 register 18d and that of a selection circuit 19. A selection circuit 46 compares the carry output from the addition circuit 45 with “0” and “1”, and outputs “0” to a Y register 15 if the carry output is “0”, or outputs “1” thereto if the output is “1”. The post-core processing carries out the processing of the values of the carry variables c3 and c4 after the core processing.
The processing inputs the values of the c3 register 18c and the c4 register 18d, together with the input y32 from the Y register 15, to the addition circuit 45; outputs the addition result to y31 of the Y register 15; converts the carry into a 32-bit value, 32 bits being the unit of processing; and outputs that value to y32 of the Y register 15. Here, although the value of y32 is handled as a 32-bit value at the Y register 15, it is actually a 1-bit value, and the addition result is therefore within the range of 32 bits plus a carry.
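The pre-core, core, and post-core processing described above can be combined into the following word-serial sketch. It is a software model under stated assumptions rather than the disclosed circuit: the name mont_mul is hypothetical, n′0 is assumed to be the negated inverse of n0 modulo r, the Y words are initialized to zero in place of the selection circuit 19 outputting “0” when j is “0”, and a final conditional subtraction of N is added as the customary last step, which the passage above does not describe.

```python
def mont_mul(a, b, n, k=32, g=32):
    """Return a * b * R^(-1) mod n with R = 2^(k*g), for odd n and a, b < n < R."""
    r = 1 << k
    mask = r - 1
    n0_inv = (-pow(n, -1, r)) % r        # n'0, ASSUMED to be -n^(-1) mod r

    def word(x, i):
        """k-bit word i of the multiple-precision value x."""
        return (x >> (k * i)) & mask

    y = [0] * (g + 1)                    # y[g] models the 1-bit word y32
    for j in range(g):
        bj = word(b, j)

        # Pre-core processing: initialize c3, m and c4.
        s = word(a, 0) * bj + y[0]
        c3, low = s >> k, s & mask
        m = (low * n0_inv) & mask
        c4 = (word(n, 0) * m + low) >> k   # lower word is not used

        # Core processing (i loop): the result of turn i is stored at i-1.
        for i in range(1, g):
            s = word(a, i) * bj + y[i] + c3
            c3, low = s >> k, s & mask
            s = word(n, i) * m + low + c4
            c4, y[i - 1] = s >> k, s & mask

        # Post-core processing: fold c3, c4 and the top word together.
        s = c3 + c4 + y[g]
        y[g - 1], y[g] = s & mask, s >> k

    result = sum(w << (k * i) for i, w in enumerate(y))
    return result - n if result >= n else result   # assumed final correction
```

With the default k=32 and g=32 this corresponds to the 1024-bit example, and the result can be cross-checked against a * b * pow(1 << 1024, -1, n) % n computed directly.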
[Configuration of a Product-Sum Calculation Circuit]
FIG. 13 is a diagram exemplifying the configuration of the product-sum calculation circuit used in the configuration example described above.
Here, the configuration is such that every unit of processing is 32 bits.
The product-sum calculation circuit comprises one 32-bit multiplier 51 and four 32-bit adders 52, 53, 54 and 55.
The input values of A and B are multiplied by the 32-bit multiplier 51, which outputs the result in two parts, i.e., the higher-order 32 bits and the lower 32 bits. The 32-bit adder 53 adds the lower 32-bit output of the 32-bit multiplier 51 to the value of an input R and outputs the 32-bit addition result to the 32-bit adder 55 and the carry to the 32-bit adder 52. The 32-bit adder 52 adds the higher-order 32-bit output of the 32-bit multiplier 51 to the carry output of the 32-bit adder 53 and outputs the 32-bit addition result to the 32-bit adder 54. It is theoretically proven that the aforementioned addition does not generate a carry. The 32-bit adder 55 adds the output of the 32-bit adder 53 to the value of an input C, outputs the 32-bit addition result as the L output (i.e., the lower 32 bits) of the product-sum calculation circuit, and outputs the carry to the 32-bit adder 54. The 32-bit adder 54 adds the output of the 32-bit adder 52 to the carry output of the 32-bit adder 55 and outputs the 32-bit addition result as the H output (i.e., the higher-order 32 bits) of the product-sum calculation circuit. It is theoretically proven that the aforementioned addition does not generate a carry.
All of the algorithms and circuits described above are disclosed in patent document 1.
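The data path through the multiplier 51 and the adders 52 to 55 can be modeled as in the following sketch. The helper add32 and the function name product_sum are hypothetical, and the assertions merely check, in this model, the statement that the adders 52 and 54 never generate a carry. The reason is that the largest possible value is (2^32 - 1)^2 + 2(2^32 - 1) = 2^64 - 1, which still fits in 64 bits.

```python
MASK = 0xFFFFFFFF


def add32(x, y):
    """Model a 32-bit adder: return (32-bit sum, carry out)."""
    s = x + y
    return s & MASK, s >> 32


def product_sum(a, b, r_in, c):
    """Model the four-input two-output circuit: (H, L) of a * b + r_in + c."""
    p = a * b                                # 32-bit multiplier 51
    p_high, p_low = p >> 32, p & MASK

    s53, carry53 = add32(p_low, r_in)        # adder 53: lower product word + R
    s52, carry52 = add32(p_high, carry53)    # adder 52: higher product word + carry
    low, carry55 = add32(s53, c)             # adder 55: + C, gives the L output
    high, carry54 = add32(s52, carry55)      # adder 54: gives the H output

    # Neither of the higher-order additions overflows 32 bits.
    assert carry52 == 0 and carry54 == 0
    return high, low
```

Even with all four inputs at their maximum value 2^32 - 1, the model returns H = L = 2^32 - 1 without tripping the assertions.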
The product-sum calculation circuit used in the conventional modular multiplication calculation apparatus described above is a product-sum calculation circuit with four inputs and two outputs as shown in FIG. 13. The product-sum calculation circuit performs the multiplication of two inputs and the addition of the multiplication result and the two remaining inputs. A common processor is not equipped with a four-input two-output product-sum calculation circuit and hence there is a problem that the two product-sum calculation circuits shown in FIG. 12 cannot be simplified. Further, the Pentium (registered trademark) processor is usually equipped only with a two-input two-output multiplier. Nonetheless, performance degradation has not created a substantial problem in the case of implementing a function similar to the above described four-input two-output product-sum calculation circuit by using the two-input two-output multiplier and an adder. Recent processors, represented by the Itanium2 processor, however, are equipped with a three-input two-output product-sum calculation circuit. In particular, the Itanium2 processor is equipped with two three-input two-output product-sum calculation circuits. The algorithm using the conventional four-input two-output product-sum arithmetic operation described above, however, cannot bring out the full capability of such a processor equipped with three-input two-output product-sum calculation circuits.
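The mismatch described above can be illustrated with the following sketch, which models a three-input two-output product-sum operation and shows the extra addition and carry handling needed to obtain the four-input result from it. The names mul_add3 and mul_add4_emulated are hypothetical and do not correspond to any particular processor instruction. The 32-bit word size is likewise only an assumption carried over from the example above.

```python
MASK = 0xFFFFFFFF


def mul_add3(a, b, c):
    """Hypothetical three-input two-output primitive: (H, L) of a * b + c."""
    s = a * b + c
    return s >> 32, s & MASK


def mul_add4_emulated(a, b, r_in, c):
    """Emulate the four-input operation a * b + r_in + c on top of mul_add3."""
    high, low = mul_add3(a, b, r_in)     # one product-sum operation

    # The fourth input needs a separate addition plus carry propagation,
    # which is work the three-input primitive cannot absorb.
    s = low + c
    low, carry = s & MASK, s >> 32
    return high + carry, low
```

In this model every four-input operation costs one product-sum operation plus an additional add-with-carry step, which hints at why an algorithm built around the four-input form does not map directly onto three-input hardware.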
Patent document 1: Laid-Open Japanese Patent Application Publication No. H11-212456