1. Field of the Invention
The present invention relates to a modular multiplication apparatus, a modular multiplication method, and a modular exponentiation apparatus that can perform calculations on large integers at high speed by pipeline processing.
2. Description of the Related Art
As a method of implementing “modular exponentiation” for use in an RSA public key cryptosystem, there is known a method of repeatedly executing modular multiplication by means of Montgomery multiplication, which was proposed by Montgomery. For example, reference should be made to a first document, “Modular multiplication without trial division” by P. L. Montgomery, Mathematics of Computation, Vol. 44, No. 170, pp. 519-521, April 1985.
Montgomery multiplication is a technique that replaces the residue (division) operation in modular multiplication with additions and multiplications, which makes efficient implementation of modular multiplication possible. In Montgomery multiplication, xy+Nt is calculated instead of xy mod N. Here, x, y, N, and t are large integers.
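By way of illustration only, the calculation of xy+Nt can be sketched in Python as follows. This is a minimal sketch, not the apparatus of any cited document. It assumes that N is odd, that x and y are smaller than N, and that R = 2^r_bits is greater than N. The names `montgomery_multiply`, `montgomery_pow`, `r_bits`, and `N_prime` are hypothetical and do not appear in the original text. The sketch relies on the three-argument `pow` with a negative exponent (Python 3.8 or later) to obtain a modular inverse.

```python
def montgomery_multiply(x, y, N, r_bits):
    """Return x*y*R^-1 mod N, where R = 2**r_bits, N is odd and N < R."""
    mask = (1 << r_bits) - 1
    # N_prime = -N^-1 mod R; in practice this is precomputed once per modulus.
    N_prime = (-pow(N, -1, 1 << r_bits)) & mask
    xy = x * y
    # t is chosen so that xy + N*t is an exact multiple of R.
    t = ((xy & mask) * N_prime) & mask
    s = (xy + N * t) >> r_bits      # division by R is only a shift
    return s - N if s >= N else s   # one conditional subtraction suffices


def montgomery_pow(base, exponent, N, r_bits):
    """Modular exponentiation by repeated Montgomery multiplication."""
    R = 1 << r_bits
    r2 = (R * R) % N                                   # constant for conversion
    b = montgomery_multiply(base % N, r2, N, r_bits)   # base in Montgomery form
    acc = R % N                                        # 1 in Montgomery form
    for bit in bin(exponent)[2:]:                      # left-to-right binary method
        acc = montgomery_multiply(acc, acc, N, r_bits)
        if bit == "1":
            acc = montgomery_multiply(acc, b, N, r_bits)
    return montgomery_multiply(acc, 1, N, r_bits)      # leave Montgomery form
```

For example, `montgomery_pow(5, 117, 1000003, 20)` gives the same value as the built-in `pow(5, 117, 1000003)`. The sketch uses no trial division by N inside the multiplication loop; the remainder operations appear only in the one-time conversion constants.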
Conventional high-speed implementation techniques for Montgomery multiplication can be roughly divided into systems in which integers are expressed in binary and systems in which integers are expressed in an RNS (Residue Number System).
In the system that uses binary expression, x, y, N, and t are each divided into words. For example, x is divided into m words x(0), . . . , x(m−1) of r bits each, starting from the least significant bit. The same applies to y, N, and t. Under this condition, Montgomery multiplication is carried out by pipelining the word-by-word multiplications across a plurality of processing units. That is, the multiplications x(i)y(j) and N(i)t(j) and their summation are assigned to u units, taking digit alignment and digit round-up (carry) into account. Here, an important point in the Montgomery multiplication algorithm is that t(j) is a variable determined by the result of the calculation in progress, so its implementation requires some contrivance. Reference should be made to a second document, “Montgomery's Multiplication Technique: How to Make It Smaller and Faster” by C. D. Walter, Proc. Workshop on Cryptographic Hardware and Embedded Systems, CHES '99, LNCS, Vol. 1717, pp. 80-93, Springer-Verlag, 1999; a third document, “Montgomery's Modular-Multiplication Method and Systolic-Arrays Suitable for Modular-Exponentiation” by Keiichi Iwamura et al., IEICE Transactions, Fundamentals (Japanese edition), Vol. J76-A, No. 8, pp. 1214-1223, 1993; and a fourth document, Japanese Patent Publication (KOKAI) No. 5-324277.
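The dependency of t(j) on the calculation in progress can be seen in the following word-serial sketch in Python. It is an illustrative software model under stated assumptions, not the pipelined hardware of the second to fourth documents: N is assumed odd, x is assumed smaller than N, and y is assumed to fit in m words of r bits. The function name `montgomery_multiply_words` and the variable `n0_prime` are hypothetical. Each multi-word product in the loop stands for the m word products x(i)y(j) or N(i)t(j) that a hardware implementation would distribute over processing units.

```python
def montgomery_multiply_words(x, y, N, r=32, m=None):
    """Return x*y*2^(-r*m) mod N, processing y one r-bit word at a time."""
    if m is None:
        m = -(-N.bit_length() // r)          # word count of N (ceiling division)
    mask = (1 << r) - 1
    # n0_prime = -N(0)^-1 mod 2^r depends only on the lowest word of N.
    n0_prime = (-pow(N & mask, -1, 1 << r)) & mask
    s = 0
    for j in range(m):
        y_j = (y >> (r * j)) & mask           # word y(j)
        s += x * y_j                          # all x(i)*y(j) for i = 0..m-1
        # t(j) cannot be known before the lowest word of the running sum is.
        t_j = ((s & mask) * n0_prime) & mask
        s = (s + N * t_j) >> r                # all N(i)*t(j); lowest word is now zero
    return s - N if s >= N else s
```

Because `t_j` is computed from the lowest word of the running sum `s`, the products N(i)t(j) of step j cannot start until that word is available. This is the dependency that the pipelined implementations discussed in this section must accommodate.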
The second document discloses that each unit has two multipliers so that a unit “i” can calculate x(i)y(j)+N(i)t(j) in a single clock. However, because of a trade-off with the calculation of t(j), a standby state occurs every two clocks, and redundant multiplication is also required to calculate t(j). As a result, a delay is imposed on the pipelined calculation (the critical path is lengthened).
The fourth document discloses that, as in the second document, each unit has two multipliers so that a unit “i” can calculate x(j)y(i+1)+N(j+1)t(i) in a single clock. The fourth document provides an algorithm in which the standby state of the second document does not occur. Instead, each unit comprises a third multiplier for calculating t(i). Each unit needs this calculation of t(i) in only one of all the clocks required for Montgomery multiplication (the number of clocks is of the same order as the word length). Thus, the resulting hardware configuration does not optimize throughput per circuit scale (the number of multipliers). Further, as in the second document, redundant multiplication is required to calculate t(i). As a result, a delay is imposed on the pipelined calculation.
Further, in the second and third documents, the calculation is carried out by using a number of processing units equal to or greater than the word count of the modulus N in order to execute Montgomery multiplication at high speed. In general, in the RSA cryptosystem, a large integer such as 1024 bits is used as the size of the modulus N, and a value such as 32 bits is used as the single word length r, which gives a word count of 32. Therefore, when the ideas of the second and third documents are implemented, there is a problem that a large number of processing units are required and the hardware scale is enlarged accordingly. Realistically, there can be a demand to reduce the hardware scale at the cost of lowering the throughput to a certain extent. Namely, Montgomery multiplication must be carried out with a number of processing units equal to or smaller than the word count. This can be accomplished by assigning the calculation of a plurality of words to a single processing unit. However, there are a number of dependencies between words, such as digit round-up and the integration described later. Therefore, there is a problem that efficiency is lowered in proportion to, or by more than, the simple reduction in the number of processing units.
With respect to the system in which Montgomery multiplication is carried out with integers expressed by RNS, there is provided a system in which the calculation for each base is carried out in parallel by a plurality of processing units. Reference should be made to a fifth document, “Cox-Rower Architecture for Fast Parallel Montgomery Multiplication” by S. Kawamura, M. Koike, F. Sano and A. Shimbo, Advances in Cryptology-Eurocrypt 2000, LNCS 1807, pp. 523-538, 2000.
In the system described in the fifth document, the calculation for each base is distributed among a plurality of processing units. The dependency between the calculations for different bases is low, which provides the advantage that, even when the number of processing units is reduced to be equal to or smaller than the number of bases, efficiency is lowered only in proportion to the reduction. However, there is a disadvantage that a ROM must be provided in each processing unit, which enlarges the circuit scale.
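The low dependency between bases can be illustrated with the following Python sketch of plain RNS arithmetic. It shows only that multiplication proceeds independently for each base. It does not reproduce the Montgomery reduction, the base extension, or the architecture of the fifth document. The moduli in `BASES` and the function names are hypothetical choices made for illustration, and `math.prod` requires Python 3.8 or later.

```python
from math import prod

# Hypothetical pairwise coprime bases: 251 (prime), 253 = 11*23, 255 = 3*5*17, 256 = 2^8.
BASES = (251, 253, 255, 256)


def to_rns(value, bases=BASES):
    """Represent an integer by its residue with respect to each base."""
    return tuple(value % b for b in bases)


def rns_multiply(a, b, bases=BASES):
    """Multiply two RNS values; each base is handled with no carry between bases."""
    return tuple((a_i * b_i) % m_i for a_i, b_i, m_i in zip(a, b, bases))


def from_rns(residues, bases=BASES):
    """Recover the integer (mod the product of the bases) by the Chinese remainder theorem."""
    M = prod(bases)
    total = 0
    for r_i, m_i in zip(residues, bases):
        M_i = M // m_i
        total += r_i * M_i * pow(M_i, -1, m_i)
    return total % M
```

Each element of the tuple returned by `rns_multiply` could be produced by a separate processing unit, or several elements could be assigned to one unit in turn, since no element needs a value from another. The result is exact as long as the true product is smaller than the product of the bases.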
As described above, in Montgomery multiplication, “t” is calculated by using the result of the calculation in progress, and it is desirable that this calculation of “t” be executed without imposing a delay on the pipelined calculation. However, the above described conventional techniques provide no such algorithm or hardware configuration. Therefore, there has been a problem that a redundant delay occurs for the calculation of “t.”
In the second document, a standby state occurs every two clocks in the pipelined calculation for the sake of calculating “t.” Further, in the fourth document, a redundant multiplier for calculating “t” is required in every unit. Thus, the conventional algorithms and hardware configurations do not optimize throughput per circuit scale.
In addition, in a system using binary expression, Montgomery multiplication is implemented by an apparatus comprising a number of processing units equal to or greater than the word count of N. In actual implementation, it is considered difficult to provide that many processing units. Thus, there is a demand to reduce the hardware scale at the cost of lowering the process speed to a certain extent. However, when an attempt is made to carry out Montgomery multiplication by simply reducing the number of processing units, there is a problem that efficiency is lowered in proportion to, or by more than, the reduction in the number of processing units.
In contrast, in a system using RNS expression, the dependency between the calculations for different bases is low. Thus, even when the number of processing units is reduced to be equal to or smaller than the number of bases, efficiency is lowered only in proportion to the reduction. However, there is a disadvantage that the circuit scale per processing unit is enlarged.