The present invention relates to an arithmetic circuit, and more particularly to a circuit for increasing the speed for modular multiplication in a public-key cryptosystem.
Public-key cryptography (asymmetrical cryptography) is generally used to maintain the confidentiality and guarantee the digital authenticity of information being transmitted. Public-key cryptography is a system for transmitting information and a cryptographic method for transmitting information using a pair of keys consisting of a public key and a secret key. A sender uses the public key of a recipient to encrypt a text, and when the cipher text is decrypted, a secret key known only to the recipient is used. According to the public-key cryptography, since (a) unlike the common-key cryptography (symmetrical cryptography) there is no need for correspondents to share and employ a single common key and (b) widespread disclosure of a public key involves no appreciable risk, the maintenance of secrecy while communicating with an unlimited number of persons is possible. Further, when public-key cryptography is employed for digital authentication or to prepare a digital signature, the credibility and trustworthiness of a person with whom one is not acquainted can be established. Therefore, it can readily be asserted that public-key cryptography is a requisite technique for a network supported by a communication system, such as the Internet, and for business transactions that are entered into in such a network.
RSA is the most popular public-key cryptosystem. The safety afforded by RSA is based on a discrete logarithm problem for a very large integer, or on the difficulty encountered in factoring a large integer into primes. For example, plaintext M is encrypted into ciphertext C by using a public key (e,n) in accordance with the relational equation C=M^e (mod n) (M is formed as a block that is smaller than the integer n). To decrypt the ciphertext C without the secret key, a discrete logarithm problem (given a, y and p, find x that satisfies y=a^x (mod p)) must be solved, which requires an amount of calculation on the order of O(2^SQRT(log n)) (SQRT is a function for providing a square root). When the integer n has a length that is equal to or greater than 512 bits, and preferably equal to or greater than 1024 bits, code breaking within a practical time is difficult.
However, when a secret key (d,n) is employed that has the following relationship with the public key (e,n), ed (mod lcm(p-1,q-1))=1, n=pq (wherein p and q are sufficiently large prime numbers), the plaintext M can be easily obtained by using the relational equation M=C^d (mod n) (wherein lcm(a,b) provides the least common multiple of a and b).
The modular exponentiation is performed by using a binary representation of the exponent and repeating the modular-squaring operation and the modular-multiplication operation, so that the number of modular-multiplication operations required is at most twice the bit length of the exponent.
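As an illustration of the binary method, the following Python sketch (not part of the original disclosure) scans the exponent bits from the most significant bit, squaring for every bit and multiplying when the bit is 1. The function name mod_exp, the multiplication counter, and the toy values p=61, q=53, e=17 and M=65 are assumptions chosen only for this example; real keys are far larger.

```python
from math import gcd

def mod_exp(base, exponent, n):
    """Left-to-right binary exponentiation.

    Returns (base**exponent mod n, number of modular multiplications).
    Assumes exponent > 0 and n > 1.
    """
    result = 1
    mults = 0
    for bit in bin(exponent)[2:]:          # exponent bits, most significant first
        result = (result * result) % n     # modular squaring for every bit
        mults += 1
        if bit == "1":
            result = (result * base) % n   # modular multiplication for a 1 bit
            mults += 1
    return result, mults

# Toy RSA round trip (illustrative values only).
p, q, e = 61, 53, 17
n = p * q
l = (p - 1) * (q - 1) // gcd(p - 1, q - 1)   # lcm(p-1, q-1)
d = pow(e, -1, l)                            # e*d = 1 (mod lcm(p-1, q-1)); needs Python 3.8+
C, _ = mod_exp(65, e, n)                     # encrypt plaintext M = 65
M, count = mod_exp(C, d, n)                  # decrypt with the secret exponent
```

The counter makes the bound visible: one squaring per exponent bit plus one multiplication per 1 bit, which never exceeds twice the bit length of the exponent.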
However, even the above described modular-exponentiation operation requires more calculations than are required for symmetrical cryptography, such as DES (Data Encryption Standard). Therefore, the preparation and use of as efficient an algorithm as possible is demanded.
The Montgomery multiplication method is a method for increasing the speed of the modular-squaring operation and the modular-multiplication operation in the above modular-exponentiation operation. The Montgomery multiplication method, as described in "Modular Multiplication Without Trial Division", by Peter L. Montgomery, Mathematics of Computation, Vol. 44, No. 170, April 1985, pp. 519-522, is a method whereby addition, multiplication and a shift operation are repeated to perform the modular-multiplication operation while requiring fewer calculations than are required by division for which subtraction is repetitively performed. The essential calculation portion P≡XYR^(-1) (mod n) of the Montgomery multiplication is shown below using pseudo code 1.x. It should be noted that in P≡XYR^(-1) (mod n), R=(2^r)^m and N≡-n^(-1) (mod 2^r). Further, it should be noted that a line number is added to the left of each line in the pseudo code (this provision is hereinafter applied).
(1.1) P=0;
(1.2) for (i=0; i<m; i++){
(1.3) t=(p_0+x_i·y_0)·N (mod 2^r);
(1.4) P=(P+x_i·Y+t·n)/2^r;
(1.5)}
(1.6) if (P≥n) P=P-n;
As is shown in the pseudo code 1.x, the repetitive calculation of the essential portion is performed as follows. First, X is divided into m blocks x_i (X=(x_(m-1), x_(m-2), . . . , x_1, x_0)), and the addition of a partial product x_i·Y to P is repeated m times (line numbers 1.2 to 1.5). At this time, a product "t·n" is added each time to make p_0 equal to 0, where p_0 is the lowest block of the intermediate result P (line number 1.4). In this case, t is defined in line 1.3. Further, P is shifted to the right r bits, i.e., is multiplied by 2^(-r) (line number 1.4). It should be noted that 2^(-rm)=R^(-1) is obtained by performing the r-bit shift operation m times.
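The loop of pseudo code 1.x can be sketched in Python as follows. This is an illustrative model using arbitrary-precision integers, not the circuit of the invention. The function name montgomery_1x is hypothetical, and the sketch assumes that n is odd and that X, Y < n < 2^(rm).

```python
def montgomery_1x(X, Y, n, r, m):
    """Model of pseudo code 1.x: returns X*Y*R^(-1) mod n with R = (2^r)^m.

    Assumes n is odd and X, Y < n < 2^(r*m).
    """
    base = 1 << r
    N = (-pow(n, -1, base)) % base            # N = -n^(-1) mod 2^r (Python 3.8+)
    P = 0                                     # line 1.1
    for i in range(m):                        # line 1.2
        x_i = (X >> (r * i)) % base           # i-th r-bit block of X
        p_0 = P % base                        # lowest block of P
        y_0 = Y % base                        # lowest block of Y
        t = ((p_0 + x_i * y_0) * N) % base    # line 1.3
        P = (P + x_i * Y + t * n) >> r        # line 1.4: low r bits are zero, so the shift is exact
    if P >= n:                                # line 1.6
        P -= n
    return P
```

Because t is chosen so that the lowest block of P + x_i·Y + t·n vanishes, the right shift in line 1.4 discards no information, which is the reason no trial division is needed.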
Assuming that a 32-bit multiplier is used to perform the Montgomery multiplication of 512 bits, a loop is repeated 512/32=16 times. In the above pseudo code, a multiplication of 32 bits × 512 bits, such as x_i·Y or t·n, is shown for simplification; actually, however, Y and n, of 512 bits each, are divided into 32-bit blocks for calculation. That is, the partial-product addition to P is calculated as a double loop for which m=16. An example process for performing the Montgomery multiplication using a double loop is shown below using pseudo code 2.x.
(2.1) P=0;
(2.2) for (i=0; i<m; i++){
(2.3) t=p_0+x_i·y_0 (mod 2^r);
(2.4) t=t·N (mod 2^r);
(2.5) c=0;
(2.6) for (j=0; j<m; j++){
(2.7) tmp=p_j+x_i·y_j+c;
(2.8) tmp=tmp+t·n_j;
(2.9) if (j!=0) p_(j-1)=tmp (mod 2^r);
(2.10) c=tmp/2^r;
(2.11)}
(2.12) p_(m-1)=c;
(2.13)}
(2.14) if (P≥n) P=P-n;
In this case, X, Y and n are divided into m blocks, i.e.,
X=(x_(m-1), x_(m-2), . . . , x_1, x_0)
Y=(y_(m-1), y_(m-2), . . . , y_1, y_0)
n=(n_(m-1), n_(m-2), . . . , n_1, n_0)
Assuming one multiplier is employed, two product additions are required for the calculation of the intermediate result tmp. Variables p_j, x_i, y_j, t and n_j are r-bit length numbers, and variable c is a carry from a lower block. In the above pseudo code 2.x, in one iteration of the j-loop, the addition of the 2r-bit length numbers x_i·y_j and t·n_j and the addition of the r+1-bit length number p_j and c are performed (line numbers 2.6 to 2.11), so that following the product addition the intermediate result tmp is a 2r+1-bit length number. The lower r bits of tmp are stored as variable p_(j-1), and the upper r+1 bits are stored as the variable c (line numbers 2.9 and 2.10).
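A block-level Python model of pseudo code 2.x is sketched below for illustration. The names to_blocks and montgomery_2x are hypothetical, n is assumed to be odd with X, Y < n < 2^(rm), and because Python integers are unbounded the model does not reproduce register-width limits: the carry written to the top block may exceed r bits without being truncated.

```python
def to_blocks(v, r, m):
    """Split v into m blocks of r bits, least significant block first."""
    base = 1 << r
    return [(v >> (r * k)) % base for k in range(m)]

def montgomery_2x(X, Y, n, r, m):
    """Model of pseudo code 2.x (single j-loop with two product additions)."""
    base = 1 << r
    N = (-pow(n, -1, base)) % base                # N = -n^(-1) mod 2^r
    x, y, nb = to_blocks(X, r, m), to_blocks(Y, r, m), to_blocks(n, r, m)
    p = [0] * m                                   # line 2.1
    for i in range(m):                            # line 2.2
        t = (p[0] + x[i] * y[0]) % base           # line 2.3
        t = (t * N) % base                        # line 2.4
        c = 0                                     # line 2.5
        for j in range(m):                        # line 2.6
            tmp = p[j] + x[i] * y[j] + c          # line 2.7
            tmp = tmp + t * nb[j]                 # line 2.8
            if j != 0:
                p[j - 1] = tmp % base             # line 2.9: store one block lower (the r-bit shift)
            c = tmp // base                       # line 2.10
        p[m - 1] = c                              # line 2.12
    P = sum(p[k] << (r * k) for k in range(m))    # reassemble the blocks
    if P >= n:                                    # line 2.14
        P -= n
    return P
```

Storing the low half of tmp at index j-1 rather than j is what performs the division by 2^r; the block produced at j=0 is always zero and is simply dropped.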
In contrast, the addition of x_i·y_j and t·n_j can be performed as two separate loops, as is shown in the following example, using pseudo code 3.x.
(3.1) P=0;
(3.2) for (i=0; i<m; i++){
(3.3) c=0;
(3.4) for (j=0; j<m; j++){
(3.5) tmp=p_j+x_i·y_j+c;
(3.6) p_j=tmp (mod 2^r);
(3.7) c=tmp/2^r;
(3.8)};
(3.9) p_m=c; c=0;
(3.10) t=p_0·N (mod 2^r);
(3.11) for (j=0; j<m; j++){
(3.12) tmp=p_j+t·n_j+c;
(3.13) if (j!=0) p_(j-1)=tmp (mod 2^r);
(3.14) c=tmp/2^r;
(3.15)};
(3.16) p_(m-1)=p_m+c;
(3.17)};
(3.18) if (P≥n) P=P-n;
In the pseudo code 3.x example, the variable p_j has an r-bit length, and the variable tmp has a 2r-bit length.
In either case, for both pseudo code 2.x and 3.x, the result P, obtained following the double loop used for the Montgomery multiplication, is smaller than 2n. However, since there is no guarantee that the result P will be smaller than n, "P=P-n" is required, as needed (line numbers 2.14 and 3.18).
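For comparison, pseudo code 3.x can be modeled in Python in the same block-level style. The names to_blocks and montgomery_3x are hypothetical, n is assumed to be odd with X, Y < n < 2^(rm), and the model uses unbounded Python integers, so it illustrates the data flow of the two j-loops rather than the exact register widths.

```python
def to_blocks(v, r, m):
    """Split v into m blocks of r bits, least significant block first."""
    base = 1 << r
    return [(v >> (r * k)) % base for k in range(m)]

def montgomery_3x(X, Y, n, r, m):
    """Model of pseudo code 3.x (two separate j-loops per i iteration)."""
    base = 1 << r
    N = (-pow(n, -1, base)) % base                # N = -n^(-1) mod 2^r
    x, y, nb = to_blocks(X, r, m), to_blocks(Y, r, m), to_blocks(n, r, m)
    p = [0] * (m + 1)                             # line 3.1; p[m] holds the extra carry block
    for i in range(m):                            # line 3.2
        c = 0                                     # line 3.3
        for j in range(m):                        # first j-loop: P = P + x_i*Y
            tmp = p[j] + x[i] * y[j] + c          # line 3.5
            p[j] = tmp % base                     # line 3.6
            c = tmp // base                       # line 3.7
        p[m] = c                                  # line 3.9
        c = 0
        t = (p[0] * N) % base                     # line 3.10
        for j in range(m):                        # second j-loop: P = (P + t*n) / 2^r
            tmp = p[j] + t * nb[j] + c            # line 3.12
            if j != 0:
                p[j - 1] = tmp % base             # line 3.13
            c = tmp // base                       # line 3.14
        p[m - 1] = p[m] + c                       # line 3.16
    P = sum(p[k] << (r * k) for k in range(m))    # reassemble the blocks
    if P >= n:                                    # line 3.18
        P -= n
    return P
```

Each p[j] written by the first j-loop is read again by the second j-loop, which is the extra memory traffic that distinguishes this variant from the single-loop form.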
In the examples for pseudo codes 2.x and 3.x, basically, p_j is a positive value, stored in an r-bit register, that is obtained by dividing the result P into m blocks. However, in the pseudo code 2.x example, c in "p_(m-1)=c" on line 2.12 has a maximum of r+1 bits. At this time, when the number of operation bits is divisible by r, there is a one-bit overflow. Thus, either a special process must be performed for the most significant bit so that the block count remains m, or the loop iteration count must be incremented to m+1. In this specification, to simplify the explanation, it is assumed that in both the above two code examples the j-loop iteration count is m. It should be noted that, in the comparison calculation "if (P≥n)" on lines 2.14 and 3.18 of the two examples, a sign bit must be taken into account.
The Montgomery multiplication can be performed in the manner shown in the pseudo codes 2.x and 3.x. However, a detailed study of the calculation cycles shows that the performance of the calculation is greatly affected by the register allocation of the variables and by the memory structure. The results of this study and the problems associated with these codes will now be described.
Referring to the case wherein the pseudo code 2.x is used, the variable N, having an r-bit length, is repetitively employed in the i-loop (lines 2.2 to 2.13) without its value being changed, and variables t and x_i are repetitively used in the j-loop (lines 2.6 to 2.11) without their values being changed. Therefore, once the variables N, t and x_i have been calculated or read from memory, they can be stored in the registers and repetitively employed until the loops are exited. The variables tmp and c that are used in the j-loop are directly allocated to the registers, and the other variables are stored in a general two-port memory wherein reading and writing to independent addresses can be performed. Further, it is assumed that the sequence of memory reading, calculation and memory writing is performed as a pipeline. However, in this case, since the calculation and the memory writing are each terminated after only one cycle, memory reading constitutes the performance bottleneck. For this reason, the number of cycles devoted to memory reading will now be considered.
First, for the process P=0 on line 2.1, this value need only be reset when the variable p_j is read the first time for the calculation performed on line 2.7, i.e., tmp=p_j+x_i·y_j+c. Therefore, no operation cycle is required.
For the calculations for variable t on lines 2.3 and 2.4, i.e., t=p_0+x_i·y_0 (mod 2^r) and t=t·N (mod 2^r), three cycles are required to read variables p_0, x_i and y_0, other than the variable N that is allocated to the register. It should be noted that the variable t need not be read because, as is described above, the result obtained on line 2.3 is directly allocated to the register. For "c=0" on line 2.5, no operation cycle is required, as is also the case for the process for P=0.
When the process sequence enters the j-loop, for the calculation tmp=p_j+x_i·y_j+c performed on line 2.7 two cycles are required to read variables p_j and y_j. The variable x_i need not be read here since it is read at the time the variable t is calculated on line 2.3. Further, the variable c also need not be read because it is directly allocated to a register, as was previously remarked.
For the calculation performed for line 2.8, i.e., tmp=tmp+t·n_j, one cycle is required to read the variable n_j, but the variables tmp and t need not be read since, as is described above, they are directly allocated to registers.
In the actual hardware operation for the calculations on lines 2.9 and 2.10, i.e.,
p_(j-1)=tmp (mod 2^r)
c=tmp/2^r,
"tmp+t·n_j" is written directly to the variables p_(j-1) and c, rather than being temporarily allocated to the register tmp. Thus, neither calculation requires an operation cycle.
For the calculation p_(m-1)=c performed on line 2.12, immediately after the process has exited the j-loop, no operation cycle is required by the pipeline process.
Therefore, the number of operation cycles in one iteration of the i-loop is 3+m(2+1)=3m+3. This is multiplied by the iteration count m of the i-loop, and m(3m+3)=3m^2+3m is obtained.
However, since p_j=0 when i=0, as is described above, p_j need not be read, and the input need only be reset to 0. Overall, therefore, for the i-loop, m is subtracted from the above result and
3m^2+3m-m=3m^2+2m
is obtained. In the comparison process in line 2.14, "if (P≥n)", which value is greater or smaller is determined by the subtraction operation "P-n", so that 2m cycles are required, and two cycles are required before the process finally exits the pipeline. Overall, therefore, the required number of cycles is 3m^2+4m+2. This is a case wherein the pipeline operation can be performed without being disturbed. When the number of operation bits is not very much greater than the number used by the multiplier, i.e., when the block count m is small, a value that is still to be written must be read immediately, so that a waiting overhead is added for the memory access. It should be noted that this problem does not occur when the block count is four or five. Further, since the use of the Montgomery method is not very effective for a small number of blocks, such as two or three, the above problem need not be taken into account. In other words, the coefficient "3" of the term m^2 is not greatly changed due to the double loop. Further, the pipeline process may be disturbed before or after the j-loop, and the finally obtained number of cycles may differ. However, even in this case, in the double loop, the coefficient "3" of the term m^2 is unchanged, and the final cycle number does not greatly differ from the above described value.
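The read-cycle tally for pseudo code 2.x can be reproduced with a short Python sketch. The function name cycles_2x is hypothetical, and the model assumes, as in the discussion above, an undisturbed pipeline in which only memory reads consume cycles.

```python
def cycles_2x(m):
    """Memory-read cycle count for pseudo code 2.x under the stated assumptions."""
    reads_for_t = 3                  # p_0, x_i and y_0 (lines 2.3 and 2.4)
    reads_per_j = 2 + 1              # p_j and y_j (line 2.7), n_j (line 2.8)
    per_i = reads_for_t + m * reads_per_j
    total = m * per_i                # m iterations of the i-loop
    total -= m                       # p_j is not read when i == 0
    total += 2 * m                   # subtraction P - n for the comparison (line 2.14)
    total += 2                       # cycles needed to exit the pipeline
    return total

# Example: a 512-bit operand with a 32-bit multiplier gives m = 16.
example_cycles = cycles_2x(16)       # equals 3*16**2 + 4*16 + 2
```

Keeping each contribution as a separate term makes it easy to see that only the inner-loop reads contribute to the coefficient of m^2.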
The pseudo code 3.x example will now be similarly discussed. For the process for "P=0" in line 3.1, the value of P need only be reset to 0 when p_j is read the first time for the calculation tmp=p_j+x_i·y_j+c performed on line 3.5, so that no operation cycle is required. For the process for "c=0" on line 3.3, no operation cycle is required, as was also the case in the pseudo code 2.x example.
When the process enters the first j-loop (line numbers 3.4 to 3.8), for the calculation performed on line 3.5, i.e., tmp=p_j+x_i·y_j+c, two cycles are required to read p_j and y_j, and one additional cycle is required to read x_i the first time in the first j-loop. For the calculations performed on lines 3.6 and 3.7, i.e.,
p_j=tmp (mod 2^r)
c=tmp/2^r,
no cycle is required since tmp is written directly to p_j and c, as in the pseudo code 2.x example. Therefore, the cycle count for the first j-loop is 2m+1.
After the process exits the first j-loop, because of the pipeline process the calculation performed on line 3.9, "p_m=c; c=0;", requires no cycle, while for the calculation performed on line 3.10, t=p_0·N (mod 2^r), two cycles are required to read p_0 and N.
Following this, the process enters the second j-loop (line numbers 3.11 to 3.15), and for the calculation performed on line 3.12, tmp=p_j+t·n_j+c, two cycles are required to read p_j and n_j. It should be noted that t and c need not be read because they are written directly to registers. In the following calculations performed on lines 3.13 and 3.14,
p_(j-1)=tmp (mod 2^r)
c=tmp/2^r,
no cycle is required, as in the pseudo code 2.x example. Therefore, the number of cycles required for the second j-loop is 2m. Then, after the process exits the second j-loop, for the calculation performed on line 3.16, i.e., p_(m-1)=p_m+c, one cycle is required to read p_m.
Therefore, the number of operation cycles required for one iteration of the i-loop is (2m+1)+2+2m+1=4m+4. This cycle count is multiplied by the iteration count m of the i-loop, and m(4m+4)=4m^2+4m is obtained. However, as is described above, p_j need not be read when i=0, and since the input must then merely be reset to 0, overall, the number of cycles required for the i-loop is decremented by m, and is 4m^2+4m-m=4m^2+3m. In the comparison process performed on line 3.18, i.e., if (P≥n), 2m cycles are required to use the subtraction operation "P-n" to determine which value is greater or smaller, and two cycles are required to exit the pipeline process. As a result, overall, the required number of cycles is 4m^2+5m+2. The number of cycles required by the pseudo code 3.x example is greater by approximately m^2 (m^2+m, to be exact) because, since the partial product addition to tmp is not completed during a single iteration of a j-loop, the variable p_j that is written to memory from line 3.4 to line 3.8 must be read again from line 3.11 to line 3.15. The pseudo code 3.x example nevertheless has an advantage over the pseudo code 2.x example because, although more cycles are required, the carry is smaller than that of the pseudo code 2.x by one bit and can be stored in an r-bit block, and fewer registers are required. Therefore, the pseudo code 3.x example is more advantageous with respect to complexity of control and circuit size. FIG. 7 is a circuit diagram showing a Montgomery multiplier data path unit for the pseudo code 3.x, and FIG. 8 is a timing chart for the vicinity of the final loop iteration performed by this multiplier when the block count m=4. It should be noted that, as in the pseudo code 2.x example, overall, the number of reading cycles can be slightly changed due to the number of pipeline stages.
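The corresponding tally for pseudo code 3.x can be sketched in Python as follows. The function name cycles_3x is hypothetical, the model again assumes an undisturbed pipeline in which only memory reads consume cycles, and the closed form 3m^2+4m+2 for pseudo code 2.x is taken from the discussion of that code.

```python
def cycles_3x(m):
    """Memory-read cycle count for pseudo code 3.x under the stated assumptions."""
    first_loop = 2 * m + 1           # p_j and y_j per step, plus x_i once (line 3.5)
    reads_for_t = 2                  # p_0 and N (line 3.10)
    second_loop = 2 * m              # p_j and n_j per step (line 3.12)
    read_p_m = 1                     # p_m (line 3.16)
    per_i = first_loop + reads_for_t + second_loop + read_p_m
    total = m * per_i                # m iterations of the i-loop
    total -= m                       # p_j is not read when i == 0
    total += 2 * m                   # subtraction P - n for the comparison (line 3.18)
    total += 2                       # cycles needed to exit the pipeline
    return total

def extra_cycles_vs_2x(m):
    """Difference between the 3.x count and the 2.x closed form 3m^2 + 4m + 2."""
    return cycles_3x(m) - (3 * m * m + 4 * m + 2)

# Example: m = 16 (512-bit operands, 32-bit blocks).
example_cycles = cycles_3x(16)
example_extra = extra_cycles_vs_2x(16)
```

The difference grows as m^2+m, so for large block counts the second pass over p_j dominates the extra cost.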
Consider the increase in the speed of the Montgomery multiplication relative to the memory reading timing. Since multiplication is performed twice in the double loop, it seems that the processing speed can be doubled simply by implementing two multipliers. However, as is described above, the memory reading constitutes a bottleneck. In order to eliminate this bottleneck, a three-port memory having two read ports for independent reading and one write port can be employed. The operation speed can be increased by using a three-port memory; however, the technology libraries that support such a memory are limited. Therefore, the use of a three-port memory may become a barrier to registering the Montgomery multiplier as an IP core macro. Further, a three-port memory requires a large circuit size, and does not satisfy the demand for size reduction.
It is, therefore, one object of the present invention to resolve the problem posed by a memory access bottleneck in a Montgomery multiplier, while using a general-purpose memory having two ports or only a single port. Thus, higher performance and element downsizing can be attained, and a lower cost cryptography circuit can be provided.
The specific effects obtained by the invention are as follows. The bottleneck encountered with the Montgomery multiplier when memory is accessed can be eliminated by using general two-port memories or single-port memories. As a result, high performance can be attained while the circuit size is reduced, and low-cost cryptographic hardware can be provided.
According to the arithmetic circuit and the arithmetic method, variables that are required for an arithmetic operation performed during a single cycle and that must be read from memory can be stored in different memories, and can be read at the same time during a single reading stage. In this manner, reading wait time can be minimized, and the performance of calculations can be improved. Furthermore, a complicated circuit or a circuit that requires a large hardware resource, such as a three-port memory, need not be employed as the memory used for storing variables, and instead, by employing a general-purpose memory having two ports or only a single port, an IP core can be easily formed, and the circuit size can be minimized.
According to the present invention, an arithmetic circuit comprises: multiple registers; an arithmetic unit that receives, as inputs, values entered in the multiple registers; and multiple memories, wherein the reading of multiple variables from the multiple memories to the multiple registers is performed during the same reading stage of a pipeline process performed by the arithmetic unit. Furthermore, according to the present invention, an arithmetic method using an arithmetic circuit that includes an arithmetic unit, which has multiple input registers, and multiple memories, comprises the steps of: performing an arithmetic operation based on values stored in the input registers; writing the results of the arithmetic operation in the input registers or the memories; and reading multiple variables from the multiple memories and storing the variables in the multiple input registers during the same pipeline stage.
The arithmetic unit can be a multiplier adder for, based on values x1, x2, x3 and x4 having an r-bit length that are respectively input to a first register, a second register, a third register and a fourth register, providing the operation result Q for x1+x2·x3+x4 having a length of 2r bits or 2r+1 bits. In this case, the multiple memories include a first memory and a second memory. At a stage for writing an operation result, which follows the operation stage of the pipeline process, lower r bits QL of the operation result Q are recorded in the first memory, and upper bits QH of the operation result Q, excluding the bits QL, are recorded in the fourth register, while at a stage for reading variables to the registers, which follows the writing stage, simultaneously, a variable x1 is read from the first memory and is stored in the first register, and a variable x3 is read from the second memory and is stored in the third register. Further, the first memory and the second memory can be two-port memories having one data writing port and one data reading port, or the first memory can be a two-port memory having one data writing port and one data reading port, while the second memory can be a single-port memory having one port for the writing and reading of data.
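The behavior of the four-input multiplier adder can be illustrated with a small Python model. This is a functional sketch only and does not model pipeline timing or memory ports; the names mul_add4 and add_row are hypothetical, and the mapping of p_j, x_i, y_j and the carry onto x1 to x4 is an assumption made for the example, following the first j-loop of pseudo code 3.x.

```python
def mul_add4(x1, x2, x3, x4, r):
    """Model Q = x1 + x2*x3 + x4 for r-bit inputs; returns (QL, QH).

    QL is the lower r bits of Q and QH is the remaining upper bits.
    """
    q = x1 + x2 * x3 + x4
    return q % (1 << r), q >> r

def add_row(p, x_i, y, r):
    """Compute P + x_i*Y block by block using mul_add4.

    p and y are equal-length lists of r-bit blocks, least significant first.
    Returns the updated blocks followed by the final carry block.
    """
    carry = 0                       # plays the role of the fourth register (QH feedback)
    out = []
    for p_j, y_j in zip(p, y):      # p_j from the first memory, y_j from the second memory
        ql, carry = mul_add4(p_j, x_i, y_j, carry, r)
        out.append(ql)              # QL is written back to the first memory
    return out + [carry]
```

With all four r-bit inputs at their maximum, Q equals 2^(2r)-1, so in this configuration QH still fits in r bits and can be fed back through the fourth register.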
In addition, the arithmetic unit can be a multiplier adder for, based on values x1, x2, x3, x4, x5 and x6, having an r-bit length, that are respectively input to a first register, a second register, a third register, a fourth register, a fifth register and a sixth register, providing the operation result Q for x1+x2·x3+x4·x5+x6, which has a length of 2r bits or 2r+1 bits. In this case, the multiple memories include a first memory, a second memory and a third memory. At a stage for writing an operation result, which follows the operation stage of the pipeline process, lower r bits QL of the operation result Q are recorded in the first memory, and upper bits QH of the operation result Q, excluding the bits QL, are recorded in the sixth register, and at a stage for reading variables to the registers, which follows the writing stage, simultaneously, a variable x1 is read from the first memory and is stored in the first register, a variable x3 is read from the second memory and is stored in the third register, and a variable x5 is read from the third memory and is stored in the fifth register. Further, the first memory can be a two-port memory having one data writing port and one data reading port, and the second memory and the third memory can be single-port memories having one port for the writing and the reading of data.
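The six-input multiplier adder can be modeled in the same functional style. The names mul_add6 and reduce_row are hypothetical, pipeline timing and memory ports are not modeled, and the assignment of p_j, x_i, y_j, t, n_j and the carry to x1 through x6 is an assumption for illustration, following the j-loop of pseudo code 2.x.

```python
def mul_add6(x1, x2, x3, x4, x5, x6, r):
    """Model Q = x1 + x2*x3 + x4*x5 + x6; returns (QL, QH).

    QL is the lower r bits of Q and QH is the remaining upper bits.
    """
    q = x1 + x2 * x3 + x4 * x5 + x6
    return q % (1 << r), q >> r

def reduce_row(p, x_i, y, t, n_blocks, r):
    """One pass of the j-loop of pseudo code 2.x (lines 2.5 to 2.12).

    p, y and n_blocks are equal-length lists of r-bit blocks, least
    significant first. Returns the blocks of (P + x_i*Y + t*n) / 2^r,
    assuming t was chosen so that the lowest block of the sum is zero.
    """
    m = len(p)
    out = [0] * m
    c = 0                                    # sixth register (QH feedback)
    for j in range(m):
        # p[j], y[j] and n_blocks[j] come from three different memories,
        # so all three could be fetched in the same reading stage.
        ql, c = mul_add6(p[j], x_i, y[j], t, n_blocks[j], c, r)
        if j != 0:
            out[j - 1] = ql                  # store one block lower (r-bit shift)
    out[m - 1] = c
    return out
```

With every r-bit input at its maximum the result Q needs 2r+1 bits, which is why QH can be one bit wider than a block in this configuration.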