1. Field of the Invention
The present invention relates to an arithmetic device for performing multiple precision arithmetic for Montgomery multiplication residue arithmetic.
2. Description of the Related Art
Recently, encrypted data communication and user authentication (digital signatures) have come into common use as security measures on computer networks. These measures widely employ cryptographic processes such as RSA (Rivest-Shamir-Adleman) cryptography and elliptic curve cryptography.
Since RSA cryptography and elliptic curve cryptography process numbers with a large number of bits, the Montgomery multiplication residue arithmetic used in these cryptographic processes frequently relies on a multiple precision arithmetic algorithm, which can be easily realized by software and hardware. A multiple precision arithmetic algorithm divides a number with a large number of bits into blocks of a bit width that software and hardware can process easily, and then repeats operations in block units.
The algorithm shown in FIG. 1A is well known (for example, refer to Patent Literature 1: Japanese Patent Application Laid-open No. H11-212456) as a multiple precision arithmetic algorithm for Montgomery multiplication residue arithmetic. In FIG. 1A, A and B are integers, N is the modulus of residue arithmetic, and Y is the arithmetic result. The algorithm shown on the left of FIG. 1A can be transformed into the one shown on the right. The algorithm performs the operation Y = A·B·2^{-kg} mod N, and includes the processes (1) through (6).
“g” indicates the number of blocks of A, B, and N when each is divided into blocks of k bits, and the number of blocks of Y is g+1. a_i, b_i, n_i, and y_i respectively indicate the i-th block of A, B, N, and Y. A, B, N, and Y are expressed as follows using a_i, b_i, n_i, and y_i.
A = (a_{g-1}, a_{g-2}, . . . , a_1, a_0)
B = (b_{g-1}, b_{g-2}, . . . , b_1, b_0)
N = (n_{g-1}, n_{g-2}, . . . , n_1, n_0)
Y = (y_g, y_{g-1}, . . . , y_1, y_0)
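The block representation above can be illustrated with a minimal sketch. The helper names to_blocks and from_blocks are hypothetical and do not appear in FIG. 1A; blocks are held least significant first, so index i corresponds to the i-th block.

```python
def to_blocks(x, k, g):
    """Split a non-negative integer x into g blocks of k bits each.

    Index i of the returned list holds the i-th block, so index 0 is the
    least significant block (a_0, b_0, n_0, or y_0 in the text).
    """
    mask = (1 << k) - 1
    return [(x >> (k * i)) & mask for i in range(g)]


def from_blocks(blocks, k):
    """Reassemble an integer from k-bit blocks (least significant first)."""
    return sum(block << (k * i) for i, block in enumerate(blocks))
```

For example, with k=8 and g=4, the value 0x12345678 splits into the blocks 0x78, 0x56, 0x34, and 0x12, in order from the 0-th block.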
2^{-kg} is the reciprocal of 2^{kg} with N as the modulus of residue arithmetic. n′_0 indicates the least significant block (0-th block) obtained when N′, which satisfies R·R^{-1} − N·N′ = 1 (0 ≦ R^{-1} < N, 0 ≦ N′ < R) where R = 2^{kg}, is divided into blocks of k bits. c1, c2, m, and tmp are k-bit work variables, and i and j are loop variables.
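The constants defined above can be computed as in the following minimal sketch. It assumes Python 3.8 or later (for the modular inverse form of the built-in pow) and an odd modulus N smaller than R; the function name montgomery_constants is hypothetical.

```python
def montgomery_constants(N, k, g):
    """Return (R, R_inv, N_prime, n0_prime) for an odd modulus N < 2**(k*g).

    R_inv and N_prime satisfy R*R_inv - N*N_prime == 1 with
    0 <= R_inv < N and 0 <= N_prime < R, as in the text.
    """
    R = 1 << (k * g)
    R_inv = pow(R, -1, N)                  # reciprocal of R modulo N
    N_prime = (R * R_inv - 1) // N         # exact division by construction
    n0_prime = N_prime & ((1 << k) - 1)    # least significant k-bit block of N'
    return R, R_inv, N_prime, n0_prime
```

A consequence of the defining relation is that n_0·n′_0 + 1 is a multiple of 2^k, which is why multiplying by n′_0 in the process (2) yields a value m that clears the least significant block.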
When the algorithm shown in FIG. 1A, except for the process (6), is represented by a circuit, the circuit shown in FIG. 1B is considered the most likely realization. In the circuit shown in FIG. 1B, each of A, B, and N is divided into g blocks, and each block is input to a block unit arithmetic unit 101. Arithmetic results y′_i and y′_{i-1} are stored in the blocks y_i and y_{i-1} of Y, respectively. At this time, y_i required in the operation is read from Y and input to the block unit arithmetic unit 101.
The block unit arithmetic unit 101 is provided with registers 111, 112, 113, 114, 115, 121, 122, 123, 124, 125, and 126, selectors 116, 117, 118, and 119, and a multiplier-adder 120, as shown at the lower portion of FIG. 1B. The multiplier-adder 120 performs the operation A×B+C+D on the inputs A, B, C, and D chosen by the selectors 116 through 119. The relationship between the circuit shown in FIG. 1B and the processes (1) through (5) of the algorithm shown in FIG. 1A is described below.
(c1, tmp) = a_i*b_j + y_i + c1   Process (1)
The selector 116 selects A=a_i, the selector 117 selects B=b_j, the selector 118 selects C=c1, and the selector 119 selects D=y_i. The most significant k bits of the output of the multiplier-adder 120 are stored in the register 121 as c1, and the least significant k bits are stored in the register 126 as tmp.
m = tmp*n′_0   Process (2)
The selector 116 selects A=tmp, the selector 117 selects B=n′_0, the selector 118 selects C=0, and the selector 119 selects D=0. The least significant k bits of the output of the multiplier-adder 120 are stored in the register 125 as m.
(c2, tmp) = m*n_i + tmp + c2   Process (3)
The selector 116 selects A=n_i, the selector 117 selects B=m, the selector 118 selects C=tmp, and the selector 119 selects D=c2. The most significant k bits of the output of the multiplier-adder 120 are stored in the register 122 as c2, and the least significant k bits are stored in the register 126 as tmp.
(c2, y_{i-1}) = m*n_i + tmp + c2   Process (4)
The selector 116 selects A=n_i, the selector 117 selects B=m, the selector 118 selects C=tmp, and the selector 119 selects D=c2. The most significant k bits of the output of the multiplier-adder 120 are stored in the register 122 as c2, and the least significant k bits are stored in the register 124 as y′_{i-1}. The contents of the register 124 are stored in y_{i-1} of Y.
(y_i, y_{i-1}) = y_i + c1 + c2   Process (5)
The selector 116 selects A=y′_i, the selector 117 selects B=1, the selector 118 selects C=c1, and the selector 119 selects D=c2. The most significant k bits of the output of the multiplier-adder 120 are stored in the register 123 as y′_i, and the least significant k bits are stored in the register 124 as y′_{i-1}. The contents of the registers 123 and 124 are stored in y_i and y_{i-1} of Y, respectively.
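The role of the multiplier-adder in a block-wise Montgomery multiplication can be sketched as follows. This is a minimal sketch, assuming Python 3.8 or later and an odd modulus; the names mul_add and mont_mul_blocks are hypothetical, and the loop arrangement is a textbook word-serial form chosen for illustration, not necessarily the process order of FIG. 1A. Every arithmetic step goes through the single A×B+C+D primitive, whose result always fits in 2k bits because (2^k−1)^2 + 2(2^k−1) = 2^{2k}−1.

```python
def mul_add(a, b, c, d, k):
    """Model of a k-bit multiplier-adder computing a*b + c + d.

    Returns (most significant k bits, least significant k bits).
    """
    t = a * b + c + d
    return t >> k, t & ((1 << k) - 1)


def mont_mul_blocks(A, B, N, k, g):
    """Return Y with Y = A*B*2**(-k*g) (mod N) and Y < 2*N, for A, B < N.

    The final conditional subtraction (process (6)) is not included.
    """
    mask = (1 << k) - 1
    a = [(A >> (k * i)) & mask for i in range(g)]
    b = [(B >> (k * i)) & mask for i in range(g)]
    n = [(N >> (k * i)) & mask for i in range(g)]
    # n'_0 is the negated inverse of n_0 modulo 2**k (N must be odd).
    n0_prime = (-pow(n[0], -1, 1 << k)) & mask

    y = [0] * (g + 2)  # g+1 result blocks plus one scratch block
    for i in range(g):
        # Accumulate A * b_i into y.
        c = 0
        for j in range(g):
            c, y[j] = mul_add(a[j], b[i], y[j], c, k)
        c, y[g] = mul_add(y[g], 1, c, 0, k)
        y[g + 1] = c
        # Choose m so that the least significant block becomes zero.
        _, m = mul_add(y[0], n0_prime, 0, 0, k)
        # Add m * N and shift right by one block.
        c, _ = mul_add(m, n[0], y[0], 0, k)
        for j in range(1, g):
            c, y[j - 1] = mul_add(m, n[j], y[j], c, k)
        c, y[g - 1] = mul_add(y[g], 1, c, 0, k)
        y[g] = y[g + 1] + c
    return sum(y[i] << (k * i) for i in range(g + 1))
```

Using one primitive for every step mirrors the circuit, where the selectors only change which operands reach the multiplier-adder.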
When the process (6) of the algorithm shown in FIG. 1A is added to the circuit shown in FIG. 1B to complete a circuit for the multiple precision arithmetic algorithm for Montgomery multiplication residue arithmetic, a multiple precision subtraction as shown in FIG. 1C is adopted, as in the processes other than the process (6): Y and N are divided into blocks of k bits, and an operation is repeated on each block. In most cases, the circuit shown in FIG. 1D is obtained.
FIG. 1D shows a circuit obtained by transforming the multiple precision subtraction shown at the upper portion of FIG. 1C into the addition shown at the lower portion of FIG. 1C using the two's complement representation. A two's complement is used to represent a negative value in a computer, and is obtained by adding 1 to a one's complement. A one's complement is obtained by inverting a bit pattern.
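The transformation of a subtraction into an addition can be illustrated for a single k-bit block with the following minimal sketch. The function name sub_via_complement is hypothetical.

```python
def sub_via_complement(y, n, k):
    """Compute y - n on k-bit values as y + (one's complement of n) + 1.

    Returns (carry_out, result). carry_out is 1 exactly when y >= n;
    otherwise result holds y - n wrapped modulo 2**k.
    """
    mask = (1 << k) - 1
    ones_complement = n ^ mask      # invert every bit of n
    t = y + ones_complement + 1     # adding 1 gives the two's complement
    return t >> k, t & mask
```

With k=4, subtracting 5 from 9 gives a carry of 1 and a result of 4, while subtracting 9 from 5 gives a carry of 0 and a result of 12, which is −4 modulo 16.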
In FIG. 1D, the block unit arithmetic unit 101 shown in FIG. 1B is replaced with a block unit arithmetic unit 131. The block unit arithmetic unit 131 has a configuration obtained by adding an inverter/non-inverter 141 to the block unit arithmetic unit 101.
In the operations of the circuit shown in FIG. 1D, the operation corresponding to the processes (normal process) other than the process (6) of the algorithm shown in FIG. 1A is the same as that of the circuit shown in FIG. 1B. Therefore, only the operation corresponding to the process (6) (last process) is described below. In this process, the inverter/non-inverter 141 inverts and outputs n_i according to a last process status signal, y′_i of the register 123 is initialized to 1, and y′_{i-1} output from the block unit arithmetic unit 131 is stored in y_i of Y.
    (1) y′_i is initialized to 1.
    (2) The following operations are repeated in the range of 0≦i≦g.
The selector 116 selects A=inversion (one's complement) of n_i, the selector 117 selects B=1, the selector 118 selects C=y′_i (carry from the result of y_{i-1}−n_{i-1}), and the selector 119 selects D=y_i. The most significant k bits of the output of the multiplier-adder 120 are stored in the register 123 as y′_i.
    (3) When bit 0 (least significant bit) of y′_i is 1 (Y≧N), control is passed to (4). When bit 0 of y′_i is 0 (Y<N), control is passed to (5).
    (4) The following operations are repeated in the range of 0<i<g.
The selector 116 selects A=inversion (one's complement) of n_i, the selector 117 selects B=1, the selector 118 selects C=y′_i (carry from the result of y_{i-1}−n_{i-1}), and the selector 119 selects D=y_i. The most significant k bits of the output of the multiplier-adder 120 are stored in the register 123 as y′_i, and the least significant k bits are stored in the register 124 as y′_{i-1}. The contents of the register 124 are stored in y_i of Y.
    (5) The operations are terminated.
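The last process can be sketched in software as a block-wise conditional subtraction. This is a minimal sketch under stated assumptions: the function name final_subtract is hypothetical, N is padded with a zero block to match the g+1 blocks of Y, and the differences are kept in a temporary list so that one pass suffices, whereas the circuit described above makes one pass to obtain the carry and a second pass to store the results.

```python
def final_subtract(y_blocks, n_blocks, k):
    """Return the blocks of Y - N if Y >= N, otherwise the blocks of Y.

    y_blocks holds g+1 blocks and n_blocks holds g blocks, both least
    significant first. Each step computes (inverted n_i)*1 + carry + y_i,
    the same A*B+C+D form as the multiplier-adder.
    """
    mask = (1 << k) - 1
    n_ext = list(n_blocks) + [0]    # pad N to g+1 blocks
    carry = 1                       # the "+1" of the two's complement
    diff = []
    for y_i, n_i in zip(y_blocks, n_ext):
        t = (n_i ^ mask) * 1 + carry + y_i
        carry = t >> k              # carry into the next block
        diff.append(t & mask)
    # A final carry of 1 means Y >= N, so the difference is the result.
    return diff if carry == 1 else list(y_blocks)
```

With k=4 and N=239 (blocks 15, 14), the value Y=300 (blocks 12, 2, 1) is reduced to 61 (blocks 13, 3, 0), while Y=108 (blocks 12, 6, 0) is returned unchanged.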
However, the above-mentioned hypothetical multiple precision arithmetic circuit has the following problems.
In the circuit shown in FIG. 1D, the circuits storing A, B, N, Y, c1, c2, m, and tmp are normally configured by a RAM (random access memory) or FFs (flip-flops) synchronous with an operation clock. Therefore, the multiplier-adder 120 and the selectors 116 through 119 immediately before it lie on the path between the output (A, B, N, Y, c1, c2, m, and tmp) of the RAM or FF and the input (Y, c1, c2, m, and tmp) of the RAM or FF. Consequently, if the total delay time of the multiplier-adder 120 and the selectors 116 through 119 is longer than the period of the operation clock, the circuit does not operate correctly.
Therefore, the bottleneck in improving the operation frequency of the circuit shown in FIG. 1D is the maximum delay path <n_i → inverter/non-inverter 141 → selector 116 → multiplier-adder 120 → (c1, c2, y′_i, y_{i-1}, m, tmp)> of the block unit arithmetic unit 131, and the problem is to shorten this path.
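The relationship between the path delay and the operation frequency can be illustrated with a simplified model. This is a rough sketch only: the function name max_clock_mhz and the stage names are hypothetical, any delay values passed in are illustrative rather than measured, and register setup time and clock skew are ignored.

```python
def max_clock_mhz(stage_delays_ns):
    """Upper bound on the clock frequency for one register-to-register path.

    stage_delays_ns maps each stage on the path to its delay in
    nanoseconds. The clock period must be at least the total delay.
    """
    total_ns = sum(stage_delays_ns.values())
    return 1000.0 / total_ns


# Hypothetical delays for the maximum delay path of FIG. 1D.
with_inverter = {"inverter/non-inverter": 0.5, "selector": 0.5, "multiplier-adder": 9.0}
# The same path with the inverter/non-inverter stage removed.
without_inverter = {"selector": 0.5, "multiplier-adder": 9.0}
```

Under this model, every stage removed from the maximum delay path raises the attainable frequency, which is why shortening that path is the stated problem.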