The invention relates to the field of microprocessors, and, more particularly, to a modular arithmetic coprocessor that performs non-modular operations.
The Montgomery method makes it possible to carry out modular computations in a finite field (or Galois field) denoted as GF($2^n$), without the performance of divisions. Conventionally, modular operations on GF($2^n$) are used in cryptography for applications such as authentication of messages, identification of a user, and exchange of cryptographic keys. Exemplary applications are described in the French Patent Application No. 2,679,054.
There are commercially available integrated circuits dedicated to such applications. These include, for example, the product referenced as ST16CF54, which is manufactured by SGS-THOMSON MICROELECTRONICS. This product is built around a central processing unit and an arithmetic coprocessor, and is dedicated to the performance of modular computations. The coprocessor enables the processing of modular multiplication operations using the Montgomery method. Further information on this coprocessor can be found in the U.S. Pat. No. 5,513,133.
The basic operation, called a Pfield operation, is implemented by this coprocessor. From three binary data elements A (multiplicand), B (multiplier), and N (modulo), each encoded on a whole number n of bits, it produces a binary data element denoted as Pfield(A, B)N, which is encoded on n bits such that Pfield(A, B)N=A*B*I mod N. I is a binary data element, called an error, which is encoded on n bits such that I=$2^{-n}$ mod N. More specifically, the value of I depends on the number of k bit blocks considered for the encoding of A, with k being an integer. To perform the operation A*B*I mod N, the data elements are assumed to have been encoded on m words of k bits, with m and k being integers and m*k=n. The words of the data elements A and B are provided to a multiplication circuit having a series input to receive B, a parallel input to receive the k bit blocks of A, and a series output.
In the referenced U.S. Pat. No. 5,513,133, the coprocessor operates with k=32 and m=8 or 16. The coprocessor may be used to produce the result of the modular multiplication A*B mod N. The modular multiplication can be subdivided into two successive elementary Pfield operations. That is, Pfield(Pfield(A, B)N, H)N is computed, with H being a data element encoded on n bits, called an error correction parameter, which is equal to $2^{2n}$ mod N. For further details on the implementation of modular multiplication, reference may be made to the above referenced U.S. patent. Several possibilities of computation are already known. They include the use of either a software method or a specialized circuit, such as the one illustrated in the referenced U.S. patent.
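The way two successive Pfield operations cancel the error I can be checked numerically. The following is a minimal reference sketch in Python (3.8 or later), not a model of the coprocessor hardware. The function names `pfield` and `modmul_via_pfield` are hypothetical, and the sketch assumes N is odd so that $2^n$ has an inverse modulo N.

```python
def pfield(a: int, b: int, n_mod: int, n_bits: int) -> int:
    """Reference model of Pfield(A, B)N = A*B*I mod N, with I = 2^(-n) mod N."""
    # Modular inverse of 2^n; assumes n_mod is odd (Python 3.8+ syntax).
    i_err = pow(2, -n_bits, n_mod)
    return (a * b * i_err) % n_mod


def modmul_via_pfield(a: int, b: int, n_mod: int, n_bits: int) -> int:
    """A*B mod N obtained from two successive Pfield operations."""
    # Error correction parameter H = 2^(2n) mod N.
    h = pow(2, 2 * n_bits, n_mod)
    # The first call leaves a factor I; the second contributes H*I = 2^n,
    # which cancels the remaining I = 2^(-n).
    return pfield(pfield(a, b, n_mod, n_bits), h, n_mod, n_bits)
```

For instance, with the illustrative values n=16 and N=0xF123, `modmul_via_pfield(0x1234, 0x0BEE, 0xF123, 16)` equals `(0x1234 * 0x0BEE) % 0xF123`.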
The circuit illustrated in FIG. 1 includes three shift registers 10, 11 and 12 with a series input and output. These registers include n number of cells, with n=m*k. Multiplexers 13, 14 and 15 are placed respectively before the inputs of the registers 10, 11 and 12. The circuit also includes three registers 16, 17 and 18 with a series input and a parallel output, with each register having k cells. Two multiplication circuits 19 and 20 include a series input, a parallel input, and a series output. The circuit further includes two k-cell registers 21 and 22, multiplexers 24, 25, 26, 36, 37 and 38, a demultiplexer 39, series subtraction circuits 27, 28 and 29, series addition circuits 30 and 31, delay cells 32, 33 and 34 to delay the propagation of binary data elements by k cycle periods, and a comparison circuit 35. For further details on the arrangements of the different elements with respect to each other, reference may be made to the referenced U.S. patent.
The use of the circuit shown in FIG. 1 optimizes the processing of modular operations, in terms of computing duration, memory size, etc., for a fixed data size, e.g., in this case 256 or 512 bits. Cryptography requires machines with increasingly high performance levels, operating at increasingly high speeds, and using increasingly complex cryptographic keys. The trend is towards the manipulation of data elements encoded on 768, 1024, 1536 and even 2048 bits. To process data elements of this size, it may be necessary to construct larger-size circuits by adapting the elements of the circuit to the sizes of the data.
This approach may raise problems in applications such as chip cards, wherein the size of the circuit is physically limited because of differences in mechanical bending stresses between the cards and the silicon substrates. Furthermore, it is becoming increasingly necessary to integrate larger numbers of different functional elements in a card of this kind. The space available for an encryption circuit is thereby correspondingly reduced. Approaches therefore need to be found to limit the increase in the size of this circuit while, at the same time, enabling optimum operation for data elements with a size greater than the size of the initially planned registers.
To carry out modular operations using operands with a size greater than that managed by the coprocessor, it is possible to use the circuit 1 shown in FIG. 2. In practice, the maximum size is equal to the size of the registers. Circuit 1 includes a standard processor 2 (8, 16 or 32 bits), a memory 3, the coprocessor 4 of FIG. 1, and a communications bus 5 used to connect the different elements 2, 3 and 4 together and/or external to the circuit 1. In the circuit of FIG. 2, the coprocessor 4 is used as a multiplier operating on m*k bits, which is conventionally 256 or 512 bits. The processor 2 is used, in particular, to supervise operations to be performed according to a particular encryption algorithm, and the data exchanges between the memory 3 and the coprocessor 4.
Performance of the basic operation of modular computations according to the Montgomery method, known as the Pfield operation, is based upon three binary data elements. These data elements are A (multiplicand), B (multiplier) and N (modulo), which are encoded on a whole number of n bits. They are used for the production of a binary data element referenced as P(A, B)N encoded on n bits such that P(A, B)N=A*B*I mod N. I is an error due to the Montgomery method. Should n have a size greater than the size of the registers, namely m*k, it is appropriate to subdivide n into p words of Bt bits. Bt is a working base with a size smaller than or equal to m*k, e.g., m*k. The Montgomery method operates as follows. The variable i is an index varying from 0 to $m-1$, and the following computation loop is repeated:
$X = S_i + A_i * B$,
$Y_0 = (X * J_0) \bmod 2^{Bt}$,
$Z = X + (N * Y_0)$,
$S_{i+1} = Z \backslash 2^{Bt}$, where $\backslash$ is a whole number division,
if $S_{i+1}$ is greater than N, then N is subtracted from $S_{i+1}$,
$A_i$ corresponds to a word of Bt bits of the breakdown of A, and
$S_i$ corresponds to an updated result of the Pfield operation, and $S_m$=P(A, B)N=A*B*I mod N.
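The loop above can be sketched in Python (3.8 or later) using unbounded integers in place of the registers. This is an illustrative sketch only. The text does not define $J_0$, so the sketch assumes the usual Montgomery constant $J_0 = -N^{-1} \bmod 2^{Bt}$ (which requires N odd). The function name and the `num_words` parameter (the number of Bt bit words of A) are hypothetical. The final comparison uses "greater than or equal to" so that the result is always fully reduced.

```python
def montgomery_pfield(a: int, b: int, n_mod: int, bt: int, num_words: int) -> int:
    """Word-serial Montgomery loop; returns A*B*2^(-num_words*bt) mod N.

    Assumes n_mod is odd, b < n_mod, and a < 2^(num_words*bt).
    """
    mask = (1 << bt) - 1
    # Assumption: J0 = -N^(-1) mod 2^Bt, the standard Montgomery constant.
    j0 = (-pow(n_mod, -1, 1 << bt)) & mask
    s = 0
    for i in range(num_words):
        a_i = (a >> (i * bt)) & mask   # A_i: i-th Bt bit word of A
        x = s + a_i * b                # X = S_i + A_i*B
        y0 = (x * j0) & mask           # Y0 = (X*J0) mod 2^Bt
        z = x + n_mod * y0             # Z = X + N*Y0; low Bt bits are zero
        s = z >> bt                    # S_{i+1} = Z \ 2^Bt (exact division)
        if s >= n_mod:                 # single conditional subtraction
            s -= n_mod
    return s
```

With Bt=8 and four words, the returned value matches `a * b * pow(2, -32, n_mod) % n_mod`, which corresponds to the error I described above.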
A computation method of this kind requires a larger number of data exchanges between the coprocessor 4 and the memory 3. The coprocessor 4 of FIG. 1 can carry out only simple operations of multiplication such as A*B=S. A and B are encoded on Bt bits and S is encoded on 2*Bt bits. One approach proposed in U.S. Pat. No. 5,987,489 includes the coprocessor 4 performing an operation of the type S=A*B+C, in which A, B and C are encoded on Bt bits, and S is encoded on 2*Bt bits.
FIG. 3 shows a coprocessor 4 according to the referenced U.S. Pat. No. 5,987,489. The coprocessor 4 illustrated in FIG. 3 includes three shift registers 110, 111 and 112, each with a serial input and a serial output. These registers include a number of n cells, and n=m*k, where n, m and k are integers. A multiplexer 113 includes three serial inputs and one serial output. The serial output is connected to the input of the register 110, the first input is connected to a first input terminal 150, and the second input is connected to the output of the register 110. A multiplexer 114 includes two serial inputs and one serial output. The serial output is connected to the input of the register 111, and the first input is connected to a second input terminal 151. A multiplexer 115 includes three serial inputs and one serial output. The serial output is connected to the input of the register 112, the first input is connected to a third input terminal 152, and the second input is connected to the output of the register 112.
The coprocessor 4 further includes three k-cell registers 116, 117 and 118 each having a serial input and a parallel output. The input of the register 117 is connected to a fourth input terminal 153. Two multiplication circuits 119 and 120 include a serial input, a parallel input to receive k bits, and a serial output. Two registers 121 and 122, for the storage of k cells, include a parallel input and a parallel output. The input of the register 121 is connected to the output of the register 116, the output of the register 121 is connected to the parallel input of the multiplication circuit 119, and the output of the register 122 is connected to the parallel input of the multiplication circuit 120.
A multiplexer 123 includes two parallel inputs and one parallel output. The first input of the multiplexer 123 is connected to the output of the register 117, the second input of the multiplexer 123 is connected to the output of the register 118, and the output of the multiplexer 123 is connected to the input of the register 122. Two multiplexers 124 and 125 each include two serial inputs and one serial output. The output of the multiplexer 124 is connected to the input of the register 116, the first input of the multiplexer 124 is connected to a fifth input terminal 154, the output of the multiplexer 125 is connected to the serial input of the multiplication circuit 119, and the first input of the multiplexer 125 is for receiving a logic zero.
A multiplexer 126 includes three serial inputs and one serial output. The output is connected to the serial input of the multiplication circuit 120, and the first input is for receiving a logic zero. Subtraction circuits 127, 128 and 129 each include two serial inputs and one serial output. The first input of the circuit 127 is connected to the output of the register 110, the output of the circuit 127 is connected to each of the second inputs of the multiplexers 124 and 125 and also to an output terminal 155, and the first input of the circuit 128 is connected to the output of the register 111.
An addition circuit 130 includes two serial inputs and one serial output. The first input of the circuit 130 is connected to the output of the circuit 119, and the output of the circuit 130 is connected to the second input of the multiplexer 126. An addition circuit 131 includes two serial inputs, one serial output and one carry output. The carry output of the circuit 131 is connected to the first input of the circuit 129. Delay cells 132, 133 and 134 delay the propagation of binary data by k cycle times. These cells are typically k bit shift registers. These cells include one serial input and one serial output. The output of the cell 132 is connected firstly to the third input of the multiplexer 126 and secondly to the input of the cell 133. The output of the cell 133 is connected to the second input of the circuit 129. The input of the cell 134 is connected to the output of the circuit 130, and the output of the cell 134 is connected to the first input of the circuit 131.
A comparison circuit 135 includes two serial inputs and two outputs. The first input is connected to the output of the circuit 131, and the second input is connected to the output of the circuit 129. Two multiplexers 136 and 137 each include two serial inputs, one selection input and one serial output. Each of the first inputs is for receiving a logic zero. Each of the selection inputs is connected to one of the outputs of the circuit 135. The output of the multiplexer 136 is connected to the second input of the circuit 127, and the output of the multiplexer 137 is connected to the second input of the circuit 128.
A multiplexer 138 includes two serial inputs and one serial output. The first input is for receiving a logic 1, the second input is connected to the output of the register 112, and the output is connected firstly to the input of the cell 132 and secondly to the second inputs of the multiplexers 136 and 137. A demultiplexer 139 includes one serial input and two serial outputs. The input is connected to the output of the circuit 120, and the outputs are connected respectively to the input of the register 118 and to the second input of the circuit 131. A multiplexer 140 includes two serial inputs and one serial output. The first input is connected to the output of the circuit 128, the second input is for receiving a logic 0, and the output is connected to the second input of the circuit 130. A multiplexer 141 includes two serial inputs and one serial output. The first input is connected to the output of the circuit 130, the second input is connected to the output of the circuit 131, and the output is connected to the third inputs of the multiplexers 113 and 115 and to the second input of the multiplexer 114. Two output terminals 156 and 157 are respectively connected to the outputs of the registers 111 and 112.
In the referenced U.S. Pat. No. 5,987,489 one alternative variation shows a circuit that enables the performance of the elementary operation S=A*B+C+D, with A, B, C and D encoded on Bt bits and S encoded on 2*Bt bits. An object of this alternative variation is to carry out a multiplication on p*Bt bits, and an addition on p*Bt bits simultaneously to obtain the computation of X=Si+Ai*B and Z=X+(N*Y0) of the Montgomery algorithm at a higher speed.
If the Montgomery algorithm set up by elementary operations of the S=A*B+C+D type is developed, the following loop repetition is obtained.
A) Computation of $X = S_i + A_i * B$ for providing $X_p \ldots X_0 = S_{i,p-1} \ldots S_{i,0} + A_i * B_{p-1} \ldots B_0$, with $X_j$, $S_{i,j}$ and $B_j$ being the Bt bit words of X, $S_i$ and B. This is a result of the succession of the following p computations made in the coprocessor 4:
A1) $X'_1\,X_0 = S_{i,0} + A_i * B_0 + 0$
A2) $X'_2\,X_1 = S_{i,1} + A_i * B_1 + X'_1$ . . .
Ap-1) $X'_{p-1}\,X_{p-2} = S_{i,p-2} + A_i * B_{p-2} + X'_{p-2}$
Ap) $X_p\,X_{p-1} = S_{i,p-1} + A_i * B_{p-1} + X'_{p-1}$
$X'_1$ to $X'_{p-1}$ are Bt bit words of intermediate computation that remain permanently in the coprocessor 4.
B) $Y_0 = (X * J_0) \bmod 2^{Bt}$ for providing $Y_0 = (X_p \ldots X_0 * J_0) \bmod 2^{Bt}$, by the following computation made in the coprocessor 4: $Y'_1\,Y_0 = X_0 * J_0 + 0$. The least significant word $Y_0$ is the only one of interest.
C) $Z = X + N * Y_0$ for providing $Z_p \ldots Z_0 = X_p \ldots X_0 + Y_0 * N_{p-1} \ldots N_0$. $Z_j$, $X_j$ and $N_j$ are the Bt bit words of Z, X and N using the following succession of p+1 computations made in the coprocessor 4:
C1) $Z'_1\,Z_0 = X_0 + Y_0 * N_0 + 0$
C2) $Z'_2\,Z_1 = X_1 + Y_0 * N_1 + Z'_1$ . . .
Cp-1) $Z'_{p-1}\,Z_{p-2} = X_{p-2} + Y_0 * N_{p-2} + Z'_{p-2}$
Cp) $Z'_p\,Z_{p-1} = X_{p-1} + Y_0 * N_{p-1} + Z'_{p-1}$
Cp+1) $Z_p = X_p + 0 * 0 + Z'_p$
$Z'_1$ to $Z'_p$ are Bt bit words of intermediate computation that remain permanently in the coprocessor 4.
D) $S_{i+1} = Z \backslash 2^{Bt}$, where $\backslash$ is an integer division. If $S_{i+1}$ is greater than N, then N is subtracted from $S_{i+1}$.
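Steps A and C above can be sketched word by word in Python. This is a minimal sketch, assuming an illustrative word size of Bt=8. The helper names (`mul_add`, `to_words`, `from_words`, `step_a`, `step_c`) are hypothetical and model only the arithmetic of the elementary operation S=A*B+C+D, not the serial hardware. The sketch also returns any overflow out of step Cp+1 separately, since $Z_p$ is a single Bt bit word.

```python
BT = 8                      # illustrative word size (assumption)
MASK = (1 << BT) - 1


def mul_add(a, b, c, d):
    """Elementary operation S = A*B + C + D on Bt bit words.

    The sum never exceeds 2*Bt bits; returns (high word, low word).
    """
    s = a * b + c + d
    return s >> BT, s & MASK


def to_words(x, count):
    """Split x into `count` Bt bit words, least significant first."""
    return [(x >> (j * BT)) & MASK for j in range(count)]


def from_words(words):
    """Inverse of to_words."""
    return sum(w << (j * BT) for j, w in enumerate(words))


def step_a(s_words, a_i, b_words):
    """Steps A1..Ap: X = S_i + A_i*B; returns the p+1 words X_0..X_p."""
    x_words, carry = [], 0           # carry plays the role of X'_j
    for s_j, b_j in zip(s_words, b_words):
        carry, low = mul_add(a_i, b_j, s_j, carry)
        x_words.append(low)
    x_words.append(carry)            # X_p is the high word of step Ap
    return x_words


def step_c(x_words, y0, n_words):
    """Steps C1..Cp+1: Z = X + Y0*N; returns (words Z_0..Z_p, overflow)."""
    z_words, carry = [], 0           # carry plays the role of Z'_j
    for x_j, n_j in zip(x_words, n_words):   # p steps (n_words has p words)
        carry, low = mul_add(y0, n_j, x_j, carry)
        z_words.append(low)
    # Step Cp+1: Z_p = X_p + 0*0 + Z'_p
    overflow, low = mul_add(0, 0, x_words[-1], carry)
    z_words.append(low)
    return z_words, overflow
```

For p=4, `step_a(to_words(S, 4), a_i, to_words(B, 4))` yields five words whose value, recombined with `from_words`, equals `S + a_i * B`.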
An object of the invention is to improve the computation time by eliminating the computation identified as Cp+1 by creating a new S=A*B+C type operation, with S and C encoded on 2*Bt bits and A and B encoded on Bt bits. To carry out this new operation, an overflow storage flip-flop circuit has been added to store a possible overflow at the end of an elementary computation and reinsert the overflow, if any, during the next computation.
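The need for overflow storage follows from the operand sizes: with C encoded on 2*Bt bits, A*B+C can exceed 2*Bt bits by one bit. A minimal Python sketch of this bound is given below. The function name is hypothetical, and the sketch only shows the overflow bit being produced; it does not model how or when the flip-flop reinserts it.

```python
def mul_add_wide(a: int, b: int, c: int, bt: int):
    """S = A*B + C with A, B on bt bits and C, S on 2*bt bits.

    Returns (overflow_bit, s), where s is the 2*bt bit result and
    overflow_bit is the value an overflow flip-flop would have to hold.
    """
    total = a * b + c
    # (2^bt - 1)^2 + (2^(2*bt) - 1) < 2^(2*bt + 1), so overflow is 0 or 1.
    return total >> (2 * bt), total & ((1 << (2 * bt)) - 1)
```

With Bt=8, the largest operands give `mul_add_wide(0xFF, 0xFF, 0xFFFF, 8) == (1, 0xFE00)`, i.e. a single overflow bit.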
Another object of the invention is to provide a computation circuit to carry out an operation A*B+C. A and B are integers encoded on at most m*k bits. C is an integer encoded on at most 2*m*k bits, with m and k being non-zero integers. The computation circuit includes first, second and third (m*k) bit registers for storing data. A fourth k bit register stores a data element. A first multiplication circuit carries out operations of multiplication between the data elements of the first and fourth registers. Addition means carry out an addition of the data elements of the second and third registers with the result provided by the multiplication circuit. Storage means store a carry value, if any, resulting from an overflow of the addition. Linking means store an intermediate result provided by the addition means in the second and third registers. The linking means also provide the carry value stored during a previous addition to the addition means. This is done to add the carry value in the place of the least significant word which is to be added as soon as the least significant word has been added.
According to one approach, the computation circuit comprises a fifth (m*k) bit register to successively provide k bit words to the fourth register.
The invention also provides that the performance of the same elementary operations is obtained by using the two multipliers in parallel to divide the computation time by two. The computation circuit comprises a second multiplication circuit for the performance, simultaneously with the first multiplication circuit, of the multiplication of the data element of the first register with a data element of a sixth k bit register. The addition means or adder carries out the addition, with a k bit shift, of the result provided by the second multiplication circuit.
The invention also relates to a modular arithmetic coprocessor that implements modular operations on numbers encoded on m*k bits, with m and k being integers, and that includes the previously defined computation circuit. More generally, the invention relates to a modular computation device including a processor, a memory, and the coprocessor disclosed herein.
Furthermore, another object of the invention is to provide a method for the computation of A*B+C. A and B are integers encoded on at most m*k bits. C is an integer encoded on at most 2*m*k bits, with m and k being non-zero integers. In a multiplication circuit, a data element of a first (m*k) bit register is multiplied by a data element of a fourth k bit register. Data elements of a second (m*k) bit register and a third (m*k) bit register are added with the result provided by the multiplication circuit. A carry value, if any, resulting from an overflow of the addition is stored. An intermediate result is stored in the second and third registers. The previous operations are repeated while changing the data element of the fourth register and adding the stored carry value, if any, in the place of the least significant word to be added as soon as the least significant word has been added.
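The arithmetic behind the k bit shift can be illustrated with a short Python sketch. It is a sketch under stated assumptions, not a description of the circuit: the function name is hypothetical, unbounded integers stand in for the registers, and the two k bit words `w0` and `w1` stand for the contents of the fourth and sixth registers during one pass.

```python
def mul_add_two_multipliers(a: int, b: int, c: int, k: int, m: int) -> int:
    """Compute A*B + C by consuming two k bit words of B per pass.

    Assumes b < 2^(m*k). Each pass models two multipliers working in
    parallel, so about m/2 passes are needed instead of m.
    """
    mask = (1 << k) - 1
    acc = c
    for j in range(0, m, 2):
        w0 = (b >> (j * k)) & mask          # word for the first multiplier
        w1 = (b >> ((j + 1) * k)) & mask    # word for the second multiplier
        # The second product is added with a k bit shift.
        partial = a * w0 + ((a * w1) << k)
        acc += partial << (j * k)
    return acc
```

The result equals `a * b + c` because each pair of words contributes `a * (w0 + (w1 << k))` at weight `2^(j*k)`.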
In one embodiment, an operand is stored entirely in a fifth (m*k) bit register to provide the operand successively to the fourth register. To divide the time needed to perform the method by two, a second multiplication is performed in parallel. The result of this multiplication is added with a k bit shift.
More generally, the invention relates to a method for the computation of modular operations on operands of a size greater than m*k bits in which the operands are processed in m*k bit words by using the method of the invention.