The invention relates to the field of microprocessors, and, more particularly, to a modular arithmetic coprocessor.
The Montgomery method makes it possible to carry out modular calculations in a finite field (or Galois field) denoted as GF(2^n), without the performance of divisions. Conventionally, modular operations on GF(2^n) are used in cryptography for applications such as authentication of messages, identification of a user, and exchange of cryptographic keys. Exemplary applications are described in the French Patent No. 2,679,054.
There are commercially available integrated circuits dedicated to such applications. These include, for example, the product referenced as ST16CF54, which is manufactured by SGS-THOMSON MICROELECTRONICS. This product is built around a central processing unit and an arithmetic coprocessor, and is dedicated to the performance of modular computations. The coprocessor enables the processing of modular multiplication operations by using the Montgomery method. Further information on this coprocessor can be found in U.S. Pat. No. 5,513,133.
The basic operation, called a Pfield operation, is implemented by this coprocessor. Three binary data elements A (multiplicand), B (multiplier) and N (modulo), each encoded on a whole number n of bits, are used to produce a binary data element denoted as Pfield(A, B)_N, which is encoded on n bits such that Pfield(A, B)_N = A*B*I mod N. I is a binary data element, called an error, which is encoded on n bits such that I = 2^{-n} mod N. More specifically, the value of I depends on the number of k bit blocks considered for the encoding of A, with k being an integer. To perform the operation A*B*I mod N, the data elements are assumed to have been encoded on m words of k bits, with m and k being integers and with m*k = n.
In the referenced U.S. Pat. No. 5,513,133, the coprocessor operates with k=32 and m=8 or 16. The coprocessor may be used to produce the result of the modular multiplication A*B mod N. The modular multiplication can be subdivided into two successive elementary Pfield operations: Pfield(Pfield(A, B)_N, H)_N is computed, with H being a data element encoded on n bits, called an error correction parameter, which is equal to 2^{2n} mod N. For further details on the implementation of modular multiplication, reference may be made to the above referenced U.S. patent.
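As an illustration only, the subdivision of a modular multiplication into two Pfield operations can be modeled in software as follows. This is a minimal sketch and not the coprocessor's implementation: the function names pfield and modmul are hypothetical, N is assumed to be odd so that 2^{-n} mod N exists, and Python 3.8 or later is assumed for pow with a negative exponent.

```python
def pfield(a, b, n_mod, n_bits):
    """Reference model of Pfield(A, B)_N = A*B*I mod N, with I = 2^(-n) mod N.

    Assumes n_mod is odd, so that 2^n is invertible modulo n_mod.
    """
    i_err = pow(2, -n_bits, n_mod)  # the "error" I
    return (a * b * i_err) % n_mod


def modmul(a, b, n_mod, n_bits):
    """A*B mod N obtained from two successive Pfield operations."""
    h = pow(2, 2 * n_bits, n_mod)   # error correction parameter H = 2^(2n) mod N
    return pfield(pfield(a, b, n_mod, n_bits), h, n_mod, n_bits)
```

The first call leaves a factor I in the result. The second call multiplies by H*I = 2^{2n} * 2^{-n} = 2^n mod N, which cancels that factor.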
The coprocessor disclosed in the referenced U.S. patent optimizes the processing of modular operations, in terms of computing duration, memory size, etc., for a fixed data size, e.g., 256 or 512 bits. Cryptography requires machines with increasingly high performance levels, operating at increasingly high speeds and using increasingly complex cryptographic keys. The trend is towards the manipulation of data elements encoded on 768, 1024, 1536 and even 2048 bits. To process data elements of this size, it may be necessary to form larger-sized circuits by adapting the elements of the circuit to the sizes of the data. This approach may raise problems in applications such as chip cards, wherein the size of the circuit is physically limited because of differences in mechanical bending stresses between the cards and the silicon substrates. Furthermore, it is becoming increasingly necessary to integrate ever larger numbers of different functional elements in a card of this kind. The space available for an encryption circuit is thereby correspondingly reduced. Approaches, therefore, need to be found to limit the increase in the size of this circuit while, at the same time, enabling optimum operation for data elements with a size greater than the size of the initially planned registers.
To carry out modular operations using operands with a size greater than that managed by the coprocessor (in practice, the maximum size is equal to the size of the registers), it is possible to use the circuit 1 shown in FIG. 1. Circuit 1 includes a standard processor 2 (8, 16 or 32 bits), a memory 3, a coprocessor 4 and a communications bus 5 used to connect the different elements 2, 3 and 4 to one another and/or to elements external to the circuit 1. In circuit 1 of FIG. 1, the coprocessor 4 is used as a multiplier operating on m*k bits, which is conventionally 256 or 512 bits. The processor 2 is used, in particular, to supervise the operations to be performed according to a particular encryption algorithm, and the data exchanges between the memory 3 and the coprocessor 4.
Performance of the basic operation of modular computations according to the Montgomery method, known as the Pfield operation, is based upon three binary data elements. These binary data elements are A (multiplicand), B (multiplier) and N (modulo), which are encoded on a whole number of n bits. They are used for the production of a binary data element referenced as P(A, B)_N, which is encoded on n bits such that P(A, B)_N = A*B*I mod N. I is an error due to the Montgomery method. Should n have a size greater than the size of the registers, namely m*k, it is appropriate to subdivide n into p words of Bt bits. Bt is a working base with a size smaller than or equal to m*k, e.g., m*k. The Montgomery method operates as follows. The variable i is an index varying from 0 to m-1, and the following computation loop is repeated m times:
X = S_i + A_i*B,
Y_0 = (X*J_0) mod 2^{Bt},
Z = X + (N*Y_0),
S_{i+1} = Z \ 2^{Bt},
where \ is a whole number division, and if S_{i+1} is greater than N, then N is subtracted from S_{i+1},
A_i corresponds to a word of Bt bits for the breakdown of A, and
S_i corresponds to an updated result of the Pfield operation, and S_m = P(A, B)_N = A*B*I mod N.
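The loop above can be sketched in software using arbitrary-precision integers. This is an illustrative model under stated assumptions, not the patented circuit. The name montgomery_pfield and the parameter words (the number of Bt bit words of A that are processed) are hypothetical. N is assumed to be odd, A and B are assumed to be smaller than N, and S_0 is assumed to be zero. J_0 is not defined in the text above. The sketch uses the usual Montgomery constant for which (N*J_0 + 1) mod 2^{Bt} = 0. The subtraction is applied when S_{i+1} is greater than or equal to N, so that the result stays below N.

```python
def montgomery_pfield(a, b, n_mod, bt, words):
    """Word-serial Montgomery product: returns A*B*2^(-bt*words) mod N.

    Assumes n_mod is odd and 0 <= a, b < n_mod (Python 3.8+ for pow(x, -1, m)).
    """
    base = 1 << bt
    j0 = (-pow(n_mod, -1, base)) % base   # assumed: (N*J0 + 1) mod 2^Bt == 0
    s = 0                                 # assumed initial value S_0 = 0
    for i in range(words):
        a_i = (a >> (i * bt)) & (base - 1)  # A_i: i-th Bt bit word of A
        x = s + a_i * b                     # X = S_i + A_i*B
        y0 = (x * j0) % base                # Y_0 = (X*J_0) mod 2^Bt
        z = x + n_mod * y0                  # Z = X + N*Y_0, a multiple of 2^Bt
        s = z >> bt                         # S_{i+1} = Z \ 2^Bt (exact division)
        if s >= n_mod:
            s -= n_mod
    return s
```

With bt*words = n, the returned value equals A*B*I mod N with I = 2^{-n} mod N.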
A computation method of this kind requires a large number of data exchanges between the coprocessor 4 and the memory 3, and requires the memory 3 to be sized as a function of the data elements to be stored during intermediate computations. The European Patent 793,165, which corresponds to U.S. patent application having Ser. No. 08/806,456, explains how to perform non-modular multiplication operations and operations of the type S=A*B+C or S=A*B+C+D. A, B, C and D are encoded on Bt bits with Bt equal at most to m*k. S is encoded on 2*Bt bits.
If the Montgomery algorithm set up by means of elementary operations of the S=A*B+C+D type is developed, then the following loop repetition is obtained.
A) Computation of X = S_i + A_i*B for providing X_p . . . X_0 = S_{i,p-1} . . . S_{i,0} + A_i*B_{p-1} . . . B_0, with X_j, S_{i,j} and B_j being the Bt bit words of X, S_i and B. This results from the following succession of p computations made in the coprocessor 4.
A1) X'_1 X_0 = S_{i,0} + A_i*B_0 + 0
A2) X'_2 X_1 = S_{i,1} + A_i*B_1 + X'_1
. . .
Ap-1) X'_{p-1} X_{p-2} = S_{i,p-2} + A_i*B_{p-2} + X'_{p-2}
Ap) X_p X_{p-1} = S_{i,p-1} + A_i*B_{p-1} + X'_{p-1}, with X'_1 to X'_{p-1} being Bt bit words of intermediate computation that remain permanently in the coprocessor 4.
B) Y_0 = (X*J_0) mod 2^{Bt}, for providing Y_0 = (X_p . . . X_0*J_0) mod 2^{Bt} by the following computation made in the coprocessor 4: Y'_1 Y_0 = 0 + X_0*J_0 + 0. The least significant word Y_0 is the only one of interest.
C) Z = X + N*Y_0, for providing Z_p . . . Z_0 = X_p . . . X_0 + Y_0*N_{p-1} . . . N_0. Z_j, X_j and N_j are the Bt bit words of Z, X and N. This is done using the following succession of p+1 computations made in the coprocessor 4.
C1) Z'_1 Z_0 = X_0 + Y_0*N_0 + 0
C2) Z'_2 Z_1 = X_1 + Y_0*N_1 + Z'_1
. . .
Cp-1) Z'_{p-1} Z_{p-2} = X_{p-2} + Y_0*N_{p-2} + Z'_{p-2}
Cp) Z'_p Z_{p-1} = X_{p-1} + Y_0*N_{p-1} + Z'_{p-1}
Cp+1) Z_p = X_p + 0*0 + Z'_p. Z'_1 to Z'_p are Bt bit words of intermediate computation that remain permanently in the coprocessor 4.
D) S_{i+1} = Z \ 2^{Bt}, where \ is an integer division. If S_{i+1} is greater than N, then N is subtracted from S_{i+1}.
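Steps A to D can be sketched at the word level with a single elementary operation of the S = A*B + C + D type. The sketch below is an illustrative software model only. The helper names mul_add, to_words, from_words and pfield_words are hypothetical. N is assumed to be odd, and A and B are assumed to be smaller than N. J_0 is taken to be the usual Montgomery constant for which (N*J_0 + 1) mod 2^{Bt} = 0, which is an assumption. Step D is carried out on a plain integer for brevity, with the subtraction applied when S_{i+1} is greater than or equal to N.

```python
def mul_add(a, b, c, d, bt):
    """Elementary operation S = A*B + C + D on Bt bit words; returns (high, low)."""
    s = a * b + c + d                       # always fits in 2*Bt bits
    return s >> bt, s & ((1 << bt) - 1)

def to_words(value, bt, count):
    """Split an integer into `count` Bt bit words, least significant first."""
    mask = (1 << bt) - 1
    return [(value >> (j * bt)) & mask for j in range(count)]

def from_words(words, bt):
    """Rebuild an integer from Bt bit words, least significant first."""
    return sum(w << (j * bt) for j, w in enumerate(words))

def pfield_words(a, b, n_mod, bt, p):
    """A*B*2^(-bt*p) mod N, built only from mul_add calls for steps A to C."""
    base = 1 << bt
    j0 = (-pow(n_mod, -1, base)) % base     # assumed Montgomery constant J_0
    b_w, n_w = to_words(b, bt, p), to_words(n_mod, bt, p)
    s_w = [0] * p                           # S_0 = 0 (assumption)
    for a_i in to_words(a, bt, p):
        # Step A: X = S_i + A_i*B; the carry plays the role of X'_j
        x, carry = [], 0
        for j in range(p):
            carry, low = mul_add(a_i, b_w[j], s_w[j], carry, bt)
            x.append(low)
        x.append(carry)                     # X_p
        # Step B: Y'_1 Y_0 = 0 + X_0*J_0 + 0; the high word Y'_1 is discarded
        _y1_unused, y0 = mul_add(x[0], j0, 0, 0, bt)
        # Step C: Z = X + Y_0*N; the carry plays the role of Z'_j
        z, carry = [], 0
        for j in range(p):
            carry, low = mul_add(y0, n_w[j], x[j], carry, bt)
            z.append(low)
        top, z_p = mul_add(0, 0, x[p], carry, bt)   # step Cp+1, plus overflow bit
        # Step D: drop Z_0 (which is zero), then reduce once if needed
        s = from_words(z[1:] + [z_p, top], bt)
        if s >= n_mod:
            s -= n_mod
        s_w = to_words(s, bt, p)
    return from_words(s_w, bt)
```

Each call to mul_add stands for one exchange of operands with the coprocessor 4. Step B is the only call whose high word is computed and then thrown away.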
The computation of Y_0 = (X*J_0) mod 2^{Bt} is done by a non-modular operation which computes Y'_1 Y_0. The modulo 2^{Bt} operation is then performed by abandoning the most significant Bt bit word Y'_1. The implementation of the computation of Y_0 amounts to carrying out the unnecessary computation of Y'_1.
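A short software check makes this point concrete. It is a sketch with hypothetical names, not part of the described circuit. The full double-width product is formed, the high word Y'_1 is returned only to show that it is never used, and the low word agrees with (X*J_0) mod 2^{Bt} for any multi-word X whose least significant word is X_0.

```python
def y0_with_unused_high_word(x0, j0, bt):
    """Non-modular product Y'_1 Y_0 = X_0*J_0; returns (Y'_1, Y_0)."""
    product = x0 * j0                      # 2*Bt bit result
    y1_unused = product >> bt              # most significant word, later abandoned
    y0 = product & ((1 << bt) - 1)         # least significant word, the one kept
    return y1_unused, y0
```

Only the least significant word of X influences Y_0, because the higher words of X contribute multiples of 2^{Bt} to the product.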
An object of the invention is to implement the operation Y_0 = (X*J_0) mod 2^{Bt} directly within a coprocessor 4 to eliminate the need for a register of Bt = m*k bits within the coprocessor 4. This register would otherwise be needed to store a data element during the computation of Y_0.
Another object of the invention is to provide a method for implementing a modular operation of the type S = A*B mod 2^{m*k}. A, B and S are encoded at most on m words of k bits. The integers m and k are greater than 1. A is stored successively by k bit words in a first k bit register, and B is stored in a second m*k bit register. A first multiplication circuit is connected to the first and second registers. The first multiplication circuit performs successive operations of the multiplication of B by each word of A stored in the first register. The word of A contained in the first register is changed at each multiplication. In a third m*k bit register, there are stored the k bit words of S obtained successively at the end of each multiplication, and the words of an intermediate result are also stored. The intermediate result corresponds to the least significant words of the result of the multiplication of B by the word of A, to which a previous intermediate result has been added in a first addition circuit.
Preferably, m successive operations of multiplication are performed with i as an integer index varying from 0 to m-1. R_i is an intermediate result equal to A_i*B_{m-1-i} . . . B_0 + (R_{i-1} \ 2^k), where \ is a whole number division. A_i is the k bit word with the place value i of A. B_{m-1-i} . . . B_0 corresponds to the m-i least significant words of B. R_{-1} is equal to zero. At the end of the ith multiplication, the third register contains, firstly, the i+1 least significant k bit words of S, and secondly, the m-i least significant words of R_i.
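The recurrence on R_i can be modeled in software as follows. This is a minimal sketch, not a description of the registers themselves. The function name mul_mod_2mk is hypothetical, the least significant word of each R_i is taken as the word of S with place value i, and R_i is truncated to its m-i least significant words after each step.

```python
def mul_mod_2mk(a, b, m, k):
    """S = A*B mod 2^(m*k), using one k bit word of A per multiplication."""
    word_mask = (1 << k) - 1
    r = 0                                   # R_{-1} = 0
    s = 0
    for i in range(m):
        a_i = (a >> (i * k)) & word_mask    # A_i: word of A with place value i
        keep = (1 << ((m - i) * k)) - 1     # mask for the m-i least significant words
        b_low = b & keep                    # B_{m-1-i} ... B_0
        r = (a_i * b_low + (r >> k)) & keep # R_i, truncated to m-i words
        s |= (r & word_mask) << (i * k)     # least significant word of R_i is S_i
    return s
```

The operand taken from B shrinks by one word at each step, because the words of A_i*B above place value m-1-i can only affect bits of the product at or beyond position m*k.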
First and second embodiments accelerate operation of the method of the invention. In the first embodiment, A is stored successively in k bit words in a fourth k bit register. In a second multiplication circuit connected to the second and fourth registers, successive operations are performed for the multiplication of B by each word of A stored in the fourth register. The word of A contained in the fourth register is changed for each multiplication operation. In a second addition circuit having a delay, the result output from the second multiplication circuit is added with the result output from the first addition circuit before storing in the third register.
In the second embodiment, the size of the second and third registers is made to vary between each new multiplication performed by the first multiplication circuit. Preferably, m/2 successive multiplication operations are performed in each of the first and second multiplication circuits. An integer i varies from 0 to (m/2)-1. R_i is an intermediate result equal to A_{2*i+1} A_{2*i}*B_{m-1-2*i} . . . B_0 + (R_{i-1} \ 2^{2*k}), where \ is a whole number division. A_{2*i} and A_{2*i+1} are the k bit words with place values 2*i and 2*i+1 of A. B_{m-1-2*i} . . . B_0 corresponds to the m-2*i least significant words of B. R_{-1} is equal to zero. At the end of the ith multiplication, the third register contains, firstly, the 2*i+2 least significant k bit words of S, and secondly, the m-2*i least significant words of R_i.
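The variant that consumes two words of A per step can be sketched in the same way. This is an illustrative model under stated assumptions: m is assumed to be even, the function name mul_mod_2mk_pairs is hypothetical, and the delay in the second addition circuit is interpreted here as a shift of k bits applied to the product formed from the more significant word of the pair.

```python
def mul_mod_2mk_pairs(a, b, m, k):
    """S = A*B mod 2^(m*k), using two k bit words of A per step (m even)."""
    if m % 2 != 0:
        raise ValueError("this sketch assumes an even number of words m")
    word_mask = (1 << k) - 1
    pair_mask = (1 << (2 * k)) - 1
    r = 0                                            # R_{-1} = 0
    s = 0
    for i in range(m // 2):
        a_lo = (a >> (2 * i * k)) & word_mask        # A_{2*i}
        a_hi = (a >> ((2 * i + 1) * k)) & word_mask  # A_{2*i+1}
        keep = (1 << ((m - 2 * i) * k)) - 1          # m-2*i least significant words
        b_low = b & keep                             # B_{m-1-2*i} ... B_0
        first = a_lo * b_low                         # first multiplication circuit
        second = (a_hi * b_low) << k                 # second circuit, delayed by k bits
        r = (first + second + (r >> (2 * k))) & keep # R_i, truncated
        s |= (r & pair_mask) << (2 * i * k)          # two new words of S
    return s
```

Each step yields two words of S, so the number of passes over B is halved compared with the single-word version.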
To further limit exchanges with the memory, A is stored beforehand in a fifth m*k bit shift register. This makes it possible to supply the first or fourth register with the k bit words of A to be loaded.
The invention also relates to a modular arithmetic coprocessor comprising a circuit for the implementation of a modular operation of the S = A*B mod 2^{m*k} type. A, B and S are encoded on at most m words of k bits, with m and k being integers greater than 1. The implementation circuit includes a first k bit register for storing A successively by k bit words. A second m*k bit register is provided for storing B. A first multiplication circuit is connected to the first and second registers for performing successive operations of multiplication of B by each word of A stored in the first register. The word of A contained in the first register is changed during each multiplication. A third m*k bit register stores the k bit words of S obtained successively at the end of each multiplication. The words of an intermediate result correspond to the least significant words of the result of the multiplication of B by the word of A, wherein a previous intermediate result has been added in a first addition circuit.
The embodiments of the method take the form of the addition of a fourth k bit register to store the k bit words of A successively. A second multiplication circuit is connected to the second and fourth registers for performing successive operations of multiplication of B by each word of A stored in the fourth register. The word of A contained in the fourth register is changed during each multiplication. A second addition circuit having a delay adds the result output from the second multiplication circuit with the result output from the first addition circuit before storing in the third register. The second and third registers have a variable size, and their size varies during the implementation of a modular operation of the S = A*B mod 2^{m*k} type. Furthermore, the implementation circuit may include a fifth m*k bit shift register for supplying the first or fourth register with the k bit words of A to be loaded.
More generally, the invention relates to an integrated circuit implementing modular operations according to the Montgomery method that comprises a processor, a memory and a coprocessor of the invention.