1. Field of the Invention
The invention relates to a modular arithmetic coprocessor comprising two multiplication circuits working in parallel. More specifically, the invention relates to the improvement of a known arithmetic coprocessor enabling the performance of modular operations according to the Montgomery method in order to extend the applications of this coprocessor. The Montgomery method performs modular computations in a finite field denoted GF(2.sup.n) without the performance of divisions.
2. Description of the Prior Art
Conventionally, modular operations on GF(2.sup.n) are used in cryptography for applications such as authentication of messages, identification of a user and exchange of keys. Such applications are described, for example, in the French patent application published under No. 2 679 054.
There are commercially available integrated circuits dedicated to such applications. These include, for example, the device manufactured by SGS-THOMSON MICROELECTRONICS S.A. as model number ST16CF54, built around an association of a central processing unit and an arithmetic coprocessor, and dedicated to performing modular computations. The coprocessor used enables processing of the modular operations by the use of the Montgomery method. It is the subject of European patent application No. 0 601 907 A2, hereinafter called the document D2, which is incorporated herein by reference. This coprocessor is illustrated in FIG. 1 (this figure corresponds to FIG. 2 of the document D2).
The basic operation (called a P.sub.field operation) implemented by this coprocessor consists of producing, on the basis of three pieces of binary data A (multiplicand), B (multiplier) and N (modulus), each encoded on an integer number n of bits, a piece of binary data denoted P.sub.field (A, B).sub.N encoded on n bits, such that P.sub.field (A, B).sub.N =A * B * I mod N. I is a piece of binary data called an error and encoded on n bits such that I=2.sup.-n mod N (more specifically, the value of I depends on the number of k-bit blocks, with k as an integer, considered for the encoding of A). To carry out the operation A * B * I mod N, it is assumed that the pieces of data are encoded on m words of k bits (m and k being integers), with m * k=n. The words of the pieces of data A and B are given to a multiplication circuit 19 having a series input to receive B, a parallel input to receive the k-bit blocks of A, and a series output.
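The definition of the P.sub.field operation can be checked numerically with a minimal sketch such as the one below. It is only a reference model of the formula A * B * I mod N, not a description of the circuit; the function name `pfield_reference` and its parameter names are hypothetical, and the modular inverse relies on Python 3.8 or later.

```python
def pfield_reference(a: int, b: int, n_mod: int, n_bits: int) -> int:
    """Return A * B * I mod N, with I = 2**(-n) mod N.

    Reference model only: N must be odd so that 2**n is invertible modulo N.
    """
    # I is the modular inverse of 2**n modulo N (negative exponent: Python 3.8+).
    i_const = pow(2, -n_bits, n_mod)
    return (a * b * i_const) % n_mod


# Multiplying the result back by 2**n (mod N) recovers the plain product A * B mod N.
result = pfield_reference(100, 200, 239, 8)
recovered = (result * 2**8) % 239
```

The last two lines illustrate why I is called an error: the P.sub.field result differs from A * B mod N by exactly the factor 2.sup.-n.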
In the coprocessor described in the document D2, k=32 and m=8 or 16. This device may be used to produce the result of the modular multiplication A * B mod N. The modular multiplication can be broken down into two successive P.sub.field elementary operations: P.sub.field (P.sub.field (A,B).sub.N, H).sub.N is computed, with H as a piece of data encoded on n bits, called an error correction parameter, and equal to 2.sup.2n mod N. For further details on the implementation of the modular multiplication, reference may be made to the document D2. Several possibilities of computation are already known. They consist in using either a software method or a specialized circuit such as the one illustrated in the document D2.
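The breakdown into two successive P.sub.field operations can be illustrated as follows. This is a sketch under the assumption that P.sub.field is modeled directly by its formula; the names `pfield` and `modmul_via_pfield` are hypothetical.

```python
def pfield(a: int, b: int, n_mod: int, n_bits: int) -> int:
    """Model of P_field(A, B)_N = A * B * 2**(-n) mod N (N odd, Python 3.8+)."""
    return (a * b * pow(2, -n_bits, n_mod)) % n_mod


def modmul_via_pfield(a: int, b: int, n_mod: int, n_bits: int) -> int:
    """Compute A * B mod N as P_field(P_field(A, B)_N, H)_N with H = 2**(2n) mod N."""
    h = pow(2, 2 * n_bits, n_mod)          # error correction parameter H
    first = pfield(a, b, n_mod, n_bits)    # A * B * 2**(-n) mod N
    return pfield(first, h, n_mod, n_bits) # the two factors 2**(-n) cancel against H
```

The first operation introduces a factor 2.sup.-n, the second introduces another one, and H=2.sup.2n mod N cancels both.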
The coprocessor 4a illustrated in FIG. 1 comprises:
three shift registers 10, 11 and 12 with series input and output, these registers having a number n of cells, with n=m * k,
multiplexers 13, 14 and 15 placed respectively before the registers 10, 11 and 12,
three registers 16, 17 and 18 with series input and parallel output, and comprising k cells;
two multiplication circuits 19 and 20, comprising one series input, one parallel input and one series output;
two registers 21 and 22 with k cells;
multiplexers 24, 25, 26, 36, 37 and 38;
a demultiplexer 39;
series subtraction circuits 27, 28 and 29;
series addition circuits 30 and 31;
delay cells 32, 33 and 34 to delay the propagation of pieces of binary data by k cycle periods;
a comparison circuit 35.

the software computation of a parameter J.sub.0 defined by the relationship [(N * J.sub.0)+1] mod 2.sup.k =0 and the serial loading of the parameter J.sub.0 into the register 17,
the serial loading of B into the register 10, the serial loading of N into the register 12, the serial loading of A.sub.0 into the register 16 and the serial loading of n consecutive logic zeros into the register 11,
the initialization of the two multiplication circuits 19 and 20, the three subtraction circuits 27, 28 and 29, and the two addition circuits 30 and 31.

the parallel loading into the register 21 of the contents of the register 16,
the performance of the different elementary operations in order to perform the following computations:
X(i)=S(i-1)+B * A.sub.i-1
Y.sub.0 (i)=[X(i) * J.sub.0 ] mod 2.sup.k
Z(i)=X(i)+(N * Y.sub.0 (i))
S(i)=Z/2.sup.k, / being the integer division,

the loading of k logic zeros into the register 17, the loading of B into the register 10, the loading of n logic zeros into the registers 11 and 12, the loading of the word A.sub.0 of k bits of A into the register 16,
the initialization of the multiplication circuit 19, the initialization of the addition and subtraction circuits 28, 30 and 31.

the loading of the contents of the register 16 into the register 21,
the performance, by a simultaneous rightward shift of the registers 10 and 11, of the multiplication of A.sub.i-1 by B and of the addition of the contents of the register 11 with the result of the product,
the storage of the k least significant bits into the register 12 by a k-bit rightward shift, and the storage of the n most significant bits of the result in the register 11,
the loading of the word A.sub.i into the register 16 (preferably during the running of one or more of the preceding steps).

computation of R[1].sub.0 +R[0].sub.1 and storage of the result referenced X[1],
computation of c.sub.1 +R[1].sub.1 +R[0].sub.2 (c.sub.1 being the carry value of the previous calculation) and storage of the result referenced X[2],
computation of c.sub.q-2 +R[1].sub.q-2 +R[0].sub.q-1 and storage of the result referenced X[q-1],
computation of c.sub.q-1 +R[1].sub.q-1 and storage of the result referenced X[q].

computation of T[1].sub.0 +T[0].sub.1 and storage of the result referenced U[1],
computation of c.sub.1 +T[1].sub.1 +T[0].sub.2 (c.sub.1 being the carry value of the previous computation) and storage of the result referenced U[2],
computation of c.sub.q-2 +T[1].sub.q-2 +T[0].sub.q-1 and storage of the result referenced U[q-1],
computation of c.sub.q-1 +T[1].sub.q-1 and storage of the result referenced U[q].

Computation of U+X and storage of the result, referenced Z.

storage of S(1)=Z/2.sup.k -(N or 0).

computation of R[1].sub.0 +R[0].sub.1 and storage of the result referenced W[1],
computation of c.sub.1 +R[1].sub.1 +R[0].sub.2 (c.sub.1 being the carry value of the previous calculation) and storage of the result referenced W[2],
computation of c.sub.q-2 +R[1].sub.q-2 +R[0].sub.q-1 and storage of the result referenced W[q-1],
computation of c.sub.q-1 +R[1].sub.q-1 and storage of the result referenced W[q].

computation of W+S(1) which then gives the result of the multiplication X with X having the form X[q] X[q-1] . . . X[1] X[0] and W=W[q] W[q-1] . . . W[1] W[0] with W[0]=R[0].sub.0.

a first register, a second register, a third register, a fourth register and a fifth register,
at least one input terminal to receive binary pieces of data to be stored in these registers,
a first multiplication circuit that performs a multiplication operation between two pieces of data stored in the first and third registers,
a second multiplication circuit that performs a multiplication operation between two pieces of data stored in the first and fourth registers,
a first addition circuit that performs operations of addition between a piece of data stored in the second register and a piece of data produced by the first multiplication circuit,
a second addition circuit that performs an operation of addition between a piece of data produced by the first addition circuit and a piece of data given to the second addition circuit by the second multiplication circuit,
a delay cell to delay the supply to the second addition circuit of the piece of data given by the second multiplication circuit,
multiplexing means that selectively supplies, to inputs of the first addition circuit, the contents of the second register or a permanent logic state, the connection of an input of the second multiplication circuit to an output of the first register, the connection of the output of the first multiplication circuit to one of the registers and the supply to the second addition circuit of a piece of data produced by the first addition circuit or a permanent logic state.

loading the pieces of data A and B into first and second n-bit registers with series input and output, and loading the words A.sub.0 and A.sub.1 into third and fourth k-bit registers with series input and parallel output,
initializing first and second addition circuits and of first and second multiplication circuits,
selecting a first input of a first multiplexer so that it permanently supplies logic zeros to a first series input of the first addition circuit,
selecting an input of a second multiplexer so that the pieces of data produced by the second multiplication circuit are given with a delay of k clock strokes to a series input of the second addition circuit,
selecting inputs of a third and fourth multiplexers so as to connect an output of the first register to series inputs of the first and second multiplication circuits.

loading the contents of the third and fourth registers into fifth and sixth k-bit registers with parallel input and output, these outputs being connected to parallel inputs of the first and second multiplication circuits,
performing, by simultaneous rightward shifting of the contents of the first register and of a seventh n-bit register with series input and output, multiplication operations of the words A.sub.1 and A.sub.0 by the piece of data B, the pieces of data produced by the first and second multiplication circuits being encoded on n+k bits,
adding, in the first addition circuit, the bits produced by the first multiplication circuit with the bits given by the first multiplexer,
storing the k first bits produced by the first multiplication circuit in an eighth n-bit register with series input and output,
adding, in the second addition circuit, the n+k bits produced by the second multiplication circuit with the n most significant bits produced by the first multiplication circuit, these bits being complemented by k zeros,
storing, in the eighth register, of the k first bits produced by the second addition circuit and the storage, in the seventh register, of the following n bits,
during the above operations, transferring the words A.sub.3 and A.sub.2 into the third and fourth registers,
selecting a second input of the first multiplexer in order to connect the output of the seventh register to the first input of the first addition circuit.

loading the contents of the third and fourth registers into the fifth and sixth registers,
performing, by simultaneous rightward shifting of the contents of the first and seventh registers, multiplication operations of the words A.sub.2j-1 and A.sub.2j-2 by the piece of data B,
adding, in the first addition circuit, the bits produced by the first multiplication circuit with the contents of the seventh register,
storing the k first bits produced by the first addition circuit in the eighth register,
adding, in the second addition circuit, of the n+k bits produced by the second multiplication circuit with the n most significant bits produced by the first addition circuit complemented by k zeros to obtain an identical size for the pieces of data that are added up,
storing, in the eighth register, of the k first bits produced by the second addition circuit and the storage, in the seventh register, of the n following bits,
during the above operations, the transfer of the words A.sub.2j+1 and A.sub.2j into the third and fourth registers.
For further details on the arrangement of the different elements of the circuit with respect to one another, reference may be made to the document D2 and especially to FIGS. 2 and 3, and to the extracts from the description pertaining thereto: page 15, line 54 to page 16, line 13, and page 17, line 50 to page 18, line 55.
The use of the coprocessor 4a illustrated in FIG. 1 to carry out a P.sub.field (A,B).sub.N operation where A, B, and N are encoded on n bits in m words of k bits, N is an odd number and A is subdivided into m words A.sub.i-1 (with i as an integer index varying from 1 to m), includes the following steps:
1--The initialization of the circuit
2--The setting up of a loop indexed by i with i varying from 1 to m
the subtraction, during the following iteration, of N or 0 from S depending on whether S is greater than N or not.
3--The output of the result S(k) by means of an output terminal
For further details on the running of a method of this kind, reference may be made to the document D2 and more particularly to the following extracts: page 4--line 41 to page 6--line 17 and page 19--lines 7 to 49.
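The word-by-word loop of the P.sub.field operation, with the computations X(i), Y.sub.0 (i), Z(i) and S(i), can be modeled in software as in the sketch below. It is a behavioral model only, assuming A and B smaller than N and N odd; the function name `pfield_wordwise` is hypothetical, and the subtraction of N is performed immediately after each iteration rather than during the following one as the circuit does.

```python
def pfield_wordwise(a: int, b: int, n_mod: int, k: int, m: int) -> int:
    """Behavioral model of P_field(A, B)_N computed on m words of k bits.

    Returns A * B * 2**(-m*k) mod N, for odd N and A, B < N.
    """
    mask = (1 << k) - 1
    # J0 satisfies (N * J0 + 1) mod 2**k == 0, i.e. J0 = -N**(-1) mod 2**k.
    j0 = (-pow(n_mod, -1, 1 << k)) & mask
    s = 0                                        # S(0) = 0
    for i in range(1, m + 1):
        a_word = (a >> (k * (i - 1))) & mask     # word A_{i-1}
        x = s + b * a_word                       # X(i) = S(i-1) + B * A_{i-1}
        y0 = (x * j0) & mask                     # Y0(i) = [X(i) * J0] mod 2**k
        z = x + n_mod * y0                       # Z(i) = X(i) + N * Y0(i)
        s = z >> k                               # S(i) = Z(i) / 2**k (exact division)
        if s >= n_mod:                           # subtraction of N or 0
            s -= n_mod
    return s
```

The choice of J.sub.0 makes the k least significant bits of Z(i) equal to zero, which is why the division by 2.sup.k is a simple shift.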
Up to now, the device shown in FIG. 1 could be used to optimize processing operations (in terms of computation time, memory size, etc.) for modular operations using a fixed data size, in this case 256 or 512 bits (depending on whether m is equal to 8 or 16). Now, cryptography requires increasingly efficient machines working at ever-higher speeds and using ever-more complex keys. The trend is thus towards the handling of pieces of data encoded on 768 or even 1024 bits. To process pieces of data of this kind, it is possible to envisage the use of larger-sized circuits by adapting the elements of the circuit to the sizes of the pieces of data. This approach may raise problems in applications such as smart card applications for which the size of the circuits is physically limited because of the differences in flexibility between the cards and the silicon substrates. Furthermore, there is a demand for the integration of increasing numbers of different functional elements on a card of this kind, and the space available for an encryption circuit is accordingly further reduced. It is therefore necessary to find solutions with which to limit the increase in the size of this circuit while at the same time enabling optimum operation for pieces of data whose size is greater than the size of the originally planned registers. This problem is not limited to modular arithmetic coprocessors that process pieces of data with a fixed size of 256 or 512 bits. It can also be transposed more generally to data-handling coprocessors that need to be used for operations on data whose size exceeds their processing capacity.
If it is desired to carry out modular operations using operands with a size greater than what is managed by the coprocessor (namely in practice greater than the size of the registers), it is possible to use a standard processor (with 8, 16 or 32 bits), a memory and the coprocessor of FIG. 1, the coprocessor being used to perform standard (that is to say non-modular) operations of multiplication.
It is possible, with the processor described in D2, to carry out standard operations of multiplication A * B on sizes of up to n bits by means of the following procedure.
1--Initialization
2--The setting up of a computation loop with i as an index varying from 1 to m
At the end of a procedure such as this, the least significant bits of the result are therefore in the register 12 and the most significant bits of the result are in the register 11. All that needs to be done is to add an output terminal connected to the output of the registers 11 and 12 to be able to recover the result at the output of the coprocessor.
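The standard multiplication procedure can be modeled in software as shown below. This is a behavioral sketch in which the variables `reg11` and `reg12` stand for the contents of the registers 11 and 12; the function name `multiply_wordwise` is hypothetical, and the serial bit-level operation of the circuit is not reproduced.

```python
def multiply_wordwise(a: int, b: int, k: int, m: int) -> tuple[int, int]:
    """Model of the non-modular product A * B on n = m * k bits.

    Returns (reg11, reg12): the n most significant and the n least
    significant bits of the 2n-bit result.
    """
    n = k * m
    mask = (1 << k) - 1
    reg11 = 0   # running partial sum (n bits)
    reg12 = 0   # low part of the result, filled k bits at a time
    for i in range(1, m + 1):
        a_word = (a >> (k * (i - 1))) & mask     # word A_{i-1}
        total = reg11 + a_word * b               # n + k bits
        # The k least significant bits enter register 12 by a k-bit rightward shift.
        reg12 = (reg12 >> k) | ((total & mask) << (n - k))
        # The n most significant bits are stored back in register 11.
        reg11 = total >> k
    return reg11, reg12
```

Concatenating the two returned values, with `reg11` as the upper half, gives the full product.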
It is possible to perform the multiplication of a piece of data B encoded on n bits by means of a piece of data A encoded on m' words with m' as an integer greater than m. For this purpose, the loop is done with i varying from 1 to m'. At every m iterations, the contents of the register 12 are output by an output terminal.
Since the coprocessor can be used to carry out standard operations of multiplication, it is possible to perform modular operations on operands encoded on a number m' * k bits with m'>m. For this purpose, the operands A, B and N are manipulated by being divided into q (q as an integer) sub-operands of n bits: A[q-1], A[q-2] . . . A[0], B[q-1], B[q-2] . . . B[0], N[q-1], N[q-2] . . . N[0]. The coprocessor is used to perform standard operations of multiplication on the sub-operands.
The following method is used:
1.1--The multiplication of B by the first sub-operand of the piece of data A
1--A[0] * B[0] is computed. The result has the form R[1].sub.0 R[0].sub.0 with R[1].sub.0 and R[0].sub.0 being pieces of binary data encoded on n bits. R[1].sub.0 and R[0].sub.0 are output from the registers 11 and 12 and they are stored in a memory.
2--A[0] * B[1] is computed. The result has the form R[1].sub.1 R[0].sub.1 with R[1].sub.1 and R[0].sub.1 being pieces of binary data encoded on n bits. R[1].sub.1 and R[0].sub.1 are output from the registers 11 and 12 and they are stored in a memory.
Q--A[0] * B[q-1] is computed. The result has the form R[1].sub.q-1 R[0].sub.q-1 with R[1].sub.q-1 and R[0].sub.q-1 being pieces of binary data encoded on n bits. R[1].sub.q-1 and R[0].sub.q-1 are output from the registers 11 and 12 and they are stored in a memory.
1.2--Computation of the result of the multiplication of B by the first sub-operand of A
If it is assumed that R[0].sub.0 =X[0], then the result of the multiplication is the piece of data X[q] X[q-1] . . . X[1] X[0].
It is of course possible to perform the addition operations as and when the results are output. This makes it possible to minimize the size of the memory in which the results are stored.
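Steps 1.1 and 1.2 amount to multiplying one n-bit sub-operand by each sub-operand of B and then recombining the halves of the partial products with a carry chain. The sketch below illustrates this recombination; the function name `multiply_block_by_operand` and the list representation of the sub-operands (least significant first) are assumptions made for the example.

```python
def multiply_block_by_operand(a0: int, b_blocks: list[int], n: int) -> list[int]:
    """Return the blocks X[0] .. X[q] of A[0] * B, least significant first.

    b_blocks holds the q sub-operands B[0] .. B[q-1], each on n bits.
    """
    mask = (1 << n) - 1
    q = len(b_blocks)
    # Step 1.1: each product A[0] * B[j] is split into R[1]_j (high) and R[0]_j (low).
    products = [a0 * bj for bj in b_blocks]
    r0 = [p & mask for p in products]
    r1 = [p >> n for p in products]
    # Step 1.2: X[0] = R[0]_0, then X[j] = c_{j-1} + R[1]_{j-1} + R[0]_j.
    x = [r0[0]]
    carry = 0
    for j in range(1, q):
        total = carry + r1[j - 1] + r0[j]
        x.append(total & mask)
        carry = total >> n
    # X[q] = c_{q-1} + R[1]_{q-1}
    x.append(carry + r1[q - 1])
    return x
```

Each pass through the loop only needs the previous high half and the current low half, which is consistent with performing the additions as and when the results are output.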
1.3--Computation of the result of a multiplication
X[0] * J[0]=Y, with Y having the form Y[1] Y[0], Y[1] and Y[0] being encoded on n bits, output and storage of Y[0].
1.4--Computation of the result of the multiplication of the first sub-operand of Y by the piece of data N
1--Y[0] * N[0] is computed. The result has the form T[1].sub.0 T[0].sub.0 with T[1].sub.0 and T[0].sub.0 being binary pieces of data encoded on n bits. T[1].sub.0 and T[0].sub.0 are output from the registers 11 and 12 and they are stored in a memory.
2--Y[0] * N[1] is computed. The result has the form T[1].sub.1 T[0].sub.1 with T[1].sub.1 and T[0].sub.1 encoded on n bits. T[1].sub.1 and T[0].sub.1 are output from the registers 11 and 12 and they are stored in a memory.
Q--Y[0] * N[q-1] is computed. The result has the form T[1].sub.q-1 T[0].sub.q-1 with T[1].sub.q-1 and T[0].sub.q-1 being pieces of binary data encoded on m words of k bits. T[1].sub.q-1 and T[0].sub.q-1 are output from the registers 11 and 12 and they are stored in a memory.
1.5--Computation of the result of the multiplication of N by the first sub-operand of the piece of data Y
If it is assumed that T[0].sub.0 =U[0], then the result of the multiplication is the piece of data U[q] U[q-1] . . . U[1] U[0].
It is of course possible to perform the addition operations as and when the results are output. This makes it possible to minimize the size of the memory in which the results are stored.
1.6--Computation of the result of the modular multiplication of B by the first sub-operand of the piece of data A
The result Z of the addition has the form (c) Z[q] Z[q-1] . . . Z[1] Z[0].
2--Resumption of the steps 1.1 to 1.6 in considering the second sub-operand of the piece of data A and in modifying the step 1.2 as here below
Then:
Q--Resumption of the above step 2 in taking into consideration the qth sub-operand of A.
The final result of the computation is S(q)-(N or 0).
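Taken as a whole, the method above is a Montgomery reduction carried out on q sub-operands of n bits. The sketch below models it at that level. It is written under several assumptions: J[0] is taken to satisfy (N[0] * J[0] + 1) mod 2.sup.n = 0, by analogy with J.sub.0; Python integers stand in for the individual n-bit multiplications and carry chains; and the function name `montgomery_by_blocks` is hypothetical.

```python
def montgomery_by_blocks(a: int, b: int, n_mod: int, n: int, q: int) -> int:
    """Model of the q-block method: returns A * B * 2**(-n*q) mod N.

    Assumes N odd and A, B < N < 2**(n*q).
    """
    mask = (1 << n) - 1
    # Assumption: J[0] = -N[0]**(-1) mod 2**n, where N[0] is the low block of N.
    j0 = (-pow(n_mod & mask, -1, 1 << n)) & mask
    s = 0
    for i in range(q):
        a_block = (a >> (n * i)) & mask      # sub-operand A[i]
        x = a_block * b + s                  # steps 1.1-1.2 (plus S from the previous pass)
        y0 = ((x & mask) * j0) & mask        # step 1.3: Y[0], low block of X[0] * J[0]
        u = y0 * n_mod                       # steps 1.4-1.5: U = Y[0] * N
        z = u + x                            # step 1.6: Z = U + X
        s = z >> n                           # low block of Z is zero, so the shift is exact
        if s >= n_mod:                       # subtraction of N or 0
            s -= n_mod
    return s
```

In the method described, each of the products in this loop is itself assembled from q coprocessor multiplications whose results pass through the external memory.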
As can be seen, the method requires a certain number of exchanges of data with the exterior. These exchanges entail penalties in terms of computation time and memory space to store the results extracted from the coprocessor. Generally, the value of the coprocessors is that they use a faster clock frequency than that of the other elements that are connected to them. Hence, the value of using a coprocessor is minimized if the processing operations for which it is designed involve exchanges with circuits (standard processors, memories, etc.) that work more slowly, namely circuits to whose speeds they have to adapt during the exchanges.
The inventor has sought to modify the coprocessor illustrated in FIG. 1 so as to improve the processing of the above operations, and more particularly so as to reduce the processing time. To do this, the inventor proposes to modify the existing device so that it makes parallel use of the multiplication circuits 19 and 20, making it possible to perform operations, both modular and non-modular, at higher speed. The device according to the invention makes it possible to implement new methods of computation which are faster than the methods that can be implemented by the device of FIG. 1.