1. Field of the Invention
The invention relates to a modular arithmetic coprocessor able to perform non-modular operations at high speed. More specifically, the invention relates to an improved arithmetic coprocessor enabling the performance of modular operations according to the Montgomery method in order to extend the possibilities of such a coprocessor. The Montgomery method enables the performance of modular computations in a finite field (or Galois field) denoted GF(2.sup.n) without the performance of divisions.
2. Discussion of the Related Art
Conventionally, modular operations on GF(2.sup.n) are used in cryptography for applications such as the authentication of messages, the identification of a user and the exchange of keys. Examples of such applications are described in a French patent application published under No. 2 679 054, hereinafter referred to as document D1.
There are commercially available integrated circuits dedicated to such applications. These include, for example, the circuit manufactured by SGS-THOMSON MICROELECTRONICS S.A. and referenced ST16CF54, built around a central processing unit associated with an arithmetic coprocessor dedicated to the performance of modular computations. The coprocessor used enables the processing of modular operations by the use of the Montgomery method. It is the subject of a European patent application filed under the reference No. 0 601 907 A2, hereinafter called the document D2. This coprocessor is illustrated in FIG. 1 (this figure corresponds to FIG. 2 of the document D2).
The basic operation (called a P.sub.field operation) consists of the production, on the basis of three pieces of binary data A (multiplicand), B (multiplier) and N (modulus), each encoded on an integer number n of bits, of a piece of binary data denoted P.sub.field (A, B).sub.N encoded on n bits, such that P.sub.field (A, B).sub.N =A*B*I mod N. I is a piece of binary data called an error, encoded on n bits, such that I=2.sup.-n mod N (more specifically, the value of I depends on the number of blocks of k bits, with k as an integer, considered for the encoding of A). To carry out the operation A*B*I mod N, it is assumed that the pieces of data are encoded on m words of k bits (m and k being whole numbers), with m*k=n. The words of the pieces of data A and B are given to a multiplication circuit 19 having a series input to receive B, a parallel input to receive the blocks of k bits of A, and a series output.
In the coprocessor described in the document D2, k=32 and m=8 or 16. This circuit may be used to produce the result of the modular multiplication A*B mod N. The modular multiplication can be broken down into two successive P.sub.field elementary operations: P.sub.field (P.sub.field (A,B).sub.N, H).sub.N is computed, with H being a piece of data encoded on n bits, called an error correction parameter, and equal to 2.sup.2n mod N. For further details on the implementation of modular multiplication, reference may be made to document D2. Several possibilities of computation are already known. They consist in using either a software method or a specialized circuit such as the one illustrated in document D2.
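The decomposition into two P.sub.field operations can be checked numerically. The following is a minimal reference sketch in Python, not a model of the circuit: the function names pfield and modmul are illustrative only, N is assumed to be odd, and the modular inverse relies on the three-argument pow of Python 3.8 or later.

```python
def pfield(a, b, n_mod, n_bits):
    """Reference model of P_field(A, B)_N = A * B * I mod N."""
    # I = 2^(-n) mod N; the inverse exists because N is assumed odd.
    inv = pow(2, -n_bits, n_mod)
    return (a * b * inv) % n_mod


def modmul(a, b, n_mod, n_bits):
    """A * B mod N obtained from two successive P_field operations."""
    # H = 2^(2n) mod N, the error correction parameter.
    h = pow(2, 2 * n_bits, n_mod)
    # The first call leaves a factor 2^(-n); the second call multiplies by
    # 2^(2n) * 2^(-n), which cancels it.
    return pfield(pfield(a, b, n_mod, n_bits), h, n_mod, n_bits)
```

For example, with n=8 and N=239, modmul(100, 200, 239, 8) returns the same value as (100*200) mod 239.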
The circuit illustrated in FIG. 1 includes:
three shift registers 10, 11 and 12 with series input and output, these registers having a number n of cells, with n=m*k;
multiplexers 13, 14 and 15 placed respectively before the registers 10, 11 and 12;
three registers 16, 17 and 18 with series input and parallel output;
two multiplication circuits 19 and 20, including one series input, one parallel input and one series output;
two registers 21 and 22 with k cells;
multiplexers 24, 25, 26, 36, 37 and 38;
a demultiplexer 39;
series subtraction circuits 27, 28 and 29;
series addition circuits 30 and 31;
delay cells 32, 33 and 34 to delay the propagation of pieces of binary data by k cycle periods;
a comparison circuit 35.
For further details on the arrangement of the different elements of the circuit with respect to one another, reference may be made to the document D2 and especially to FIGS. 2 and 3, and to the extracts from the description pertaining thereto: page 15, line 54 to page 16, line 13, and page 17, line 50 to page 18, line 55.
The use of the circuit illustrated in FIG. 1 to carry out a P.sub.field (A,B).sub.N operation where A, B, and N are encoded on n bits in m words of k bits, N is an odd number and A is subdivided into m words A.sub.i-1 (with i as an integer index varying from 1 to m), includes the following steps:
1--Initialization of the Circuit
the software computation of a parameter J.sub.0 defined by the relationship [(N*J.sub.0)+1] mod 2.sup.k =0,
the serial loading of the parameter J.sub.0 into the register 17,
the serial loading of B into the register 10,
the serial loading of N into the register 12,
the serial loading of A.sub.0 into the register 16,
the serial loading of n consecutive logic zeros into the register 11,
the initialization of the two multiplication circuits 19 and 20, the three subtraction circuits 27, 28 and 29, and the two addition circuits 30 and 31.
2--The Setting Up of a Loop Indexed by i with i Varying from 1 to m
the parallel loading into the register 21 of the contents of the register 16,
the performance of different elementary operations in order to perform the following computations:
X(i)=S(i-1)+B*A.sub.i-1
Y.sub.0 (i)=[X(i)*J.sub.0 ] mod 2.sup.k
Z(i)=X(i)+[N*Y.sub.0 (i)]
S(i)=Z(i)/2.sup.k, where / denotes integer division,
subtraction, during the following iteration, of N or 0 from S depending on whether S is greater than N or not.
3--The Output of the Result S(m) by Means of an Output Terminal
For further details on the running of a method of this kind, reference may be made to the document D2 and more particularly to the following extracts: page 4--line 41 to page 6--line 17 and page 19--lines 7 to 49.
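The three steps above can be sketched in software as follows. This is a minimal illustration under stated assumptions, not the circuit of document D2: the name pfield_wordserial is hypothetical, A and B are assumed smaller than N, N is assumed odd, and the conditional subtraction of N is applied at the end of each iteration rather than during the following one.

```python
def pfield_wordserial(a, b, n_mod, k, m):
    """Word-serial P_field: returns A * B * 2^(-m*k) mod N."""
    mask = (1 << k) - 1
    # J0 satisfies (N * J0 + 1) mod 2^k == 0, i.e. J0 = -N^(-1) mod 2^k.
    j0 = (-pow(n_mod, -1, 1 << k)) & mask
    s = 0                                      # S(0): register 11 starts at zero
    for i in range(1, m + 1):
        a_word = (a >> (k * (i - 1))) & mask   # word A_(i-1) of A
        x = s + b * a_word                     # X(i) = S(i-1) + B * A_(i-1)
        y0 = (x * j0) & mask                   # Y0(i) = [X(i) * J0] mod 2^k
        z = x + n_mod * y0                     # Z(i): its k low bits are zero
        s = z >> k                             # S(i) = Z(i) / 2^k
        if s >= n_mod:                         # subtraction of N or 0
            s -= n_mod
    return s                                   # S(m)
```

The choice of J.sub.0 is what makes the division exact: adding N*Y.sub.0 (i) cancels the k least significant bits of X(i), so the shift by k bits discards only zeros.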
The circuit shown in FIG. 1 improves processing operations (in terms of computation time, memory size, etc.) for modular operations using a fixed data size, in this case 256 or 512 bits (depending on whether m is equal to 8 or 16). However, cryptography requires increasingly efficient machines working at ever-higher speeds and using ever-more complex keys. The trend is towards the handling of pieces of data encoded on 768 or even 1024 bits. To process pieces of data of this kind, it is possible to envisage the use of larger-sized circuits by adapting the elements of the circuit to the size of the pieces of data. This approach may raise problems in applications such as smart card applications, for which the size of the card is physically limited because of the differences in flexibility between the cards and the silicon substrate. Furthermore, there is a demand for the integration of increasing numbers of different functional elements on a card of this kind, and the space available for an encryption circuit is thereby further reduced. It is therefore necessary to find solutions that limit the increase in the size of this circuit while at the same time enabling optimum operation for pieces of data whose size is greater than the size of the originally planned registers. This problem is not limited to modular arithmetic coprocessors that process pieces of data with a fixed size of 256 or 512 bits. It can also be transposed more generally to data-handling coprocessors that need to be used for operations on data whose size exceeds their processing capacity.
If it is desired to carry out modular operations using operands with a size greater than what is managed by the coprocessor (namely in practice greater than the size of the registers), it is possible to use a standard processor (with 8, 16 or 32 bits), a memory and the coprocessor of FIG. 1, the coprocessor being used to perform operations of multiplication.
It is possible, with the coprocessor described in D2, to carry out standard operations of multiplication A*B on sizes of up to n bits by means of the following procedure.
1--Initialization
the loading of k logic zeros into the register 17,
the loading of B into the register 10,
the loading of n logic zeros into the register 11,
the loading of the word A.sub.0 of k bits of A into the register 16,
the initialization of the multiplier 19,
the initialization of the addition and subtraction circuits 28, 30 and 31.
2--The Setting up of a Computation Loop with i as an Index Varying from 1 to m
the loading of the contents of the register 16 into the register 21,
the performance, by a simultaneous rightward shift of the registers 10 and 11, of the multiplication of A.sub.i-1 by B and the addition of the contents of the register 11 with the result of the product,
the storage of the k least significant bits of the result into the register 12 by a k-bit rightward shift,
the storage of the n most significant bits of the result into the register 11,
the loading of the word A.sub.i into the register 16 (preferably during the running of one or more of the preceding steps).
At the end of a method such as this, the least significant bits of the result are therefore in the register 12 and the most significant bits of the result are in the register 11. All that needs to be done is to add an output terminal connected to the output of the registers 11 and 12 to be able to recover the result at the output of the coprocessor.
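The multiplication procedure above can be modeled as follows. This is a behavioral sketch only: the variables reg11 and reg12 stand in for the contents of the registers 11 and 12, the name multiply_wordserial is hypothetical, and the k-bit rightward shift of the register 12 is modeled by placing each k-bit word directly at its final position.

```python
def multiply_wordserial(a, b, k, m):
    """Plain product A * B computed one k-bit word of A at a time.

    Returns (reg11, reg12): the n most significant and the n least
    significant bits of the 2n-bit result, with n = m * k.
    """
    mask = (1 << k) - 1
    reg11 = 0   # running most significant part, initially n zeros
    reg12 = 0   # least significant part, filled k bits per iteration
    for i in range(1, m + 1):
        a_word = (a >> (k * (i - 1))) & mask     # word A_(i-1)
        total = reg11 + b * a_word               # multiply and accumulate
        reg12 |= (total & mask) << (k * (i - 1)) # k low bits go to register 12
        reg11 = total >> k                       # remaining n bits stay in 11
    return reg11, reg12
```

The full product is then recovered as reg11*2.sup.n +reg12.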
It is possible to perform the multiplication of a piece of data B encoded on n bits by a piece of data A encoded on m' words, with m' as an integer greater than m. For this purpose, the loop is done with i varying from 1 to m'. At every m iterations, the contents of the register 12 are output by an output terminal.
Since the coprocessor can be used to carry out standard operations of multiplication, it is possible to perform modular operations on operands encoded on m'*k bits with m'>m. For this purpose, the operands A, B and N are manipulated by being divided into q (q as an integer) sub-operands of n bits: A.sub.q-1, A.sub.q-2 . . . A.sub.0, B.sub.q-1, B.sub.q-2 . . . B.sub.0, N.sub.q-1, N.sub.q-2 . . . N.sub.0. The coprocessor is used to perform standard operations of multiplication on the sub-operands.
The following method is used:
1.1--Multiplication of B by the First Sub-operand of A
1--Computation of A.sub.0 *B.sub.0.
The result has the form R.sub.10 R.sub.00 with R.sub.10 and R.sub.00 being pieces of data encoded on n bits. R.sub.10 and R.sub.00 are output from the registers 11 and 12 and they are stored in a memory.
2--Computation of A.sub.0 *B.sub.1.
The result has the form R.sub.11 R.sub.01 with R.sub.11 and R.sub.01 encoded on n bits.
R.sub.11 and R.sub.01 are output from the registers 11 and 12 and they are stored in a memory.
Q--Computation of A.sub.0 *B.sub.q-1.
The result has the form R.sub.1q-1 R.sub.0q-1 with R.sub.1q-1 and R.sub.0q-1 encoded on n bits.
R.sub.1q-1 and R.sub.0q-1 are output from the registers 11 and 12 and they are stored in a memory.
1.2--Computation of the Result of the Multiplication of B by the First Sub-operand of A
computation of R.sub.10 +R.sub.01 and storage of the result referenced X.sub.1,
computation of c.sub.1 +R.sub.11 +R.sub.02 (c.sub.1 being the carry value of the previous calculation) and storage of the result referenced X.sub.2,
computation of c.sub.q-2 +R.sub.1q-2 +R.sub.0q-1 and storage of the result referenced X.sub.q-1,
computation of c.sub.q-1 +R.sub.1q-1 and storage of the result referenced X.sub.q.
If it is assumed that R.sub.00 =X.sub.0, then the result of the multiplication is the piece of data X.sub.q X.sub.q-1 . . . X.sub.1 X.sub.0.
It is of course possible to perform the addition operations as and when the results are output. This makes it possible to minimize the size of the memory in which the results are stored.
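Steps 1.1 and 1.2 can be sketched as follows. This is a minimal illustration under stated assumptions: the name multiply_by_suboperand is hypothetical, the sub-operands of B are passed least significant first, and each Python product a0*b_j stands in for one multiplication performed by the coprocessor.

```python
def multiply_by_suboperand(a0, b_subs, n_bits):
    """Compute A_0 * B from n-bit partial products.

    b_subs holds B_0, B_1, ..., B_(q-1). Returns [X_0, X_1, ..., X_q],
    least significant word first.
    """
    mask = (1 << n_bits) - 1
    r_lo, r_hi = [], []
    for b_j in b_subs:                   # step 1.1: q coprocessor products
        p = a0 * b_j
        r_lo.append(p & mask)            # R_0j, as output from register 12
        r_hi.append(p >> n_bits)         # R_1j, as output from register 11
    q = len(b_subs)
    x = [r_lo[0]]                        # X_0 = R_00
    carry = 0
    for j in range(1, q):                # step 1.2: carry-chained additions
        t = carry + r_hi[j - 1] + r_lo[j]
        x.append(t & mask)               # X_j
        carry = t >> n_bits              # c_j
    x.append(carry + r_hi[q - 1])        # X_q = c_(q-1) + R_1,q-1
    return x
```

Performing the additions as each partial product is output, as noted above, would avoid keeping the two lists of partial products in memory.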
1.3--Computation of the Result of a Multiplication
X.sub.0 *J.sub.0 =Y with Y having the form Y.sub.1 Y.sub.0, Y.sub.1 and Y.sub.0 being encoded on n bits, output and storage of Y.sub.0.
1.4--Computation of the Result of the Multiplication of the First Sub-operand of Y by N
1--Computation of Y.sub.0 *N.sub.0.
The result has the form T.sub.10 T.sub.00 with T.sub.10 and T.sub.00 being encoded on n bits.
T.sub.10 and T.sub.00 are output from the registers 11 and 12 and they are stored in a memory.
2--Computation of Y.sub.0 *N.sub.1.
The result has the form T.sub.11 T.sub.01 with T.sub.11 and T.sub.01 encoded on n bits.
T.sub.11 and T.sub.01 are output from the registers 11 and 12 and they are stored in a memory.
Q--Computation of Y.sub.0 *N.sub.q-1.
The result has the form T.sub.1q-1 T.sub.0q-1 with T.sub.1q-1 and T.sub.0q-1 encoded on n bits.
T.sub.1q-1 and T.sub.0q-1 are output from the registers 11 and 12 and they are stored in a memory.
1.5--Computation of the Result of the Multiplication of N by the First Sub-operand of Y
computation of T.sub.10 +T.sub.01 and storage of the result referenced U.sub.1,
computation of c.sub.1 +T.sub.11 +T.sub.02 (c.sub.1 being the carry value of the previous computation) and storage of the result referenced U.sub.2,
computation of c.sub.q-2 +T.sub.1q-2 +T.sub.0q-1 and storage of the result referenced U.sub.q-1,
computation of c.sub.q-1 +T.sub.1q-1 and storage of the result referenced U.sub.q.
If it is assumed that T.sub.00 =U.sub.0, then the result of the multiplication is the piece of data U.sub.q U.sub.q-1 . . . U.sub.1 U.sub.0.
It is of course possible to perform the addition operations as and when the results are output. This makes it possible to reduce the size of the memory in which the results are stored.
1.6--Computation of the Result of the Modular Multiplication of B by the First Sub-operand of A
Computation of U+X and storage of the result, referenced Z.
The result Z of the addition has the form (c) Z.sub.q Z.sub.q-1 . . . Z.sub.1 Z.sub.0.
storage of S(1)=Z/2.sup.k -(N or 0).
2--Resumption of the Steps 1.1 to 1.6 in Considering the Second Sub-operand of A by Modifying the Step 1.2 as Here Below
computation of R.sub.10 +R.sub.01 and storage of the result referenced W.sub.1,
computation of c.sub.1 +R.sub.11 +R.sub.02 (c.sub.1 being the carry value of the previous calculation) and storage of the result referenced W.sub.2,
computation of c.sub.q-2 +R.sub.1q-2 +R.sub.0q-1 and storage of the result referenced W.sub.q-1,
computation of c.sub.q-1 +R.sub.1q-1 and storage of the result referenced W.sub.q.
Then:
computation of W+S(1) which then gives the result of the multiplication X with X having the form X.sub.q X.sub.q-1 . . . X.sub.1 X.sub.0 and W=W.sub.q W.sub.q-1 . . . W.sub.1 W.sub.0 with W.sub.0 =R.sub.00.
Q--Resumption of the Above Steps in Taking into Consideration the q.sup.th Sub-operand of A
The final result of the computation is S(q)-(N or 0).
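The overall flow of steps 1.1 to 1.6, repeated for each sub-operand of A, can be sketched as follows. This is an illustration under explicit assumptions and not a transcription of the method: the name modmul_suboperands is hypothetical, the reduction is assumed to be applied per n-bit sub-operand (so that J.sub.0 is taken here modulo 2.sup.n and Z is shifted by n bits), N is assumed odd, A and B are assumed smaller than N, and Python integers replace the word-by-word products and carry-chained additions.

```python
def modmul_suboperands(a, b, n_mod, n_bits, q):
    """Returns A * B * 2^(-q*n) mod N using q sub-operands of n bits."""
    mask = (1 << n_bits) - 1
    # Assumption: (N * J0 + 1) mod 2^n == 0, taken on the sub-operand size.
    j0 = (-pow(n_mod, -1, 1 << n_bits)) & mask
    s = 0
    for i in range(q):
        a_i = (a >> (n_bits * i)) & mask   # sub-operand A_i
        x = s + a_i * b                    # steps 1.1-1.2, plus W + S(i) in step 2
        y0 = ((x & mask) * j0) & mask      # step 1.3: Y_0 from X_0 * J_0
        u = y0 * n_mod                     # steps 1.4-1.5: U = Y_0 * N
        z = x + u                          # step 1.6: Z = U + X, low word is zero
        s = z >> n_bits                    # drop the zero low word
        if s >= n_mod:                     # subtraction of N or 0
            s -= n_mod
    return s
```

Each line of the loop body corresponds to several coprocessor multiplications followed by additions performed outside the coprocessor, which is where the exchanges with the exterior arise.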
As can be seen, the method requires a certain number of exchanges with the exterior. These exchanges entail penalties in terms of computation time and of memory space needed to store, firstly, a part of the result extracted from the coprocessor and, secondly, the executable program implementing the method, which is complex. Generally, the value of coprocessors is that they use a faster clock frequency than that of the other elements that are connected to them. Hence, the value of using a coprocessor is reduced if the processing operations for which it is designed involve exchanges with circuits (standard processors, memories, etc.) that work more slowly, namely circuits to whose speeds the coprocessor has to adapt during the exchanges.
The inventor has sought to modify the coprocessor illustrated in FIG. 1 so as to improve the processing of the above operation, and more particularly so as to reduce the processing time.