1. Field of the Invention
The invention relates to a device to improve the processing speed of a modular arithmetic coprocessor that is used in the implementation of modular operations according to the Montgomery method. Modular operations according to the Montgomery method enable the performance of modular computations in a finite field denoted GF(2.sup.n) (namely a Galois field with 2.sup.n elements) without carrying out any division.
2. Discussion of the Related Art
Conventionally, modular operations on GF(2.sup.n) are used in cryptography for applications such as the authentication of messages, the identification of a user and the exchange of keys. Examples of such applications are described in the French patent application published under No. 2 679 054.
Integrated circuits dedicated to such applications are commercially available. These include, for example, the product referenced ST16CF54 manufactured by SGS-THOMSON MICROELECTRONICS S.A., which combines a central processing unit with an arithmetic coprocessor dedicated to the performance of modular computations. The coprocessor enables the processing of modular multiplication operations by using the Montgomery method. It is the object of a European patent application filed under the reference No. 0 601 907 A2, and is illustrated in FIG. 1 (this figure corresponds to FIG. 2 of the European patent application referred to).
The circuit illustrated in FIG. 1 enables the computation of a P.sub.field (A,B).sub.N operation. This operation is equivalent to the computation (A * B * I) mod N, where I is an error that can be compensated for. Commonly, A, B and N are encoded on 256 or 512 bits and, in the near future, will be encoded on 1024 bits. The P.sub.field operation uses, inter alia, multiplication circuits. To reduce the size of these circuits, a working base is used. This working base enables the performance of certain computations on words of Bt bits.
Conventionally, Bt is equal to 32 bits.
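As a rough illustration of what a P.sub.field operation computes, the following Python sketch performs a word-by-word Montgomery product in software with a working base of Bt bits. It is a minimal sketch and not a model of the circuit of FIG. 1: the names pfield and word_count are hypothetical, m is assumed to be derived from the size of N, and B is assumed to be smaller than N. Under these assumptions the error I works out to 2.sup.-m*Bt mod N.

```python
def word_count(n: int, bt: int = 32) -> int:
    """Number m of bt-bit words needed to encode n (assumed convention)."""
    return -(-n.bit_length() // bt)  # ceiling division


def pfield(a: int, b: int, n: int, bt: int = 32) -> int:
    """Word-serial Montgomery product (illustrative sketch).

    Returns (a * b * I) mod n with I = 2**(-m*bt) mod n, using no division
    other than shifts by bt bits. Assumes n odd, 0 <= b < n, and
    0 <= a < 2**(m*bt).
    """
    if n % 2 == 0:
        raise ValueError("the modulus N must be odd")
    m = word_count(n, bt)
    base = 1 << bt
    mask = base - 1
    # J0 satisfies ((N * J0) + 1) mod 2**Bt == 0 (needs Python 3.8+).
    j0 = (-pow(n, -1, base)) % base
    s = 0
    for i in range(m):
        a_i = (a >> (i * bt)) & mask      # i-th word of A
        x = s + a_i * b                   # X = S + A_i * B
        y0 = ((x & mask) * j0) & mask     # Y0 = (X * J0) mod 2**Bt
        z = x + y0 * n                    # Z = X + N * Y0, a multiple of 2**Bt
        s = z >> bt                       # S = Z \ 2**Bt (exact)
        if s >= n:                        # keep S below N
            s -= n
    return s
```

One common way of compensating for the error I, shown here only as an assumption about how the result might be used, is a second pass with the constant H = 2.sup.2*m*Bt mod N: pfield(pfield(a, b, n), H, n) then equals (a * b) mod n.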
The use of a circuit such as that of FIG. 1 to carry out a P.sub.field (A,B).sub.N operation, where A, B and N are encoded on m * Bt bits, N is odd, and A is subdivided into m words A.sub.i of Bt bits, is equivalent to performing the following steps:
E1: the resetting of the circuit:
  the software computation of a parameter J.sub.0 defined by ((N * J.sub.0)+1) mod 2.sup.Bt =0,
  the loading of J.sub.0 into a first register 17 of Bt bits,
  the loading of B into a second register 10 of m*Bt bits,
  the loading of N into a third register 12 of m*Bt bits,
  the loading of A.sub.0 into a fourth register 16 of Bt bits,
  the loading of 0 into a fifth register 11 of m*Bt bits,
  the resetting of a first multiplication circuit 19 and a second multiplication circuit 20,
  the resetting of a first subtraction circuit 27, a second subtraction circuit 28 and a third subtraction circuit 29,
  the resetting of a first addition circuit 30 and a second addition circuit 31,
E2: first iteration:
  the loading into a sixth register 21 of the contents of the fourth register 16,
  the performance of different elementary operations in order to carry out the following computations:
    X=A.sub.0 * B,
    Y.sub.0 =(X * J.sub.0) mod 2.sup.Bt,
    Z=X+(N * Y.sub.0),
    S=Z \ 2.sup.Bt, \ being an integer division,
    if S is greater than N, then N is subtracted from S at the next iteration,
    S is loaded into the fifth register 11,
E3: following iterations, with i varying from 1 to m-1:
  the loading of A.sub.i into the fourth register 16,
  the loading into the sixth register 21 of the contents of the fourth register 16,
  the performance of different elementary operations in order to carry out the following computations:
    X=S+A.sub.i * B,
    Y.sub.0 =(X * J.sub.0) mod 2.sup.Bt,
    Z=X+(N * Y.sub.0),
    S=Z \ 2.sup.Bt, \ being an integer division,
    if S is greater than N, then N is subtracted from S at the next iteration,
    S is loaded into the fifth register 11,
E4: last iteration, either 0 or N is subtracted from S by means of the subtraction circuit 28, and then this value of S is loaded into the second register 10,
E5: the contents of the second register are taken out by means of an output terminal.

For further details of the running of such a method, reference may be made to the patent application EP-A-0 601 907 and more particularly to page 4 line 41 to page 6 line 17 and page 19 lines 7 to 49.

A method of this kind uses a certain number of internal clock cycles and loading clock cycles. Special attention will be paid to the number of cycles of an iteration, for example in the step E3:
  the loading of A.sub.i into the fourth register 16 uses Bt loading clock cycles,
  the loading into the sixth register 21 of the contents of the fourth register 16 uses one internal cycle,
  the performance of the different elementary operations, in which the different computations of the iteration are carried out, does not require waiting for the result of one computation in order to begin the next computation, and uses (m+2) * Bt+x internal clock cycles, x being a number that corresponds to iterations of synchronization and resetting. Hereinafter in this document, x shall be disregarded.

In practice, it is possible to load A.sub.i while carrying out the computations of an iteration, as soon as the contents of the fourth register 16 are loaded into the sixth register 21. There is no problem if the loading clock is the same as the internal clock.

Now, the loading clock depends on a variety of external elements, for example a communications bus that conveys data elements. The system constituted by the central processing unit and coprocessor will have constraints on the setting of the pace of the loading clock, which may in practice be, for example, at a frequency of 2 MHz. An integrated circuit, on the contrary, may permit a cycle period of the order of 10 ns, corresponding to an internal frequency of 100 MHz. Loading A.sub.i during the performance of an iteration allows a loading cycle time up to m+2 times longer than an internal cycle time without slowing down the process. In other words, the internal processing frequency can be up to m+2 times the loading frequency without the loading slowing down the processing operation.

In practice, a working base of 32 bits is used with 256-bit words or 512-bit words. This corresponds to m=8 or 16. At worst (m=8), the internal processing frequency can therefore be only ten times greater than the loading frequency before the loading time affects the processing.

  a first register of Bt bits to contain Bt bits of A having one series input and one parallel output,
  a second, third and fourth shift register of m * Bt bits, m being an integer greater than 1, having one series input and one series output, these registers containing B, N and the result,
wherein there is added a fifth shift register of m * Bt bits, in order to contain m * Bt bits of A, a series input of this fifth register being connected to an input terminal, and a series output of this fifth register being connected to the input of the first register.

E1: computing a parameter J.sub.0 encoded on Bt bits, with J.sub.0 =-N.sub.0.sup.-1 mod 2.sup.Bt, N.sub.0 being the least significant word of the modulus N, and loading J.sub.0 into a first register of Bt bits;
E2: loading a multiplicand A, a multiplier B and a modulo N value into respective n-bit registers, with n=m * Bt, and resetting another n-bit register, the contents of this register being denoted as S, with S as a binary variable data element encoded on n bits;
E3: setting up a loop, indexed by an index i, with i varying from 1 to m, each ith iteration comprising the following operations:
  a) rightward shifting by Bt bits the register containing A, loading these Bt bits corresponding to a word A.sub.i of A into a second register of Bt bits,
  b) transferring the ith word A.sub.i-1 of the multiplicand A from the second register to a storage register,
  c) producing a value X(i)=S(i-1)+B * A.sub.i-1 with S(0)=0 and S(i-1) as the so-called updated value of S,
  d) producing a value Y.sub.0 (i)=X.sub.0 (i) * J.sub.0 mod 2.sup.Bt,
  e) computing a value Z(i)=X(i)+Y.sub.0 (i) * N,
  f) not taking account of the least significant word of Z(i) and storing the remaining words, namely Z(i)/2.sup.Bt, in the register containing S,
  g) performing a bit-by-bit comparison of Z(i)/2.sup.Bt with N in order to determine the updated value S(i) of the next iteration, in the manner described here above, this comparison being done through the bit-by-bit subtraction of Z(i)/2.sup.Bt and N in a series subtraction circuit, N having been delayed by Bt additional cycles;
E4: at the mth iteration, disregarding the least significant word of Z(m) and entering the remaining words, namely Z(m)/2.sup.Bt, into the register containing B; and
E5: outputting the result stored in the register containing B at the last iteration, possibly subtracting N if necessary.
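The clock-ratio argument of the related-art circuit discussed above (a word of A loaded in Bt loading cycles while an iteration takes about (m+2) * Bt internal cycles) can be checked with a few lines of Python. This is a sketch under the same simplification as the text, namely that x and the single transfer cycle are disregarded; the function names max_clock_ratio and loading_is_hidden are hypothetical.

```python
def max_clock_ratio(operand_bits: int, bt: int = 32) -> int:
    """Largest internal/loading frequency ratio that still hides the loading.

    One word of A needs bt loading cycles, while one iteration takes about
    (m + 2) * bt internal cycles, so the ratio may be at most m + 2.
    """
    m = operand_bits // bt
    return m + 2


def loading_is_hidden(operand_bits: int, f_internal_hz: int,
                      f_loading_hz: int, bt: int = 32) -> bool:
    """True if loading A_i finishes within the time of one iteration."""
    return f_internal_hz <= max_clock_ratio(operand_bits, bt) * f_loading_hz


print(max_clock_ratio(256))   # m = 8
print(max_clock_ratio(512))   # m = 16
# Figures quoted in the text: 100 MHz internal clock, 2 MHz loading clock.
print(loading_is_hidden(512, 100_000_000, 2_000_000))
```

With the figures quoted in the text, the ratio of 50 between a 100 MHz internal clock and a 2 MHz loading clock exceeds m+2 for both 256-bit and 512-bit operands, so the loading of the words of A would limit the processing speed.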