This application is based upon and claims priority from prior French Patent Application No. 99-03409, filed Mar. 17, 1999, the entire disclosure of which is herein incorporated by reference.
1. Field of the Invention
The present invention generally relates to a device and method for the implementation of an elementary modular operation according to the Montgomery method. This method can be used to perform modular computations in a finite field (or Galois field) without performing divisions.
Conventionally, modular operations in finite fields are used in cryptography for applications such as the authentication of messages, the identification of a user, or the exchange of keys. Applications of this kind are described, for example, in patent application FR-A-2 679 054 (hereinafter D1).
2. Description of Related Art
Integrated circuits dedicated to such applications are commercially available, for example the product referenced ST16CF54 manufactured by STMicroelectronics S.A. This product is built around a central processing unit associated with an arithmetic coprocessor, and is dedicated to the implementation of modular computations. The coprocessor used enables the processing of modular multiplication operations by using the Montgomery method. It is the subject of the patent application EP-A-0 601 907 (hereinafter D2).
The basic operation, called a Pfield operation, consists of the use of three binary data elements, A (multiplicand), B (multiplier lower than N) and N (modulo), encoded on an integer number n of bits, to produce a binary data element referenced P(A, B)N encoded on n bits such that P(A, B)N=A*B*I mod N, with I=2^(-n) mod N. For this purpose, it is assumed that the data elements are encoded on m words of k bits with m and k as integers such that m*k=n, and the words A and B are given to a multiplication circuit having a serial input, a parallel input and a series output.
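As a point of reference, the definition above can be evaluated directly with a modular inverse. The following is only a minimal sketch in Python (3.8 or later): the function name `pfield_reference` and its parameter names are hypothetical, and it assumes that N is odd so that 2^(-n) mod N exists. It uses a modular inversion, which is precisely what the Montgomery method avoids in hardware, so it is useful only for checking results.

```python
def pfield_reference(a, b, n_mod, n_bits):
    """Return A*B*I mod N with I = 2^(-n_bits) mod N.

    Hypothetical helper for checking results only.
    Assumes n_mod is odd, otherwise 2 has no inverse modulo n_mod.
    """
    if n_mod % 2 == 0:
        raise ValueError("N must be odd for 2^(-n) mod N to exist")
    # Built-in pow accepts a negative exponent with a modulus (Python 3.8+).
    inv = pow(2, -n_bits, n_mod)
    return (a * b * inv) % n_mod
```

Multiplying the value returned by 2^n modulo N gives back A*B mod N, which is a convenient sanity check.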
For the coprocessor described in D2, we have k=32 and m=8 or 16. FIG. 1 shows the modular arithmetic coprocessor disclosed by D2. This coprocessor comprises the following elements:
three shift registers 10, 11 and 12, with m*k bits, designed to receive respectively the multiplier B, the result S and the modulo N;
multiplexers 13 to 15 that are respectively connected to the inputs of the registers 10 to 12;
three k-bit shift registers 16, 17 and 18 having one series input and one parallel output, designed to receive respectively k bits of the multiplicand A, a computation parameter referenced J0, and an intermediate result referenced Y0;
two multiplication circuits 19 and 20 each having one series input, one parallel k-bit input and one series output;
two k-bit parallel latches 21 and 22 used as a buffer for the multiplication circuits 19 and 20;
a multiplexer 23 used to connect the latch 22 either to the register 17 or to the register 18;
three multiplexers 24, 25 and 26 used to route the data elements to the inputs of the multiplication circuits 19 and 20;
three subtraction circuits 27, 28 and 29 each comprising two series inputs and one series output;
two addition circuits 30 and 31, each having two series inputs and one series output;
three delay cells 32, 33 and 34 that are actually k-bit shift registers, and are used to delay the data elements by k clock cycles to mask the computation time of the multiplication circuits 19 and 20;
a comparison circuit 35;
two multiplexers 36 and 37 used to control the subtraction circuits 27 and 28;
a multiplexer 38; and
a demultiplexer 39.
For further details on the making of these elements, reference may be made to D2.
To perform an elementary operation called a Pfield operation of the PField(A, B)N=A*B*I mod N type, A and B being encoded on a number m of k-bit words and I being an error equal to 2^(-m*k), the iteration of the next loop is performed m times with i being an index varying from 0 to m-1:
X=S(i-1)+Ai*B,
Y0=(X*J0) mod 2^k,
Z=X+(N*Y0),
S(i)=Z\2^k, \ being an integer division,
if S(i) is greater than N, then N is subtracted from S(i) before the next iteration, with S(-1)=0, Ai being the ith k-bit word of A, J0 being a k-bit word defined by the equation ((N*J0)+1) mod 2^k=0.
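The loop above can be modelled at word level in software. The following Python sketch (3.8 or later) is an illustration under stated assumptions, not a description of the coprocessor itself: the names `compute_j0` and `pfield` are hypothetical, N is assumed odd, B is assumed lower than N, and the comparison uses "greater than or equal to" so that the returned value always lies in the range 0 to N-1.

```python
def compute_j0(n_mod, k):
    """Return the k-bit word J0 such that ((N*J0)+1) mod 2^k == 0.

    Hypothetical helper; assumes n_mod is odd.
    """
    two_k = 1 << k
    return (-pow(n_mod, -1, two_k)) % two_k


def pfield(a, b, n_mod, m, k):
    """Word-level model of the loop: returns A*B*2^(-m*k) mod N.

    Assumes n_mod is odd, b < n_mod and a < 2^(m*k).
    """
    mask = (1 << k) - 1
    j0 = compute_j0(n_mod, k)
    s = 0                                # S(-1) = 0
    for i in range(m):
        a_i = (a >> (i * k)) & mask      # Ai: ith k-bit word of A
        x = s + a_i * b                  # X = S(i-1) + Ai*B
        y0 = (x * j0) & mask             # Y0 = (X*J0) mod 2^k
        z = x + n_mod * y0               # Z = X + N*Y0, a multiple of 2^k
        s = z >> k                       # S(i) = Z \ 2^k
        if s >= n_mod:                   # assumption: >= keeps S(i) below N
            s -= n_mod
    return s
```

For instance, with k=8 and m=4, `pfield(a, b, n, 4, 8)` can be compared against `(a * b * pow(2, -32, n)) % n` for any odd n below 2^32. The division by 2^k is exact at each iteration because Y0 is chosen so that the k least significant bits of Z are zero.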
The coprocessor of FIG. 1 enables the performance of a full iteration by a simultaneous shift, by m*k bits, of the registers 10 to 12 respectively containing B, S(i-1) and N followed by a shift, by 2*k bits, of the register 12 to store S(i), the word Ai being loaded into the register 21 and the word J0 being loaded into the register 17. To perform the full computation of PField(A, B)N, it is enough to repeat each iteration m times by changing the word Ai contained in the register 21 during each iteration.
The operation "X=S(i-1)+Ai*B" is done by means of the multiplication circuit 19 and the addition circuit 30. The operation "Y0=(X*J0) mod 2^k" is done, during the first k shifts, in the multiplication circuit 20, taking care to store J0 in the register 22, the result Y0 being stored in the register 18. The operation "Z=X+(N*Y0)", N and X having been delayed by k bits in the delay cells 32 and 34 and Y0 having been put into the register 22, is performed by means of the multiplication circuit 20 and the addition circuit 31. The operation "S(i)=Z\2^k" is done by a k-bit shift. The comparison of S(i) with N is done by the subtraction of N from S(i) in the subtraction circuit 29, N being delayed by k bits in the cell 33, a possible overflow being detected and stored in the comparison circuit 35 to find out the result of the comparison. The subtraction of N from S(i) is done during the next iteration in the subtraction circuit 28.
Many improvements have been made in this circuit. The improvements are aimed at obtaining higher speeds and/or reducing the size of the circuit and/or reducing the consumption of the circuit and/or providing additional functions without considerably increasing the size of the circuit. Those skilled in the art may refer, inter alia, to the publications of the European patent applications EP-0 712 070, EP-0 712 071, EP-0 712 072, EP-0 778 518, EP-0 784 262, EP-0 785 502, EP-0 785 503, EP-0 793 165, EP-0 853 275, and also to the publication of the international patent application WO/97 25668.
There is also another circuit known from the publication of the European patent application EP-0 566 498 (hereinafter D3) enabling the computation of the elementary operation P(A, B)N=A*B*I mod N, with I=2^(-n) and n being the size of A, B or N. The circuit of D3 uses a single parallel/series multiplication circuit, represented in D3 in the form of a parallel adder coupled with a shift register. The circuit of D3 does not reproduce exactly the Montgomery algorithm and uses an intermediate data element equal to (N-1)/2+1. The circuit of D3 uses a multiplication circuit having an n-bit parallel input and is limited to computation operands with a permanently fixed size. Furthermore, the size of the circuit of D3 is proportional to the size of the operands used, the surface area thus occupied being considerable.
The present invention is aimed at improving the prior art by proposing a coprocessor that uses a single multiplication circuit coupled to a computation circuit dedicated to the computation of Y0, with Y0=(X*J0) mod 2^k, J0 being defined by the equation ((N*J0)+1) mod 2^k=0. The computation of Y0 is done according to the invention bit by bit, one clock half-cycle before the use of each bit. The invention also proposes a method for the computation of a modular operation using the circuit for the computation of Y0.
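To illustrate why Y0 can be produced one bit at a time, the following Python sketch derives each bit of Y0 from the least significant bit of a running sum, without ever multiplying by J0. It is a behavioural sketch only: the function name `y0_bit_serial` is hypothetical, N is assumed odd, and the half-cycle timing of the circuit is not modelled. It relies on the fact that Y0 is the only k-bit word for which X+(N*Y0) has its k least significant bits equal to zero.

```python
def y0_bit_serial(x, n_mod, k):
    """Return Y0 = (X*J0) mod 2^k, computed bit by bit without using J0.

    Hypothetical sketch; assumes n_mod is odd.
    """
    acc = x
    y0 = 0
    for j in range(k):
        bit = (acc >> j) & 1        # bit j of the running sum X + N*(Y0 so far)
        if bit:
            y0 |= 1 << j            # bit j of Y0 equals that bit
            acc += n_mod << j       # N is odd, so this clears bit j of acc
    # At this point acc == x + n_mod*y0 and its k low bits are all zero.
    return y0
```

The result can be checked against the defining formula by computing J0 separately, for example as `(-pow(n_mod, -1, 1 << k)) % (1 << k)`, and comparing with `(x * j0) % (1 << k)`.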
Thus, there is a need to overcome the disadvantages of the prior art as discussed above.
In accordance with preferred embodiments of the present invention, as discussed below, it is intended to overcome the above-mentioned drawbacks of the prior art.
A preferred embodiment of the present invention provides an integrated circuit comprising a modular arithmetic coprocessor comprising storage means for storing and providing, in series, first and second operands A and B, a modulo N and a result S, with A as an integer encoded on a*k bits, a being a non-zero integer at most equal to m, and with B, N and S being integers encoded on at most m*k bits, m and k being integers greater than 1; computation means to perform modular operations according to the Montgomery method, wherein the computation means comprise a first output to provide an updated result S(i) bit by bit and a second output to give an anticipated result and wherein said coprocessor furthermore comprises a computation circuit that receives the anticipated result and produces an intermediate data element Y0 that is reinjected into the computation means to produce the updated result S(i).
Preferably, the circuit to compute the data element Y0 comprises a latch synchronized with the edges unused by the computation means, said latch being used to store a bit of Y0; means to give zero during a first half-cycle of a clock signal and to give the bit stored by the latch during the second half-cycle of the clock signal.
According to a second preferred embodiment of the present invention, a method performs a modular operation according to the Montgomery method by the serial shifting of first and second operands A and B, a modulo N and an updated result through computation means, with A as an integer encoded on a*k bits, a being a non-zero integer at most equal to m and with B, N and S being integers encoded on at most m*k bits, m and k being integers greater than 1, wherein an intermediate data element Y0 is computed such that Y0=((S(i-1)+(Ai*B))*J0) mod 2^k in an iterative loop indexed by i, i varying from 0 to a-1, with S(i-1) corresponding to the (i-1)th updated result, S(-1) being equal to 0, Ai being the ith k-bit word of A, and J0 being a k-bit word resolving the equation ((J0*N)+1) mod 2^k=0, wherein Y0 is given by a computation circuit that computes each bit of Y0 during the half-cycle of a clock period that precedes the use of said each bit of Y0.
Preferably, said method furthermore comprises the repetition of the following steps, i being an index varying from 0 to a-1:
the storage of a k-bit word Ai corresponding to a word with a place value i of A in a first latch of k bits;
the storage of the least significant k-bit word of N and then of Y0 in a second k-bit latch;
the addition in a parallel addition circuit of the words contained in the first and second latches;
the selection and supply either of the word contained in the first latch or of the word contained in the second latch or of the sum of the words contained in the first and second latches or of the word zero as a function firstly of a bit of B and secondly of a bit of Y0 or a bit of N;
the successive addition, in an accumulator circuit, of the words given by the selection device for each pair of bits of B and of N, the result of each addition being added to a bit of the preceding updated result S(ixe2x88x921) then offset by one bit and stored between each addition, the bit output from the accumulator during the shift corresponding to a new updated result S(i).
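The steps listed above can be sketched in software as a bit-serial model of one iteration. The Python code below is an interpretation offered for illustration, not the circuit of the invention: the function name `montgomery_iteration_bits` and all variable names are hypothetical, N is assumed odd, clock half-cycles are not modelled, and the conditional subtraction of N is left out. During the first k shifts the second latch holds the least significant word of N and the selection is driven by a bit of Y0; afterwards it holds Y0 and the selection is driven by a bit of N.

```python
def montgomery_iteration_bits(s_prev, a_i, b, n_mod, m, k):
    """Bit-serial model of one iteration; returns (S(i) before reduction, Y0).

    Hypothetical sketch; assumes n_mod is odd and operands fit in m*k bits.
    """
    latch1 = a_i                         # first latch: word Ai
    latch2 = n_mod & ((1 << k) - 1)      # second latch: low word of N, later Y0
    y0 = 0
    acc = 0                              # accumulator, shifted by one bit per step
    s_new = 0
    total = (m + 1) * k                  # k steps to build Y0, then m*k more
    for j in range(total):
        b_j = (b >> j) & 1
        s_j = (s_prev >> j) & 1
        if j < k:
            # Anticipated low bit of the sum decides bit j of Y0.
            sel = (acc + s_j + b_j * latch1) & 1
            y0 |= sel << j
        else:
            sel = (n_mod >> j) & 1       # afterwards, a bit of N drives the choice
        # Selection of zero, latch 1, latch 2 or their sum.
        word = (0, latch1, latch2, latch1 + latch2)[b_j | (sel << 1)]
        acc += word + s_j
        out = acc & 1                    # bit leaving the accumulator
        acc >>= 1
        if j >= k:
            s_new |= out << (j - k)      # the first k output bits are zero, dropped
        if j == k - 1:
            latch2 = y0                  # Y0 complete: replace N's low word
    s_new |= acc << (total - k)          # flush the bits left in the accumulator
    return s_new, y0
```

The value returned should agree with the word-level formulas, that is, Y0=((S(i-1)+(Ai*B))*J0) mod 2^k and S(i)=(S(i-1)+(Ai*B)+(N*Y0))\2^k, which gives a straightforward way to check the model on arbitrary operands.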
Other features and advantages of the preferred embodiments of the present invention will become apparent from the following detailed description. It should be understood, however, that the detailed description and specific examples, while indicating preferred embodiments of the present invention, are given by way of illustration only and various modifications may naturally be performed without deviating from the spirit of the present invention.