The present invention relates to apparatus and methods for modular multiplication and exponentiation and for serial integer division, and to apparatus operative to accelerate and secure computer peripherals, especially coprocessors used for cryptographic computations.
A compact microelectronic device for performing modular multiplication and exponentiation over large numbers is described in Applicant's U.S. Pat. No. 5,513,133, the disclosure of which is hereby incorporated by reference.
Security enhancements and performance accelerations for computational devices are described in Applicant's U.S. Pat. Nos. 5,742,530; 5,513,133; 5,448,639; 5,261,001; and 5,206,824; in published PCT patent application PCT/IL98/00148 (WO98/50851); in U.S. patent application Ser. No. 09/050,958; in Onyszchuk et al.'s U.S. Pat. No. 4,745,568; and in Omura et al.'s U.S. Pat. No. 4,587,627, the disclosures of which are hereby incorporated by reference.
The disclosures of all publications mentioned in the specification and of the publications cited therein are hereby incorporated by reference.
The present invention seeks to provide improved apparatus and methods for modular multiplication and exponentiation and for serial integer division, for accelerating and securing modular arithmetic processors, and for accelerating memory transfers to computer peripherals that need simplified, accelerated memory-to-peripheral data transfers with limited CPU core changes.
There is thus provided, in accordance with a preferred embodiment of the present invention, a modular multiplication and exponentiation system including a serial-parallel arithmetic logic unit (ALU) including a single multiplier including a single carry-save adder and preferably including a serial division device operative to receive a dividend of any bit length and a divisor of any bit length and to compute a quotient and a remainder.
Further in accordance with a preferred embodiment of the present invention, the system is operative to multiply at least one pair of integer inputs of any bit length.
Still further in accordance with a preferred embodiment of the present invention, the at least one pair of integer inputs includes two pairs of integer inputs.
Additionally in accordance with a preferred embodiment of the present invention, the ALU is operative to generate a product of integer inputs and to reduce the size of the product without previously computing a zero-forcing Montgomery constant, J0.
Also provided, in accordance with another preferred embodiment of the present invention, is serial integer division apparatus including a serial division device operative to receive a dividend of any bit length and a divisor of any bit length and to compute a quotient and a remainder.
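The behaviour of such a serial division device can be illustrated in software. The following is a minimal sketch only, assuming a conventional bit-serial restoring division; the function name `serial_divide` is hypothetical and the sketch does not describe the actual circuit of the invention.

```python
def serial_divide(dividend, divisor):
    """Bit-serial restoring division; returns (quotient, remainder).

    Illustrative model only: the dividend is consumed one bit per step,
    most significant bit first, as a serial device would receive it.
    """
    if dividend < 0 or divisor <= 0:
        raise ValueError("expects dividend >= 0 and divisor > 0")
    quotient, remainder = 0, 0
    for i in range(dividend.bit_length() - 1, -1, -1):
        # Shift the next dividend bit into the partial remainder.
        remainder = (remainder << 1) | ((dividend >> i) & 1)
        quotient <<= 1
        if remainder >= divisor:   # trial subtraction succeeds
            remainder -= divisor
            quotient |= 1
    return quotient, remainder
```

Because Python integers are unbounded, the sketch accepts a dividend and divisor of any bit length, for example `serial_divide(1000, 7)`.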
Further in accordance with a preferred embodiment of the present invention, the apparatus includes a pair of registers for storing a pair of integer inputs and which is operative to multiply a pair of integer inputs, at least one of which exceeds the bit length of its respective register, without interleaving.
Also provided, in accordance with yet another preferred embodiment of the present invention, is a modular multiplication and exponentiation system including a serial-parallel multiplying device having only one carry-save accumulator and being operative to perform a pair of multiplications and to sum results thereof.
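The idea of one accumulator serving two multiplications can be sketched as follows. This is an illustrative model under stated assumptions: the tuple `choices` loosely stands in for the four selectable values (zero, the first multiplicand, the second multiplicand, and their precomputed sum), and the function name `dual_multiply` is hypothetical.

```python
def dual_multiply(a, b, c, d, s=0):
    """Return a*b + c*d + s using a single running accumulator.

    At each bit position one of four values (0, a, c, a + c) is selected
    by the current bits of the two multipliers b and d, so both products
    are accumulated in the same pass. Operands are non-negative integers.
    """
    choices = (0, a, c, a + c)   # selectable multiplicand values
    acc = s
    for j in range(max(b.bit_length(), d.bit_length())):
        b_bit = (b >> j) & 1
        d_bit = (d >> j) & 1
        # Index 1 selects a, index 2 selects c, index 3 selects a + c.
        acc += choices[b_bit | (d_bit << 1)] << j
    return acc
```

Precomputing `a + c` once is what allows a single addition per bit position instead of two.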
Additionally provided, in accordance with still another preferred embodiment of the present invention, is a modular multiplication and exponentiation method including providing a serial-parallel arithmetic logic unit (ALU) including a single modular multiplying device having a single carry-save adder, and employing the serial-parallel ALU to perform modular multiplication and exponentiation.
Further provided, in accordance with yet another preferred embodiment of the present invention, is a method for natural (not modular) multiplication of large integers, the method including providing a serial-parallel arithmetic logic unit (ALU) including a single modular multiplying device having a single carry-save adder, and employing the serial-parallel ALU to perform natural (not modular) multiplication of large integers.
Further in accordance with a preferred embodiment of the present invention, the employing step includes multiplying a first integer of any bit length by a second integer of any bit length to obtain a first product, multiplying a third integer of any bit length by a fourth integer of any bit length to obtain a second product, and summing the first and second products with a fifth integer of any bit length to obtain a sum.
Still further in accordance with a preferred embodiment of the present invention, the employing step includes performing modular multiplication and exponentiation with a multiplicand, multiplier and modulus of any bit length.
Additionally in accordance with a preferred embodiment of the present invention, the system also includes a double multiplicand precomputing system for executing Montgomery modular multiplication with only one precomputed constant.
Further in accordance with yet another preferred embodiment of the present invention, the employing step includes performing Montgomery multiplication including generating a product of integer inputs including a multiplier and a multiplicand, and executing modular reduction without previously computing a Montgomery constant J0.
Further in accordance with a preferred embodiment of the present invention, the Montgomery constant J0 includes a function of N mod 2^k, where N is a modulus of the modular reduction and k is the bit-length of the multiplicand.
Still further in accordance with a preferred embodiment of the present invention, the employing step includes performing a sequence of interleaved Montgomery multiplication operations.
Additionally in accordance with a preferred embodiment of the present invention, each of the interleaved Montgomery multiplication operations is performed without previously computing the number of times the modulus must be summated into a congruence of the multiplication operation in order to force a result with at least k significant zeros.
Still further in accordance with a preferred embodiment of the present invention, the system also includes a data preprocessor operative to collect and serially summate multiplicands generated in an i'th interleaved Montgomery multiplication operation thereby to generate a sum and to feed in the sum to an (i+1)'th Montgomery multiplication operation.
Additionally in accordance with a preferred embodiment of the present invention, the function includes an additive inverse of a multiplicative inverse of N mod 2^k.
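For reference, the constant J0 that the invention seeks to obviate can be written down directly from this definition. The sketch below is illustrative only; the function name `montgomery_j0` is hypothetical, and it relies on the three-argument `pow` with a negative exponent, available in Python 3.8 and later.

```python
def montgomery_j0(n, k):
    """Return J0 = -(N^-1) mod 2^k for an odd modulus N.

    J0 is the additive inverse of the multiplicative inverse of
    N mod 2^k, so that N * J0 + 1 is divisible by 2^k.
    """
    if n % 2 == 0:
        raise ValueError("N must be odd to be invertible mod 2^k")
    r = 1 << k
    return (-pow(n, -1, r)) % r
```

For example, with N = 239 and k = 8 the sketch gives 241, and 239·241 + 1 = 57600 = 225·256.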
Further in accordance with a preferred embodiment of the present invention, the method also comprises computing J0 by resetting Ai and B to zero and setting S0=1.
The present invention also relates to a compact microelectronic arithmetic logic unit, ALU, for performing modular and normal (natural, non-negative field of integers) multiplication, division, addition, subtraction and exponentiation over very large integers. When referring to modular multiplication and squaring using Montgomery methods, reference is made to the specific parts of the device as a modular arithmetic coprocessor, and the acronym, MAP, is used. Reference is also made to the Montgomery multiplication methods as MM.
The present invention also relates to arithmetic processing of large integers. These large numbers can be in the natural field of (non-negative) integers or in the Galois field of prime numbers, GF(p), and also of composite prime moduli. More specifically, the invention relates to a device that can implement modular multiplications/exponentiations of large numbers, which is suitable for performing the operations essential to Public Key Cryptographic authentication and encryption protocols, which work over increasingly large operands and which cannot be executed efficiently with present generation modular arithmetic coprocessors, and cannot be executed securely with software implementations. The invention can be driven by any 4 bit or longer processor, achieving speeds which can surpass present day digital signal processors.
The present invention also relates to the hardware implementation of large operand integer arithmetic, especially as concerns the numerical manipulations in a derivative of a procedure known as the interleaved Montgomery multiprecision modular multiplication method often used in encryption software oriented systems, but also of intrinsic value in basic arithmetic operations on long operand integers; in particular, A·B+C·D+S, wherein there is no theoretical limit on the sizes of A, B, C, D, or S. In addition, the device is especially attuned to perform modular multiplication and exponentiation. The basic device is particularly suited to be a modular arithmetic co-processor (MAP), also including a device for performing division of very large integers, wherein the divisor can have a bit length as long as the modulus register N and the bit length of the dividend can be as large as the bit length of two concatenated registers.
This device preferably performs all of the functions of U.S. Pat. No. 5,513,133, with the same order of logic gates, in less than half the number of machine clock cycles. This is mostly because there is only one double action serial/parallel multiplier instead of two half size multipliers using the same carry save accumulator mechanism, the main component of a conventional serial parallel multiplier. The new arithmetic logic unit, ALU, or specifically the modular arithmetic coprocessor, MAP, preferably intrinsically obviates a separate multiplication process which would have preceded the new process. This process would also have required a second Montgomery constant, J0, which is now also preferably obviated. Stated differently, instead of the two constants in the previous Montgomery procedures, and the delays encountered, only one constant is now computed, and the delay caused by the now superfluous J type multiplications (explained later) is preferably removed.
Further, by better control of the data manipulations between the CPU and this peripheral device, operations which are performed on operands longer than the natural register size of the device can preferably be performed at reduced processing times using less temporary storage memory.
Three related methods are known for performing modular multiplication with Montgomery's methodology. [P. L. Montgomery, "Modular multiplication without trial division", Mathematics of Computation, vol. 44, pp. 519-521, 1985], hereinafter referred to as "Montgomery", [S. R. Dussé and B. S. Kaliski Jr., "A cryptographic library for the Motorola DSP 56000", Proc Eurocrypt '90, Springer-Verlag, Berlin, 1990] hereinafter referred to as "Dussé", the method of U.S. Pat. No. 4,514,592 to Miyaguchi, and the method of U.S. Pat. No. 5,101,431, to Even, and the method of U.S. Pat. No. 5,321,752 to Iwamura, and the method of U.S. Pat. No. 5,448,639, to Arazi, and the method of U.S. Pat. No. 5,513,133 to Gressel.
The preferred architecture is of a machine that can be integrated to any microcontroller design, mapped into the host controller's memory, while working in parallel with the controller, which, for very long commands, constantly swaps or feeds operands to and from the data feeding mechanism. This allows for modular arithmetic computations of any popular length, where the size of the coprocessor volatile memory necessary for manipulations should rarely be more than three times the length of the largest operand.
This solution preferably uses only one multiplying device which inherently serves the function of two multiplying devices, in previous implementations. Using present popular technologies, it enables the integration of the complete solution including a microcontroller with memories onto a 4 by 4.5 by 0.2 mm microelectronic circuit.
The invention is also directed to the architecture of a digital device which is intended to be a peripheral to a conventional digital processor, with computational, logical and architectural novel features relative to the processes published by Montgomery and Dussé, as described in detail below.
A concurrent process and a unique hardware architecture are provided, to perform modular exponentiation without division preferably with the same number of operations as would be performed with a classic multiplication/division device, wherein a classic device would perform both a multiplication and a division on each operation. A particular feature of a preferred embodiment of the present invention is the concurrency of operations performed by the device to allow for unlimited operand lengths, with uninterrupted efficient use of resources, allowing for the basic large operand integer arithmetic functions.
The advantages realized by a preferred embodiment of this invention result from a synchronized sequence of serial processes, which are merged to simultaneously (in parallel) achieve three multiplication operations on n bit operands, using one multiplexed k bit serial/parallel multiplier in (n+k) effective clock cycles, accomplishing the equivalent of three multiplication computations, as prescribed by Montgomery.
By synchronizing, detecting on the fly, preloading on the fly, and simultaneously adding the operands to be used next, the machine operates in a deterministic fashion, wherein all multiplications and exponentiations are executed in a predetermined number of clock cycles. Conditional branches are replaced with local detection and compensation devices, thereby providing a basis for the simple type control mechanism, which, when refined, typically includes a series of self-exciting cascaded counters. The basic operations herein described can be executed in deterministic time using the device described in U.S. Pat. No. 5,513,133 as manufactured both by Motorola in East Kilbride, Scotland under the trade name SC-49, and by SGS-Thomson in Rousset, France, under the trade name ST16-CF54.
The machine has particularly lean demands on volatile memory for most operations, as operands are loaded into and stored in the device for the total length of the operation, however, the machine preferably exploits the CPU onto which it is appended, to execute simple loads and unloads, and sequencing of commands to the machine, whilst the machine performs its large number computations. The exponentiation processing time is virtually independent of the CPU which controls it. In practice, no architectural changes are necessary when appending the machine to any CPU. The hardware device is self-contained, and can be appended to any CPU bus.
Apparatus for accelerating the modular multiplication and exponentiation process is preferably provided, including means for precomputing the necessary constants.
The preferred embodiments of the invention described herein provide a modular mathematical operator for public key cryptographic applications on portable Smart Cards, typically identical in shape and size to the popular magnetic stripe credit and bank cards. Similar Smart Cards (as per U.S. Pat. No. 5,513,133) are being used in the new generation of public key cryptographic devices for controlling access to computers, databases, and critical installations; to regulate and secure data flow in commercial, military and domestic transactions; to decrypt scrambled pay television programs, etc. It should be appreciated that these devices are also incorporated in computer and fax terminals, door locks, vending machines, etc.
The hardware described carries out modular multiplication and exponentiation by applying the  operator in a novel way. Further, the squaring can be carried out in the same method, by applying it to a multiplicand and a multiplier that are equal. Modular exponentiation involves a succession of modular multiplications and squarings, and therefore is carried out by a method which comprises the repeated, suitably combined and oriented application of the aforesaid multiplication, squaring and exponentiation methods.
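The composition of exponentiation from Montgomery multiplications and squarings can be sketched as below. This is an illustrative software model under stated assumptions, not the circuit of the invention: the reduction uses the textbook form with the constant J0 for brevity, and the conversion constant named `h` (2^(2n) mod N), as well as the function names `mont_mul` and `mont_pow`, are hypothetical choices for this example. It requires Python 3.8 or later.

```python
def mont_mul(a, b, n, n_bits):
    """Return a*b*2^(-n_bits) mod n, for odd n and a, b < n < 2^n_bits."""
    r = 1 << n_bits
    j0 = (-pow(n, -1, r)) % r          # textbook constant, used here for brevity
    x = a * b
    y = ((x % r) * j0) % r             # forces the low n_bits of x + y*n to zero
    z = (x + y * n) >> n_bits
    return z - n if z >= n else z      # single conditional subtraction

def mont_pow(base, exponent, n):
    """Return base**exponent mod n by repeated squaring and multiplying."""
    n_bits = n.bit_length()
    h = pow(2, 2 * n_bits, n)          # assumed conversion constant
    a_bar = mont_mul(base % n, h, n, n_bits)   # base in Montgomery form
    result = mont_mul(1, h, n, n_bits)         # 1 in Montgomery form
    for bit in bin(exponent)[2:]:      # scan exponent bits, MS bit first
        result = mont_mul(result, result, n, n_bits)      # squaring
        if bit == "1":
            result = mont_mul(result, a_bar, n, n_bits)   # multiplication
    return mont_mul(result, 1, n, n_bits)      # leave Montgomery form
```

Squaring is simply `mont_mul` applied to equal multiplicand and multiplier, as stated above.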
When describing the workings of a preferred embodiment of the ALU we describe synchronization in effective clock cycles, referring to those cycles when the unit is performing an arithmetic operation, as opposed to real clock cycles, which would include idle cycles during which the ALU may stand idle while multiplexers, flipflops, and other device settings are altered in preparation for a new phase of operations.
In a preferred embodiment, a method is provided for executing a Montgomery modular multiplication (with reference to squaring and normal multiplication), wherein the multiplicand A (which may be stored either in the CPU's volatile RAM or in the SA register, 130), the multiplier B in the B register, 1000, which is a concatenation of 70 and 80, and the modulus N in the N register, 1005, which is a concatenation of 200 and 210, each comprise m characters of k bits each, the multiplicand and the multiplier generally not being greater than the modulus. The method comprises the steps of:
1) loading the multiplier B and the modulus, N, into respective registers of n bit length, wherein n=m·k;
{multiplying in normal field positive, natural, integers, N is a second multiplier}
{if n is longer than the B, N and S registers, values are typically loaded and unloaded in and out of these registers during the course of an iteration, allowing the machine to be virtually capable of manipulating any length of modulus}
2) setting the output of the register SB to zero, S*d Flush (250)=1 for the first iteration;
3) resetting extraneous borrow and carry flags (controls, not specified in the patent),
4) executing m iterations, each iteration comprising the following operations:
(0≤i≤m−1)
a) transferring the next character Ai−1 of the multiplicand A from volatile storage to the Ai Load Buffer, 290.
b) simultaneously serially loading the Ci Load Buffer, 320, with N0 (the LS k bits of N), while rotating the contents of the Ai Load Buffer, thereby serially adding the contents of the Ai Load Buffer with N0 by means of the serial adder FA1, 330, thereby serially loading the Ai+Ci Load Buffer with the sum N0+Ai−1,
The preloading phase ends here. This phase is typically executed whilst the MAP was performing a previous multiplication iteration. Processes a) and b) can be executed simultaneously, wherein the Ai−1 character is loaded into its respective register, whilst the Ai stream is synchronized with the rotation of the N0 register, loading R2, 320. Simultaneously, the Ai stream and the N0 stream are summated and loaded into the R3 register, 340.
Squaring a quantity from the B register, can be executed wherein at the initialization, Steps a) and b) the first k bits of Bd are inserted into R1, as the B0 register is rotated, simultaneously with the N0 register. Subsequent k bit Bi strings are preloaded into the R1 register, as they are fed serially into the ALU.
c) the machine is stopped. Operands in buffers R1, R2, and R3 are latched into latches L1, 360; L2, 370; and L3, 380.
The L0 "0" latch is a pseudo latch, as this is simply a literal command signal entering each of the AND gates in the inputs or outputs of the multiplexer, 390.
d) for the next k effective clock cycles
i) at each effective clock cycle the Y0 SENSE anticipates the next bit of Y0 and loads this bit through M3 multiplexer, 300, into the Ci Load Buffer, while shifting out the Ai bits from the R1 register and simultaneously loading the Ci Load Buffer with k bits of Y0 and adding the output of R1 with Y0 and loading this value into the R3 Buffer,
ii) simultaneously multiplying N0 (in L2, Ci Latch) by the incoming Y0 bit, and multiplying Ai by the next incoming bit of Bd, by means of logically choosing through the M_K multiplexer, 390, the desired value from one of the four latches, L0, L1, L2 or L3; thereby adding the two results. If neither the Y0 bit nor the B bit is one, an all zero value is multiplexed into the CSA; if only the Y0 bit is one, N0 alone is multiplexed/added into the CSA; if only the B bit is a one, Ai−1 is added into the CSA; if both the B bit and the Y0 bit are ones, then the sum of Ai−1 and N0 is added into the CSA,
iii) then adding to this summation; as it serially exits the Carry Save k+1 Bit Accumulator bit by bit, (the X stream); the next relevant bit of Sd in through the serial adder, FA2, 460,
In MM these first k bits of the Z stream are zero. In this first phase the result of Y0·N0+Ai−1·B0+S0 has been computed, the LS k all zero bits appeared on the Z*out stream, and the MS k+1 bits of the multiplying device are saved in the CSA Carry Save Accumulator; the R1, R2 and R3 preload buffers hold the values Ai−1, Y0 and Y0+Ai−1, respectively.
e) at the last effective, (m+1)·k'th, clock cycle the machine is stopped, buffers R2 and R3 are latched into L2 and L3.
The value of L1 is unchanged.
The initial and continuing conditions for the next k·(m−1) effective clock cycles are:
the multipliers are the bit streams from B, starting from the k'th bit of B and the remaining bit stream from N, also starting from the k'th bit of N;
and the multiplicands in L1, L2, and L3 are Ai−1, Y0, and Y0+Ai−1, at the start the CS adder contains the value as described in d), and the S stream will feed in the next k·(m−1) bits into the FA2 full adder; during the next k·m effective clock cycles, Nd, delayed k clock cycles in unit 470, is subtracted in serial subtractor, 480, from the Z stream, to sense if (Z/2^k mod 2^(k·m)), the result which is to go into the B or S register, is larger than or equal to N. Regardless of what is sensed by the serial subtractor, 480, if at the {(m+1)·k}'th effective clock cycle, the SO1 flip-flop of the CSA is a one, then the total result is certainly larger than N, and N will be subtracted from the result, as the result, partial or final, exits its register.
f) for the next k·(m−1) effective clock cycles:
the N0 Register, 210, is rotated either synchronously with incoming Ai bits, or at another suitable timing, loading R1, R2, and R3, as described in a) and b), for the next iteration,
for these k·(m−1) effective clock cycles, the remaining MS bits of N now multiply Y0, the remaining MS B bits continue multiplying Ai−1. If neither the N bit nor the B bit is one, an all zero value is multiplexed into the CSA. If only the N bit is one, Y0 alone is multiplexed/added into the CSA. If only the B bit is a one, Ai−1 is added into the CSA. If both the B bit and the N bit are ones, then Ai−1+Y0 is added into the CSA.
Simultaneously the serial output from the CSA is added to the next k·(m−1) S bits through the FA2 adder, unit 460, which outputs the Z stream,
the relevant part of the Z output stream is the first non-zero k·(m−1) bits of Z.
The Z stream is switched into the SB register, for the first m−1 iterations and into the SB or B register, as defined for the last iteration;
on the last iteration, the Z stream, disregarding the LS k zero bits, is the final B* stream. This stream is directed to the B register, to be reduced by N, if necessary, as it is used in subsequent multiplications and squares;
on the last iteration, Nd, delayed k clock cycles, is subtracted by a serial subtractor from the Z stream, to sense if the result, which goes into B, is larger than or equal to N.
At the end of this stage, all the bits from the N, B, and SB registers have been fed into the ALU, and the final k+1 bits of result are in the CSA, ready to be flushed out.
g) the device is stopped. The S flush, 250; the B flush, 240; and the N flush, 260, are set to output zero strings, to assure that in the next phase the last k+1 most significant bits will be flushed out of the CSA. (In a regular multiplication, the M7 MUX, 450, is set to accept the Last Carry from the previous iteration of S.) S has m·k+1 significant bits, but the S register has only m·k cells to receive this data. This last bit is intrinsically saved in the overflow mechanism.
As was explained in e), Nd, delayed k clock cycles in 470, is subtracted from the Z stream, synchronized with the significant outputs from X, to provide a fine-tune to the sense mechanism to determine if the result which goes into the B or S register is larger than or equal to N. 480 and 490 comprise a serial comparator device, where only the last borrow command bit for modular calculations, and the (k·m+1)'th bit for regular multiplications in the natural field of integers are saved.
this overflow/borrow command is detected at the m·k'th effective clock cycle.
h) The device is clocked another k cycles, completely flushing out the CSA, while another k bits are exiting Z to the defined output register.
The instruction to the relevant flip flop commanding serial subtractor 90 or 500 to execute a subtract of N on the following exit streams is set at the last effective, (m+1)·k'th, clock cycle of the iteration if (Z/2^k−N)≥N (Z includes the m·k'th MS bit), sensed by the following signals:
the SO1 bit, which is the data out bit from the second least significant cell of the CSA, is a one,
or if the COZ bit, which is the internal carry out in the X+S adder, 460, is a one.
or if the borrow bit from the 480 sense subtractor is not set.
This mechanism appears in U.S. Pat. No. 5,513,133 as manufactured both by Motorola and SGS-Thomson.
For multiplication in the field of natural numbers, it is preferable to detect an overflow if the m−k'th MS bit is a one; this can happen in the superscalar multiplier, and cannot happen in the mechanism of U.S. Pat. No. 5,513,133. This overflow can then be used in the next iteration to insert a MS one in the S (temporary result) stream.
j) is this the last iteration
NO, return to c)
YES continue to m)
k) the correct value of the result can now exit from either the B or S register.
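The arithmetic effect of the iterations above, namely forcing k zero bits per character without a precomputed J0, can be modelled in software. The following is a minimal sketch under stated assumptions: it reproduces only the numerical result of the steps, not the register, latch and timing behaviour; the function name `interleaved_mont_mul` is hypothetical; and it assumes N odd, B < N and A < 2^(k·m). The check in the usage note requires Python 3.8 or later.

```python
def interleaved_mont_mul(a, b, n, k, m):
    """Return a*b*2^(-k*m) mod n, deciding each Y0 bit on the fly.

    For each k-bit character of a, the partial result is extended by
    a_i * b, and n (shifted to the current rank) is added whenever the
    current least significant unresolved bit is one. Because n is odd,
    each such addition clears that bit without disturbing lower bits.
    """
    if n % 2 == 0:
        raise ValueError("modulus must be odd")
    mask = (1 << k) - 1
    s = 0                                   # temporary result (S)
    for i in range(m):
        a_i = (a >> (k * i)) & mask         # next character of A
        z = s + a_i * b
        for j in range(k):
            if (z >> j) & 1:                # sensed Y0 bit is one
                z += n << j                 # add N at the same rank
        s = z >> k                          # discard the k zero LS bits
    return s - n if s >= n else s           # final conditional subtraction
```

For example, with k = 8 and m = 4 the result can be checked against `a * b * pow(2, -32, n) % n`.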
Y0 bits are anticipated in the following manner in the Y0S-Y0SENSE unit, 430, from five deterministic quantities:
i the LS bit of the Ai (L1) Latch AND the next bit of the Bd Stream; A0·Bd;
ii the LS Carry Out bit from the Carry Save Accumulator; CO0;
iii the Sout bit from the second LS cell of the CSA; SO1;
iv the next bit from the S stream, Sd,
v the Carry Out bit from the 460, Full Adder; COZ;
These five values are XORed together to produce the next Y0 bit, Y0i:
Y0i = A0·Bd ⊕ CO0 ⊕ SO1 ⊕ Sd ⊕ COZ
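The equation can be transcribed directly as a bit-level function. This is an illustrative sketch only; the function name `y0_sense` and the argument names are hypothetical labels for the five signals listed above, each taken to be 0 or 1.

```python
def y0_sense(a0, bd, co0, so1, sd, coz):
    """Return the anticipated Y0 bit from the five sensed quantities.

    a0 AND bd forms the first quantity; the result is the XOR (parity)
    of that product with the remaining four one-bit signals.
    """
    return (a0 & bd) ^ co0 ^ so1 ^ sd ^ coz
```

The result is one exactly when an odd number of the five quantities are one.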
If the Y0i bit is a one, then another N of the same rank (multiplied by the necessary power of 2) is typically added; otherwise N, the modulus, is typically not added.
Multiplication of long natural integers in the normal field of numbers.
This apparatus is suited to efficiently perform multiplications and summations of normal integers. If these operands are all of no longer than k bit length, the process preferably is executed without interleave, where the Z stream of 2k+1 bits are directed to final storage. For integers longer than k bits, the process is similar to the predescribed interleaved modular arithmetic process, excepting that the result will now potentially be one bit longer than twice the length of the longest operand. Further the apparatus of the invention is capable, using the resources available in the described device, to simultaneously perform two separate multiplications, A, multiplicand, preferably loaded in segments in the R1-Ai register, times B, the multiplier, of A, preferably loaded into the B register as previously designated, plus N, a second multiplier, preferably loaded into the N register, times an operand, C, loaded into the R2 Register, plus S, a bit stream entering the apparatus, on the first iteration, only from the Sd, signal line, preferably from the SA register. The Y0 SENSE apparatus is not used. The multiplicands are summated into the R3 register prior to the initiation of an iteration. At initiation of the iteration, registers R1, R2, and R3 are copied into latches L1, L2, and L3 until the end of an iteration. Meanwhile, during the mk+k+1 effective clock cycles of an iteration, the next segments of A and C are again preloaded and summated in preparation for the next iteration.
At each iteration, the first LS k bits of the result on the Z stream, which are now not by definition zero (as in MM), are directed to a separate storage, vacated to accumulate the LS portion of the result, again suitably the SA register. The most significant mk+1 bits comprise the SB, temporary quantity, for the next iteration. In the last phase, similar to g, i, and j, the CSA is flushed out of accumulated value. The LS portion, for numbers which are longer than the multiplier registers, can be exited through the normal data out register and unloader, units 60 and 30, respectively.
The MS, 2m'th bit of the result is read from the LAST CARRY bit of the FA2, unit 460, through the OVERFLOW signal line.
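The interleaved natural multiplication described above can be modelled numerically as follows. This is a minimal sketch under stated assumptions: it reproduces only the arithmetic of A·B + C·N + S with A and C consumed in k-bit segments, not the register and latch timing; the function name `interleaved_natural_mul` and the variable names are hypothetical; and all operands are taken to be non-negative integers.

```python
def interleaved_natural_mul(a, b, c, n, s, k):
    """Return a*b + c*n + s, consuming a and c in k-bit segments.

    Each iteration emits k least significant result bits to separate
    storage (low) and carries the remaining MS bits forward as the
    temporary quantity for the next iteration.
    """
    mask = (1 << k) - 1
    longest = max(a.bit_length(), c.bit_length())
    segments = max(1, -(-longest // k))     # ceiling division
    low = 0                                 # accumulated LS portion of result
    carry = s                               # S enters on the first iteration only
    for i in range(segments):
        a_i = (a >> (k * i)) & mask
        c_i = (c >> (k * i)) & mask
        z = carry + a_i * b + c_i * n       # both products summed in one pass
        low |= (z & mask) << (k * i)        # LS k bits go to final storage
        carry = z >> k                      # MS bits feed the next iteration
    return low | (carry << (k * segments))
```

Unlike the modular case, the k least significant bits of each iteration are kept rather than forced to zero.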
The present invention also relates to a compact microelectronic specialized arithmetic logic unit, for performing modular and normal (natural, non-negative field of integers) multiplication, division, addition, subtraction and exponentiation over very large integers. When referring to modular multiplication and squaring using Montgomery methods, reference is made to the specific parts of the device as a modular arithmetic coprocessor, MAP, also as relates to enhancements existing in the applicant's U.S. Patent pending Ser. No. 09/050,958 filed Apr. 31, 1998.
Preferred embodiments of the invention described herein provide a modular computational operator for public key cryptographic applications on portable Smart Cards, typically identical in shape and size to the popular magnetic stripe credit and bank cards. Similar Smart Cards (as per technology of U.S. Pat. Nos. 5,513,133 and 5,742,530) are being used in the new generation of public key cryptographic devices for controlling access to computers, databases, and critical installations; to regulate and secure data flow in commercial, military and domestic transactions; to decrypt scrambled pay television programs, etc. Typically, these devices are also incorporated in computer and fax terminals, door locks, vending machines, etc.
The preferred architecture is of an apparatus operative to be integrated to a multiplicity of microcontroller designs while the apparatus operates in parallel with the controller. This is especially useful for long procedures that swap or feed a multiplicity of operands to and from the data feeding mechanism, allowing for modular arithmetic computations of any conventional length.
This embodiment preferably uses only one multiplying device which inherently serves the function of two multiplying devices, basically similar to the architecture described in applicant's U.S. Pat. No. 5,513,133 and further enhanced in U.S. patent application Ser. No. 09/050,958 and PCT application PCT/IL98/00148. Using present conventional microelectronic technologies, the apparatus of the present invention may be integrated with a microcontroller with memories onto a 4 by 4.5 by 0.2 mm microelectronic circuit.
The present invention also seeks to provide an architecture for a digital device which is a peripheral to a conventional digital processor, with computational, logical and architectural novel features relative to the processes described in U.S. Pat. No. 5,513,133.
A concurrent process and a unique hardware architecture are provided, to perform modular exponentiation without division preferably with the same number of operations as are typically performed with a classic multiplication/division device, wherein a classic device typically performs both a multiplication and a division on each operation. A particular feature of a preferred embodiment of the present invention is the concurrency of operations performed by the device to allow for unlimited operand lengths, with uninterrupted efficient use of resources, allowing for the basic large operand integer arithmetic functions.
The advantages realized by a preferred embodiment of this invention result from a synchronized sequence of serial processes. These processes are merged to simultaneously (in parallel) achieve three multiplication operations on n bit operands, using one multiplexed k bit serial/parallel multiplier in (n+k) effective clock cycles. This procedure accomplishes the equivalent of three multiplication computations, as described by Montgomery.
By synchronizing loading of operands into the MAP, detecting values of operands on the fly, and preloading and simultaneously adding next to be used operands on the fly, the apparatus is operative to execute computations in a deterministic fashion. All multiplications and exponentiations are executed in a predetermined number of clock cycles. Additional circuitry is preferably added which preloads, on the fly, three first k bit variables for a next iteration Montgomery squaring sequence. A detection device is preferably provided where only two of the three operands are chosen as next iteration multiplicands, eliminating k effective clock cycle wait states. Conditional branches are replaced with local detection and compensation devices, thereby providing a basis for a simple control mechanism, which, when refined, typically includes a series of self-exciting cascaded counters. The basic operations herein described are typically executed in deterministic time using a device described in U.S. Pat. No. 5,513,133 to Gressel et al or devices as manufactured by Motorola in East Kilbride, Scotland under the trade name MSC501, and by STMicroelectronics in Rousset, France, under the trade name ST16-CF54.
The apparatus of the present invention has particularly lean demands on external volatile memory for most operations, as operands are loaded into and stored in the device for the total length of the operation. The apparatus preferably exploits the CPU onto which it is appended, to execute simple loads and unloads, and sequencing of commands to the apparatus, whilst the MAP performs its large number computations. Large numbers presently being implemented on smart card applications range from 128 bit to 2048 bit natural applications. The exponentiation processing time is virtually independent of the CPU which controls it. In practice, architectural changes are typically unnecessary when appending the apparatus to any CPU. The hardware device is self-contained, and is preferably appended to any CPU bus.
In general, the present invention also relates to arithmetic processing of large integers. These large numbers are typically in the natural field of (non-negative) integers or in the Galois field of prime numbers, GF(p), and also of composite prime moduli. More specifically, a preferred embodiment of the present invention seeks to provide a device that can implement modular exponentiation of large numbers. Such a device is suitable for performing the operations of Public Key Cryptographic authentication and encryption protocols, which work over increasingly large operands and which cannot be executed efficiently with present generation modular arithmetic coprocessors, and cannot be executed securely in software implementations. The methods described herein are useful for the most popular modular exponentiation computation methods, where sequences of square and multiply have been made identical in the steps executed. Both operations are enacted simultaneously, where the unused result is switched to an unused data register segment. Mock squaring operations, often called dummy squaring operations, are performed preferably using a result of a previous square which precedes a multiplication operation, as the next multiplicand operand. If a square result is not reused, the sequence is more difficult to detect. The terms "mock" and "dummy" are used to describe an operation which acts in many ways like another operation and, in particular, leaves temporary unused [trashed] results. Usually the intent is to dissuade an adversary from attempting to probe a given device. Further, the present invention seeks to modify aspects of loading and unloading operands, and the computations thereof, in order to both accelerate the system response, and to secure computations against potential attacks on public key cryptographic systems.
A preferred embodiment of the present invention seeks to provide a hardware implementation of large operand integer arithmetic, especially as concerns the numerical manipulations in a derivative of a procedure known as the interleaved Montgomery multiprecision modular multiplication (MM) method as described herein. MM is often used in encryption software oriented systems. The preferred embodiment is of particular value in basic arithmetic operations on long operand integers; in particular, A*B+C*D+S, wherein there is no theoretical limit on the sizes of A, B, C, D, or S. In addition, a preferred embodiment of the present invention is especially attuned to perform modular multiplication and exponentiation and to perform elliptic curve scalar point multiplications over the GF(p) field.
For modular multiplication in the prime and composite field of odd numbers, A and B are defined as the multiplicand and the multiplier, respectively, and N is defined as the modulus in modular arithmetic. N is typically larger than A or B. N also denotes the register where the value of the modulus is stored. N is, in some instances, smaller than A. A, B, and N are defined as m·k=n bit long operands. Each k bit group is called a character, the size of the group defined by the size (number of cells) of the multiplying device.
Then A, B, and N are each m characters long. For ease in following the step by step procedural explanations, assume that A, B, and N are 512 bits long, (n=512); assume that k is 128 bits long because of the present cost effective length of such a multiplier, and data manipulation speeds of simple CPUs. With k=128, a 512 bit operand has m=4 characters; m=8 is the number of characters in an operand, and also the number of iterations in a squaring or multiplying loop, with a 1024 bit operand. All operands are positive integers. More generally, A, B, N, n, k and m may assume any suitable values.
In non-modular functions, the N and S registers can preferably be used for temporary storage of other arithmetic operands.
The symbol ≡, or in some instances =, is used to denote congruence of modular numbers, for example 16 ≡ 2 mod 7. 16 is termed "congruent" to 2 modulo 7 as 2 is the remainder when 16 is divided by 7. When Y mod N ≡ X mod N, both Y and X may be larger than N; however, for positive X and Y, the remainders are identical. Note also that the congruence of a negative integer Y is Y+uN, where N is the modulus, and if the congruence of Y is to be less than N, u is the smallest integer which gives a positive result.
The Yen symbol, ¥, is used to denote congruence in a more limited sense. During the processes described herein, a value is often either the desired value, or equal to the desired value plus the modulus. For example, X ¥ 2 mod 7. X can be equal to 2 or 9. X is defined to have limited congruence to 2 mod 7. When the Yen symbol is used as a superscript, as in B¥, then 0 ≦ B¥ < 2N, or stated differently, B¥ is either equal to the smallest positive B which is congruent to B¥, or is equal to the smallest positive congruent B plus N, the modulus.
When X=A mod N, X is defined as the remainder of A divided by N; e.g., 3=45 mod 7.
In number theory, the modular multiplicative inverse of X is written as X^−1, which is defined by X·X^−1 mod N = 1. If X=3, and N=13, then X^−1=9, i.e., the remainder of 3·9 divided by 13 is 1.
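The inverse in the example above can be checked with a minimal Python sketch. The function name mod_inverse is an illustrative assumption and not part of the specification; it relies on Python's built-in three-argument pow, which accepts an exponent of −1 from Python 3.8 onward.

```python
def mod_inverse(x, n):
    """Return the multiplicative inverse of x modulo n (requires gcd(x, n) == 1)."""
    # Built-in modular exponentiation with exponent -1 yields the inverse.
    return pow(x, -1, n)


inv = mod_inverse(3, 13)
print(inv)              # inverse of 3 modulo 13
print((3 * inv) % 13)   # remainder of 3 * inverse divided by 13
```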
The acronyms MS and LS are used to signify "most significant" and "least significant", respectively, when referencing bits, characters, and full operand values, as is conventional in digital nomenclature.
Characters in this document are words which are k bits long. Characters are denoted by indexed capitals, wherein the LS character is indexed with a zero, e.g., N0 is the least significant character of N, and the MS character is typically indexed m−1, e.g., N_(m−1) is the most significant character of N.
Throughout this specification N designates both the value N, and the name of the shift register which stores N. An asterisk superscript on a value denotes that the value, as it stands, is potentially incomplete or subject to change. A is the value of the number which is to be exponentiated, and n is the bit length of the N operand. After initialization, when A is "Montgomery normalized" to A* (A* = 2^n·A, to be explained later), A* and N are typically constant values throughout the intermediate steps in the exponentiation. During the first iteration, after initialization of an exponentiation, B is equal to A*. B is also the name of the register wherein the accumulated value that finally equals the desired result of exponentiation resides. S or S* designates a temporary value, and S also designates the register or registers in which all but the single MS bit of S is stored. (S* concatenated with this MS bit is identical to S.) S(i−1) denotes the value of S at the outset of the i'th iteration; S0 denotes the LS character of an S(i)'th value.
Montgomery multiplication, MM, is actually (A·B·2^−n) mod N, where n is typically the length of the modulus. This is written (A·B)N, and denotes MM or multiplication in the P field. In the context of Montgomery mathematics, we refer to multiplication and squaring in the P field as multiplication and squaring operations.
The apparatus of the present invention preferably performs all of the functions described in U.S. Pat. No. 5,513,133, and in U.S. patent application Ser. No. 09/050,958 [same as PCT/IL98/00148], with the same order of electronic gates, in less than half the number of machine clock cycles, in the first instance, and an additional savings in clock cycles in the second instance. Reduction in performance clock cycles is advantageous on short operand computations, e.g., for use in elliptic curve cryptosystems. This is mostly because there is only one double action serial/parallel multiplier instead of two half size multipliers using the same carry save accumulator (CSA, 410) mechanism. Another explanation is that many of the intrinsic hardware delays have been eliminated, and a CPU loading/unloading hardware method has been developed to greatly shorten memory to peripheral and peripheral to memory data transfers. Furthermore, an "on the fly" preload operation has preferably replaced a time consuming preload operation for the first iteration of a squaring operation, and also replaces a complementary mock preload on a multiplication operation. In addition, sequences and methods have been developed which simultaneously accelerate computations and prevent external analysis of secret operations, e.g., determining the secret exponent used in RSA signatures, or determining the secret random number used in the NIST Digital Signature Standard or in Elliptic Curve Signatures.
Much attention is addressed to dissuading adversaries from non-invasively monitoring the current dissipated in the cryptocomputer. Signal, in the sense of taking such measurements, is that current which is dissipated in sequences, and is used in statistical tests to determine secret values used in a computation. Pseudo-signal is, in this sense, current which is dissipated in a random or pseudo-random fashion to compensate for, and add to, signal, thereby helping to deceive an adversary. Added noise is randomly generated noise, which is typically not synchronized to variations in signal. Noise in this sense is that part of the detected data which in any way interferes with the detection of signal. Energy decoupling refers to the process of arbitrarily causing energy to be drawn from the power supply that the adversary can measure, and forcibly inserted into the circuit, irrespective of the energy dissipated in signal and pseudo-signal. The excess of this energy is preferably dissipated over the entire surface of the monolithic cryptocomputer.
A pseudo signal is defined as an intentionally superfluously generated noise that in many or all respects mocks a valid signal using similar or identical resources and synchronized to the system clocks. Pseudo-signals, which are effectively noise, can be generated simultaneously with a valid signal, or alone in a sequence.
Montgomery Modular Multiplication
A classical modular multiplication procedure consists of both a multiplication and a division process, e.g., A·B mod N where the result is the remainder of the product A·B divided by N. Implementing a conventional division of large operands is more difficult to perform than serial/parallel multiplications.
Using Montgomery's modular reduction method, division is typically replaced by multiplications using two precomputed constants. In the procedure demonstrated herein, there is only one precomputed constant, which is a function of the modulus. This constant is, or can be, computed using this specialized arithmetic Operational Unit device.
A simplified presentation of the Montgomery process, as is used in this device is now provided, followed by a complete preferred description.
If the number is odd (an LS bit of one), e.g., 1010001 (= 81 decimal), the odd number is typically transformed to an even number (a single LS bit of zero) by adding to it another fixing, compensating odd number, e.g., 1111 (= 15 decimal); as 1111+1010001=1100000 (96 decimal). In this particular case, a number with five LS zeros is produced, because we know the whole string, 81, in advance, and can easily determine a binary number which, when added to 81, produces a new binary number that has at least k LS zeros. The added-in number is odd. Adding in an even number has no effect on the progressive LS bits of a result.
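As a small illustration of the compensating number in this example, the following Python sketch computes the fix that forces a chosen number of LS zeros when the whole value is known in advance. The helper name ls_zero_fix is hypothetical and is used only for this illustration.

```python
def ls_zero_fix(value, zeros):
    """Return the smallest non-negative number that, added to value,
    gives a sum whose `zeros` least significant bits are all zero."""
    return (-value) % (1 << zeros)


fix = ls_zero_fix(0b1010001, 5)      # 81, asking for five LS zeros
total = 0b1010001 + fix
print(bin(fix), bin(total))
```

For an odd value the fix is always odd, which agrees with the observation in the text that the added-in number is odd.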
This is a clocked serial/parallel carry save process, where it is desired to have a continuous number of LS zeros. Thus at each clock cycle only the next bit emitting from the CSA, 410, may need a change of polarity. At each clock it is sufficient to add the fix, if the next bit is potentially a one or not to add the fix if the potential bit were to be a zero. However, in order not to cause interbit overflows (double carries), this fix is preferably summated previously with the multiplicand, to be added into the accumulator when the relevant multiplier bit is one, whenever the Y0 Sense, 430, detects a one.
Only the remainder of a value divided by the modulus is of interest. To maintain congruency it is sufficient to add the modulus any number of times to a value, and still have a value that has the same remainder. This means typically that Y0·N = Σ y_i·2^i·N added to any integer typically produces a result with the same remainder. Y0 is typically the number of times we add the modulus, N, to the summation to produce the necessary LS zeros. As described, the modulus that is added to the value is odd.
Montgomery interleaved variations typically reduce the limited working register storage used for operands. This is especially useful when performing public key cryptographic functions where typically one large integer, e.g., n=1024 bit, is multiplied by another large integer; a process that conventionally produces a double length 2048 bit integer.
Typically a sufficient number of Ns (the moduli) are added in to A·B=X or A·B+S=X during the process of multiplication (or squaring) so that the result is a number, Z, that has n LS zeros, and, at most, n+1 MS bits.
The LS n bits may be disregarded, typically, while performing P field computations, if at each stage, the result is realized to be the natural field modular arithmetic result, divided by 2^n.
When the LS n bits are disregarded, and only the most significant n (or n+1) bits are used, then effectively, the result has been multiplied by 2^−n, the modular inverse of 2^n. If subsequently this result is re-multiplied by 2^n mod N (or 2^n), a value is typically obtained which is congruent to the desired result (having the same remainder) as A*B+S mod N.
Example:
A*B+S mod N = (12*11+10) mod 13 = (1100*1011+1010) mod 1101, in binary.
2^i·N is added in whenever a fix is necessary on one of the n LS bits.
And the result is 1 0001 0000 (binary) mod 13 = 17*2^4 mod 13.
As 17 is larger than 13, 13 is subtracted, and the result is:
17*2^4 ≡ 4*2^4 mod 13.
Formally, 2^−n·(A*B+S) mod N = 9*(12*11+10) mod 13 ≡ 4.
In Montgomery arithmetic only the MS non-zero result is utilized, and in the P field, it is typically assumed that the real result is divided by 2^n; n zeros having been forced onto the MM.
In the example, (8+2)*13 = 10*13 was added in, which effectively multiplied the result by 2^4 mod 13 ≡ 3. In effect, with the superfluous zeros the result is A*B+Y*N+S = (12*11+10*13+10) in one process. This process, on much longer numbers, is executable on a preferred embodiment.
Check: (12*11+10) mod 13 = 12; 4*3 = 12.
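The example above can be reproduced with a short Python sketch that adds 2^i·N whenever bit i of the running value would otherwise be a one. This is only a software model of the bit-by-bit fix under the stated assumptions; the function name force_ls_zeros and its return convention are illustrative and do not describe the circuit.

```python
def force_ls_zeros(x, modulus, n):
    """Add multiples of an odd modulus so that the n LS bits of x become zero.

    Returns (z, y) with z = x + y * modulus and z divisible by 2**n.
    """
    if modulus % 2 == 0:
        raise ValueError("modulus must be odd")
    z, y = x, 0
    for i in range(n):
        if (z >> i) & 1:          # bit i would otherwise be a one
            z += modulus << i     # add 2**i * N to clear it
            y |= 1 << i           # record that N was added at weight 2**i
    return z, y


x = 12 * 11 + 10                   # A*B + S
z, y = force_ls_zeros(x, 13, 4)
s = z >> 4                         # disregard the four LS zeros
if s >= 13:
    s -= 13                        # at most one subtraction of N
print(y, bin(z), s)
```

With these inputs the model adds the modulus at weights 2 and 8, matching the (8+2)*13 of the example.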
To retrieve an MM result back into a desired result using the same multiplication method, the previous result is Montgomery Multiplied by 2^(2n) mod N, the term which is defined as H, as each MM leaves a parasitic factor of 2^−n.
The Montgomery Multiply function (A·B)N performs a multiplication modulo N of the A·B product into the P field. (In the above example, where we derived 4). The retrieval from the P field back into the normal modular field is performed by enacting the same operator on the result of (A·B)N using the precomputed constant H. Now, if P ≡ (A·B)N, it follows that (P·H)N ≡ A·B mod N; thereby performing a normal modular multiplication in two P field multiplications.
Montgomery modular reduction averts a series of multiplication and division operations on operands that are n and 2n bits long, by performing a series of multiplications, additions, and subtractions on operands that are n or n+1 bits long. The entire process yields a result which is smaller than or equal to N. For given A, B and odd N, there is always a Q, such that A·B+Q·N results in a number whose n LS bits are zero, or:
P·2^n = A·B + Q·N
This means that the result is an expression 2n bits long, whose n LS bits are zero.
Now, let I·2^n ≡ 1 mod N (I exists for all odd N). Multiplying both sides of the previous equation by I yields the following congruences:
from the left side of the equation:
P·I·2^n ≡ P mod N; (Remember that I·2^n ≡ 1 mod N)
and from the right side:
A·B·I + Q·N·I ≡ A·B·I mod N; (Remember that Q·N·I ≡ 0 mod N)
therefore:
P ≡ A·B·I mod N.
This also means that a parasitic factor I = 2^−n mod N is introduced each time a P field multiplication is performed.
The P field operator is defined such that:
P ≡ A·B·I mod N ≡ (A·B)N.
and we call this "multiplication of A times B in the P field", or Montgomery Multiplication.
The retrieval from the P field can be computed by applying the operator to P·H, making:
(P·H)N ≡ A·B mod N;
H is typically derived by substituting P in the previous congruence:
(P·H)N ≡ (A·B·I)(H)(I) mod N;
(any Montgomery multiplication operation introduces the parasitic I)
If H is congruent to the multiplicative inverse of I^2 then the congruence is valid, therefore:
H = I^−2 mod N ≡ 2^(2n) mod N
(H is a function of N and is called the H parameter)
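The role of the H parameter can be checked numerically. The sketch below is a reference model only: it computes the P field product directly as A·B·I mod N rather than by the serial method of the apparatus, and the names p_field_multiply and h_parameter are assumptions introduced for this illustration. It reuses the small values N=13 and n=4 from the earlier example.

```python
def p_field_multiply(a, b, modulus, n):
    """Reference model of (a*b)N = a * b * I mod N, with I = 2**(-n) mod N."""
    i_factor = pow(2, -n, modulus)      # parasitic factor I; exists for odd N
    return (a * b * i_factor) % modulus


def h_parameter(modulus, n):
    """H = I**(-2) mod N = 2**(2n) mod N."""
    return pow(2, 2 * n, modulus)


N, n = 13, 4
A, B = 12, 11
H = h_parameter(N, n)                   # 2**8 mod 13
P = p_field_multiply(A, B, N, n)        # result in the P field
F = p_field_multiply(P, H, N, n)        # retrieval into the natural field
print(H, P, F, (A * B) % N)
```

The second multiplication by H cancels both parasitic factors, so F equals A·B mod N.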
In conventional Montgomery methods, to enact the P field operator on A·B, the following process may be employed, using the precomputed constant J:
1) X = A·B
2) Y = (X·J) mod 2^n (only the n LS bits are necessary)
3) Z = X + Y·N
4) S¥ = Z/2^n (The constraint on J is that it forces Z to be divisible by 2^n)
5) P ¥ S mod N (N is to be subtracted from S, if S ≧ N)
Finally, at step 5):
P ¥ (A·B)N,
[After the subtraction of N, if necessary:
P = (A·B)N.]
Following the above:
Y = A·B·J mod 2^n (using only the n LS bits);
and:
Z = A·B + (A·B·J mod 2^n)·N.
In order that Z be divisible by 2^n (the n LS bits of Z are preferably zero), the following congruence must hold:
[A·B + (A·B·J mod 2^n)·N] mod 2^n ≡ 0
In order that this congruence can exist, N·J mod 2^n must be congruent to −1, or:
J ≡ −N^−1 mod 2^n.
and the constant J is the result.
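Steps 1) to 5), with J derived as above, can be sketched in Python as follows. This is a minimal model of the conventional method, assuming an odd modulus N < 2^n and operands A, B < N; the function name montgomery_multiply is illustrative and not taken from the specification.

```python
def montgomery_multiply(a, b, modulus, n):
    """Compute a * b * 2**(-n) mod modulus using steps 1) to 5), for odd modulus."""
    r = 1 << n
    j = (-pow(modulus, -1, r)) % r      # J = -N^-1 mod 2^n
    x = a * b                           # 1) X = A*B
    y = (x * j) % r                     # 2) Y = X*J mod 2^n
    z = x + y * modulus                 # 3) Z = X + Y*N
    assert z % r == 0                   # J forces n LS zeros in Z
    s = z >> n                          # 4) S = Z / 2^n
    if s >= modulus:                    # 5) at most one subtraction of N
        s -= modulus
    return s


print(montgomery_multiply(12, 11, 13, 4))
```

Because A·B < N^2 and Y·N < 2^n·N, the quotient S is smaller than 2N, which is why one conditional subtraction suffices in this sketch.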
J, therefore, is preferably a precomputed constant which is a function of N only. However, in an apparatus operative to output an MM result, bit by bit, provision is typically made to add in Ns at each instance where the output bit in the LS string would otherwise have been a one, thereby obviating the necessity of precomputing J. Y is detected bit by bit using hardwired logic instead of precomputing Y = A·B·J mod 2^n. The method described is typically executable only for odd Ns.
It is to be noted that if the bit length of the MAP is equal to the bit length, n, of the modulus, only one iteration is necessary to perform a multiplication or a square. In reality the whole computation is performed in approximately n (the length of the operands) effective clock cycles. However, the last n effective clock cycles, in this embodiment, are necessary to flush the result out of the Carry Save Accumulator and also to perform the "Compare to N" which sets the borrow detect. Another preferred embodiment can be constructed wherein a parallel compare can be executed in one clock cycle, and the result left in a MAP register which can serve both as a result and an operand register.
Therefore, as is apparent, the process described employs three multiplications, one summation, and a maximum of one subtraction for the given A, B, N. Computing in the P field typically requires an additional multiplication by a constant to retrieve (A·B)N into the natural field of modular arithmetic integers. As A can also be equal to B, this basic operator can be used as a device to square or multiply in the modular arithmetic.
Interleaved Montgomery Modular Multiplication is Now Described:
The previous section describes a method for modular multiplication which involved multiplications of operands that were all n bits long, and results which typically occupied 2n+1 bits of storage space.
Using Montgomery's interleaved reduction as described previously, it is possible to perform the multiplication operations with shorter operands, registers, and hardware multipliers; enabling the implementation of an electronic device with relatively few logic gates.
First, at each iteration of the interleave, using the device of U.S. Pat. No. 5,742,530, the number of times that N is added is preferably computed using the J0 constant. Interleaving with a hardwired derivation of Y0 preferably eliminates the J0 phase of each multiplication (step 2 in the following example). Eliminating the J0 phase enables integration of the functions of two separate serial/parallel multipliers into the new single generic multiplier, which preferably performs A·B+Y0·N+S at better than double the speed of previous similar sized devices.
Using a k bit multiplier, it is convenient to define characters of k bit length; there are m characters in n bits, i.e., m·k=n.
J0 is defined as the LS character of J.
Therefore:
J0 ≡ −N0^−1 mod 2^k (J0 exists as N is odd).
Note, the J and J0 constants are compensating numbers that, when enacted on the potential output, tell how many times to add the modulus in order to have a predefined number of least significant zeros. Following is a description of an additional advantage of the present serial device: as the next serial bit of output can be easily determined, it is preferred to add the modulus (always odd) to the next intermediate result. This is the case if, without this addition, the output bit, the LS serial bit exiting the CSA, would be a "1". Adding the modulus to the intermediate result makes that result even, and thereby typically outputs another LS zero into the output string. Congruency is maintained, as no matter how many times the modulus is added to the result, the remainder is constant.
In the conventional use of Montgomery's interleaved reduction, (A·B)N is enacted in m iterations as described in steps (1) to (5):
Initially, S(0)=0 (the ¥ value of S at the outset of the first iteration).
For i=1, 2 . . . m:
1) X = S(i−1) + A_(i−1)·B (A_(i−1) is the (i−1)'th character of A; S(i−1) is the value of S at the outset of the i'th iteration.)
2) Y0 = X0·J0 mod 2^k (The LS k bits of the product X0·J0) (The process computes the k LS bits only, e.g., the least significant 128 bits)
In the preferred implementation, this step is hidden, as in this systolic device, Y0 can be anticipated bit by bit.
3) Z = X + Y0·N
4) S¥(i) = Z/2^k (The k LS bits of Z are always 0, therefore Z is always divisible by 2^k. This division is tantamount to a k bit right shift as the LS k bits of Z are all zeros; or as is seen in the circuit, the LS k bits of Z are simply disregarded).
5) S(i) = S¥(i) mod N (N is to be subtracted from those S(i)'s which are larger than N).
Finally, at the last iteration (after the subtraction of N, when necessary), C = S¥(m) = (A·B)N. To derive F = A·B mod N, the P field computation, (C·H)N, is performed.
It is useful to note, in a preferred embodiment, that for all S¥(i)'s, S¥(i) is smaller than 2N. This also means that the last result (S¥(m)) can always be reduced to a quantity less than N with, at most, one subtraction of N.
For operands which are used in the process:
S¥(i−1) < 2^(n+1) (the temporary register can be one bit longer than the B or N register; in this MAP Sd is always less than N),
B < N < 2^n and A_(i−1) < 2^k.
By definition:
S¥(i) = Z/2^k (The value of S at the end of the process, before a possible subtraction, 0 < i < n)
For all Z output, Z(i) < 2^(n+k+1); maximum output results for Nmax = 2^n − 1
Xmax = S¥max + A_i·B < 2^(n+1) − 1 + (2^k − 1)(2^n − 2) [Real S < N]
Qmax = Y0·N < (2^k − 1)(2^n − 1)
therefore:
Zmax = Xmax + Qmax = 2^(n+k+1) − 2^(k+1) − 2^k + 3
S¥ < 2^(n+1) − 2.
S¥(m)max − Nmax < (2^(n+1) − 2) − (2^n − 1) = 2^n − 1.
Similarly, for the lower extremum, where Nmin = 2^(n−1) + 1, Smax < 2Nmin.
Example of a Montgomery Interleaved Modular Multiplication:
The following computations in the hexadecimal format clarify the meaning of the interleaved method:
N=a59 (the modulus), A=99b (the multiplier), B=5c3 (the multiplicand), n=12 (the bit length of N), k=4 (the size in bits of the multiplier and also the size of a character), and m=3, as n=k·m.
J0=7 as 7·9 ≡ −1 mod 16 and H ≡ 2^(2·12) mod a59 ≡ 44b.
The expected result is F ≡ A·B mod N ≡ 99b·5c3 mod a59 ≡ 375811 mod a59 = 220 (hexadecimal).
Initially: S(0)=0
Step 1
X = S(0) + A0·B = 0 + b·5c3 = 3f61
Y0 = X0·J0 mod 2^k = 7 (Y0 hardwire anticipated in MAP)
Z = X + Y0·N = 3f61 + 7·a59 = 87d0
S(1) = Z/2^k = 87d
Step 2
X = S(1) + A1·B = 87d + 9·5c3 = 3c58
Y0 = X0·J0 mod 2^k = 8·7 mod 2^4 = 8 (Hardwire anticipated)
Z = X + Y0·N = 3c58 + 52c8 = 8f20
S(2) = Z/2^k = 8f2
Step 3
X = S(2) + A2·B = 8f2 + 9·5c3 = 3ccd
Y0 = d·7 mod 2^4 = b (Hardwire anticipated)
Z = X + Y0·N = 3ccd + b·a59 = aea0
S(3) = Z/2^k = aea,
as S(3) > N,
S(m) = S(3) − N = aea − a59 = 91
Therefore C = (A·B)N = 91 (hexadecimal).
Retrieval from the P field is performed by computing (C·H)N:
Again initially: S(0)=0
Step 1
X = S(0) + C0·H = 0 + 1·44b = 44b
Y0 = d (Hardwire anticipated in new MAP)
Z = X + Y0·N = 44b + 8685 = 8ad0
S¥(1) = Z/2^k = 8ad; S¥(1) = S(1) < N.
Step 2
X = S(1) + C1·H = 8ad + 9·44b = 2f50
Y0 = 0 (Hardwire anticipated in new MAP)
Z = X + Y0·N = 2f50 + 0 = 2f50
S¥(2) = Z/2^k = 2f5; S¥(2) < N
Step 3
X = S(2) + C2·H = 2f5 + 0·44b = 2f5
Y0 = 3 (Hardwire anticipated in new MAP)
Z = X + Y0·N = 2f5 + 3·a59 = 2200
S¥(3) = Z/2^k = 220 (hexadecimal), S¥(3) < N
which is the expected value of 99b·5c3 mod a59.
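The interleaved steps (1) to (5) and the hexadecimal example above can be modeled in a few lines of Python. This is a software sketch only: it computes Y0 from J0 as in the conventional method, whereas the apparatus anticipates Y0 bit by bit in hardware. The function name interleaved_mm and its parameter order are assumptions made for this illustration.

```python
def interleaved_mm(a, b, modulus, k, m):
    """Interleaved Montgomery multiplication: a * b * 2**(-k*m) mod modulus."""
    mask = (1 << k) - 1
    j0 = (-pow(modulus & mask, -1, 1 << k)) & mask   # J0 = -N0^-1 mod 2^k
    s = 0
    for i in range(m):
        a_i = (a >> (k * i)) & mask      # next character of A, LS first
        x = s + a_i * b                  # 1) X = S + A_i * B
        y0 = ((x & mask) * j0) & mask    # 2) Y0 = X0 * J0 mod 2^k
        z = x + y0 * modulus             # 3) Z = X + Y0 * N
        s = z >> k                       # 4) S = Z / 2^k (k LS zeros dropped)
        if s >= modulus:                 # 5) subtract N when necessary
            s -= modulus
    return s


N, A, B, H = 0xA59, 0x99B, 0x5C3, 0x44B
C = interleaved_mm(A, B, N, k=4, m=3)    # P field product
F = interleaved_mm(C, H, N, k=4, m=3)    # retrieval using H
print(hex(C), hex(F))
```

With the values of the example, the model gives C = 91 and F = 220 in hexadecimal, in agreement with the step by step computation.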
If at each step k LS zeros are disregarded, the result is tantamount to having divided the n MS bits by 2^k. Likewise, at each step, the i'th segment of the multiplier is also a number multiplied by 2^(ik), giving it the same rank as S(i).
The following explains a sequence of squares and multiplies, which implements a modular exponentiation.
After precomputing the Montgomery constant, H = 2^(2n) mod N, as this device can both square and multiply in the P field, it is possible to compute:
C = A^E mod N.
Let E(j) denote the j'th bit in the binary representation of the exponent E, starting with the MS bit whose index is 1 and concluding with the LS bit whose index is q. The process is as follows for odd exponents:
A* ¥ (A·H)N    A* is now equal to A·2^n.
B = A*
FOR j=2 TO q−1
B ¥ (B·B)N
IF E(j)=1 THEN
B ¥ (B·A*)N
ENDFOR
B ¥ (B·B)N    the squaring for j=q
B ¥ (B·A)N    E(q)=1; B is the last desired temporary result multiplied by 2^n, A is the original A.
C¥ = B
C = C¥ − N if C¥ ≧ N.
After the last iteration, the value B is ¥ to A^E mod N, and C is the final value.
To clarify, note the following example:
E = 1011 → E(1)=1; E(2)=0; E(3)=1; E(4)=1;
To find A^1011 mod N (exponent written in binary); q=4
A* = (A·H)N = A·I^−2·I = A·I^−1 mod N
B = A*
FOR j=2 to q
B = (B·B)N which produces: A^2·(I^−1)^2·I = A^2·I^−1
E(2)=0; B = A^2·I^−1
j=3 B = (B·B)N = (A^2)^2·(I^−1)^2·I = A^4·I^−1
E(3)=1 B = (B·A*)N = (A^4·I^−1)(A·I^−1)·I = A^5·I^−1
j=4 B = (B·B)N = A^10·I^−2·I = A^10·I^−1
As E(4) was odd, the last multiplication is by A, to remove the parasitic I^−1.
B = (B·A)N = A^10·I^−1·A·I = A^11
C = B
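The square and multiply sequence traced above can be sketched in Python. The sketch assumes an odd exponent and an odd modulus, as in the text, and uses a reference P field multiplication (A·B·2^−n mod N computed directly) in place of the hardware operator, so the limited congruence ¥ and the final conditional subtraction do not arise. The names p_mul and montgomery_exp are illustrative assumptions.

```python
def p_mul(a, b, modulus, n):
    """Reference P field multiplication: a * b * 2**(-n) mod modulus."""
    return (a * b * pow(2, -n, modulus)) % modulus


def montgomery_exp(a, e, modulus, n):
    """Compute a**e mod modulus for odd e, scanning exponent bits MS first."""
    if e < 1 or e % 2 == 0 or modulus % 2 == 0:
        raise ValueError("this sketch assumes an odd exponent and an odd modulus")
    if e == 1:
        return a % modulus
    h = pow(2, 2 * n, modulus)              # H = 2^(2n) mod N
    a_star = p_mul(a, h, modulus, n)        # A* = A * 2^n mod N
    bits = bin(e)[2:]                       # E(1) .. E(q), MS bit first
    b = a_star
    for bit in bits[1:-1]:                  # j = 2 .. q-1
        b = p_mul(b, b, modulus, n)         # square in the P field
        if bit == "1":
            b = p_mul(b, a_star, modulus, n)    # multiply by A*
    b = p_mul(b, b, modulus, n)             # squaring for j = q
    return p_mul(b, a, modulus, n)          # multiply by original A; removes 2^n


print(montgomery_exp(7, 0b1011, 13, 4), pow(7, 0b1011, 13))
```

Multiplying by the original A in the last step, rather than by A*, cancels the remaining factor of 2^n, so no separate retrieval from the P field is needed.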
Apparatus for accelerating the modular multiplication and exponentiation process is preferably provided, including means for precomputing the necessary single Montgomery constant, H = 2^(2n) mod N; where n is the bit length of the operand, and N is the modulus.
An exhaustive search, or a brute force attack, is an attack where the hacker knows the encryption scheme, and is able to break the scheme by trying all possible keys. In the event that the hacker is able, by physical means, to find parts of the sequence, an exhaustive search then consists of an orderly trial and error sequence of tests to determine a sequence. Exhaustive search cryptographic attacks are considered intractable if the hacker is forced to execute, on the average, at least 2^80 trials in order to learn a correct sequence.
The number of trials that makes a method intractable is obviously machine dependent. Diffie [Whitfield Diffie and Susan Landau, "Privacy on the Line", MIT Press, Cambridge, 1998, page 27, hereinafter, Diffie] conjectures that a method of breaking a code, used by a hacker who has access to a very large percentage of the world's computing power, typically needs more than 2^90 trials to be intractable for the foreseeable future. Diffie notes that to execute 2^120 trials would take 30,000 years with 10^12 dedicated processors, each of which performs a procedural test on a secret in a picosecond. This, Diffie estimates, is sufficiently strong for the indefinite future. Most researchers today believe that 2^80 trials pose an intractable problem [A. J. Menezes, P. C. van Oorschot, S. A. Vanstone, Handbook of Applied Cryptography, CRC Press, Boca Raton, 1997, Chapter 4, 4.49]. ANSI Standard X9.31-1997, page 25, specifies 2^100 iterations for banking use, which typically covers certificates from CAs [Certification Authorities].
System security in an RSA environment is dependent on the strength of the CA's secret key. These are typically long lasting, as they are preferably masked into all devices in the system. Devices with the CA's secret keys are preferably kept in a well protected environment, and are not subject to reverse engineering or non-invasive attacks. However, an ordinary financial smart card, with preset reasonable credit limits and a maximum lifetime of four years, is typically not the target of a costly search, as it is typically based on a lower level of security. An insufficiently secured banker's certificate, on the other hand, is a potential victim for an exhaustive search attack. A satellite television descrambler in a "pay for what you see" system that includes a potential non-paying audience of millions is a likely target for a hacker intent on cloning, as a cloned RSA smart card is typically as useful as an original card.
An Internet disclosure, "Introduction to Differential Power Analysis and Related Attacks", by Paul Kocher, Joshua Jaffe and Benjamin Jun, Cryptography Research, San Francisco, 94102, www.cryptography.com, 1998, hereinafter, Kocher, describes methods which Kocher uses to learn cryptographic secrets in monolithic cryptocomputers of varied designs. The Kocher attacks are similar in principle to, but more refined in practice than, previous noninvasive attacks on cryptocomputing devices. In the most refined attacks, the hacker has accurate previous knowledge of the device and of the computational methods used, and preferably has complete access to the software or firmware which executes the computational method using a secret key.
In Differential Power Analysis (DPA), or any other probing method for learning cryptographic secrets, signal refers to the conglomerate of externally detectable features. In DPA, such signal is traced by a digitally recorded mapping over time of the instantaneous current consumption of the relevant electronic components of the MAP and the host CPU while computing a cryptographic sequence. Noise in this sense is that part of the detected data which in any way interferes with the detection of signal. A pseudo-signal is defined as intentionally generated superfluous noise that in many or all respects mocks a valid signal using similar or identical resources. Pseudo-signals, which are effectively noise, can be generated simultaneously with a valid signal, or alone in a sequence.
As most professional rogue hackers and most security testing laboratories typically have preliminary knowledge of the cryptocomputer and the firmware drivers, judicious designers and programmers always assume that adversaries have access to extensive resources. These adversaries have the means to reverse engineer silicon designs. These adversaries gain access to firmware either by physically attacking the ROM or by obtaining necessary data from developers, disgruntled employees, hacking tips on Internet bulletin boards, or another hacker who had access to an unprotected version of a cryptocomputer. Types of data that are preferably well protected are the crypto-secret keys, secret moduli, internally generated random numbers, and other secrets that are internally generated. They are preferably protected so that the programmer, the manufacturer or his employees, or the cryptocomputer owner himself, do not have access to these secrets.
In most cryptographic methods, secret keys can be extricated by learning the sequence of operations performed by the cryptocomputer, and/or the sequence of serial operations performed in the execution thereof.
In anticipated attacks, a plurality of cryptocomputers under test simultaneously execute the same cryptographic command, and the hacker statistically learns the features of each operation in the sequence. In the simplest form, this could be an elementary timing attack to learn the sequence of squares and multiplies. In many cryptocomputers, the time to execute a squaring is approximately one half of the time necessary to execute a multiplication. A graph of the current consumed during a computation, as can be observed on an oscilloscope with memory, is generally a sequence of disfigured bell shapes corresponding to the sequence of squares and multiplies. In this simplest attack, smaller bells typically represent squares and larger bells typically represent multiplications. The above described sequence of time dependent unmasked current consumption can graphically be described as a ragged, skewed, flat-topped bell, rising more quickly on the first phase of a squaring or multiplication computation, with notches of lowered consumption at phase changes and at drastic computational changes, and finally a fast decrease during the final phase of a sequence, as the CSA is being flushed out. These changes, when not carefully masked, clearly mark the status of the MAP during an iteration and can aid a hacker in synchronizing onto a computational sequence.
If a hacker can learn the sequence of squares and multiplies in a secret RSA exponent, he can extricate the prime factors of the composite public modulus. With this knowledge, a usable counterfeit cryptocomputer can be fabricated with the extricated secret keys.
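The following is a minimal sketch, given for illustration only, of why an unmasked sequence of squares and multiplies discloses the exponent. It assumes a plain left-to-right binary exponentiation, which is not necessarily the procedure of any particular device, and the function names are hypothetical. The trace string stands in for what an adversary might read from timing or current measurements.

```python
def square_and_multiply_trace(base, exponent, modulus):
    """Left-to-right binary exponentiation that records 'S' and 'M' operations."""
    result = 1
    trace = []
    for bit in bin(exponent)[2:]:
        result = (result * result) % modulus   # a squaring for every exponent bit
        trace.append("S")
        if bit == "1":
            result = (result * base) % modulus  # a multiplication only for 1 bits
            trace.append("M")
    return result, "".join(trace)


def exponent_from_trace(trace):
    """Rebuild the exponent from an observed 'S'/'M' operation sequence."""
    bits = []
    for op in trace:
        if op == "S":
            bits.append("0")    # assume a 0 bit until a multiplication follows
        else:
            bits[-1] = "1"      # an 'M' marks the preceding squaring as a 1 bit
    return int("".join(bits), 2)


if __name__ == "__main__":
    secret = 0b1011001
    value, observed = square_and_multiply_trace(7, secret, 1000003)
    print(observed, exponent_from_trace(observed) == secret)
```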
Obviously, if the chip designer has developed a procedure wherein the time and microcode sequence of squaring and multiplying are identical, a simple timing attack is typically impossible, and the adversary typically utilizes more esoteric detection techniques. As there are twice as many squaring operations as multiplications in a random sequence, a combination of statistically established features might recognize either the exponent sequence or, directly, the value of the whole or part of the modulus. Learning such features, using statistical methods, entails extensive testing. A preliminary line of defense against such attacks may well be putting a lock on the number of cryptographic sequences which can be performed before an additional license, an unlock from the Certification Authority, must be acquired.
A preferred method for camouflaging and accelerating the squaring sequence in an exponentiation procedure is now described:
In the MAP designs of U.S. Pat. Nos. 5,742,530, 5,513,133, and the PCT patent application PCT/IL98/00148, now published, prior to each Montgomery squaring procedure, the MAP ceased computing while the first LS k bits of the squaring multiplicand were loaded into the BAISR preload register. As in previous patent implementations the first serial/parallel multipliers were only 32 bits, and there were few competing designs, this delay was not considered inordinately wasteful. With a 128 bit CSA, on short operands (as are to be found in elliptic curve computations), this loading delay can account for more than 10% of procedure time in an exponentiation.
The hardware of the present invention carries out modular multiplication and exponentiation by applying Montgomery arithmetic in a novel way. Further, the squaring can be carried out in the same method, by applying it to a multiplicand and a multiplier that are equal. Modular exponentiation involves a succession of modular multiplications and squarings, and therefore is carried out by a method which comprises the repeated, suitably combined and oriented application of the aforesaid multiplication, squaring and exponentiation methods.
Final results of a Montgomery type multiplication (MM) may be larger than the modulus, but smaller than twice the modulus. In a preferred embodiment, the MAP devices can only determine the range of the result from the serial comparator at the end of the last clock cycle of the MAP computation. In previous implementations the preload registers of the MAP were loaded in a separate k effective clock sequence, prior to the next computation, where k is the number of single bit cells in the Carry Save Accumulator (CSA), 410, which is central to the computational unit. As the drawn sizes of silicon became smaller, and factoring techniques became more sophisticated, the number of bits, k, in a CSA preferably becomes larger, and in a first version of this design the CSA is 128 bits long. In a procedure that is less efficient and less secure with respect to timing, the MAP does not compute whilst the first multiplicand is preloaded for a squaring operation. This preload operation in an apparatus with a 128 bit CSA causes a 128 effective clock cycle delay, and a proportionally larger loss of performance in the total process. This delay only appears naturally in the first iteration of a squaring sequence, where both the multiplicand and the multiplier are identical.
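The range of the result stated above can be checked with the following sketch, which is illustrative only. It computes a Montgomery product on whole operands in software, whereas the MAP operates serially on k bit characters; the function names are hypothetical, and the three-argument `pow` with a negative exponent assumes Python 3.8 or later.

```python
def montgomery_product(a, b, n, k):
    """Return Z congruent to a*b*2**(-k) mod n, unreduced.

    For odd n with a, b < n < 2**k, the returned Z satisfies 0 <= Z < 2*n.
    """
    r = 1 << k
    n_neg_inv = (-pow(n, -1, r)) % r       # n * n_neg_inv ≡ -1 (mod 2**k)
    t = a * b
    y = ((t % r) * n_neg_inv) % r          # makes t + y*n divisible by 2**k
    return (t + y * n) >> k


def reduce_once(z, n):
    """Single conditional subtraction, sufficient because 0 <= z < 2*n."""
    return z - n if z >= n else z


if __name__ == "__main__":
    n, k = 239, 8
    z = montgomery_product(200, 150, n, k)
    print(z, z >= n, reduce_once(z, n))
```

Whether the subtraction is needed is known only once the whole of Z has been compared with N, which is the reason the range of the result is available only at the end of the computation.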
In a multiplication sequence this next original multiplicand character is preferably preloaded whilst the MAP is performing a previous squaring operation. However, if a programmer allows timing or energy differences between multiplication and squaring, the timing and energy dissipation features help a hacker learn secret square and multiplication sequences in an exponentiation procedure using non-invasive methods. It is always to be assumed that adversaries attempt to detect these and other features that indicate a process in a sequence. These differences and features are preferably eliminated or masked.
A preferred embodiment eliminates the delay caused by the wait for the size comparison before loading the first character of the multiplicand in a squaring sequence. This is achieved by preloading the first characters of the natural output of the CSA during the end of a previous square or multiply. These characters are S0, which is the LS character from Z/2^k, and (S−N)0, which is (Z/2^k−N)0. These characters are serially loaded into preload buffers Y0B0SR, 6350, and BAISR, 6290. At the end of the previous sequence, when the range of the result is determined, the proper values are latched into the parallel multiplicand registers. It is shown in the ensuing description how the correct multiplicands are preferably derived in a hardware implementation.
This delay state is caused by the necessity to wait until the modulus is subtracted from the whole result stream in the serial comparator/detector. Only on the last MS bit of the result does the borrow/overflow detector, 490, typically flag the control mechanism to denote whether the result is larger than the modulus. In the embodiments of U.S. Pat. Nos. 5,513,133 and 5,742,530, only after the smallest positive congruence of the result is determined is it possible to load the first character of the squaring multiplicand. So as not to disclose the difference between a square and a multiply to an adversary who is intent on learning an exponentiation sequence using a simple timing attack, this idle period preferably also prefaces a multiplication sequence.
In a squaring operation the value in the multiplier register furnishes the values for both the multiplier and the multiplicand. If the squaring multiplier value is larger than the modulus, the modulus value is serially subtracted from the larger than modulus squaring value as the multiplier stream exits the multiplier register.
In the previous patented devices, the MAP process was halted while the first k bits were loaded after modular reduction, into the multiplicand register for the next squaring operation. As subsequent k bit multiplicand operands are modular reduced if necessary and preloaded on the fly during the squaring operation, this delay was necessitated only on the first iteration of a squaring procedure.
A primary step in masking squares and multiplication is to execute a squaring operation in a mode wherein all rotating registers and computational devices are exercised in exactly the same manner for squaring and multiplying, the only difference being the settings of data switches which choose relevant data for computation and not using [trashing] the irrelevant data.
In a preferred embodiment, the first iteration of a squaring operation, performing B0·B+Y0·N, can be accelerated and masked, when using the two outputs, B¥0 and B¥0−N0, of the last iteration of either a squaring or multiplication operation which precedes the squaring operation which is to be masked and accelerated.
Finding the proper carry bit, c, when c·2^k + S0¥ = S0 + N0 is loaded on the fly from the MAP is not obvious. This explicit summation is not performed in the MAP. The carry bit, c, is determined when S¥ ≥ N [assume that k=128], and the summation performed is:
Z1 = S0¥ = {(Ai·B + Y0·N + S−) mod 2^(2k)} div 2^k
[S− is the temporary summation from the previous iteration.]
There is further provided in accordance with yet another preferred embodiment of the present invention a method for at least partially preventing leakage of secret information as a result of a probing operation on a cryptocomputer performing secret sequences, the method including the step of decoupling the power supply to the cryptocomputer from the external power source, wherein the cryptocomputer operates from an intermediary independent regulator dissipating excess energy.
Further in accordance with a preferred embodiment of the present invention, the intermediary stage of the power supply has a programmable energy dissipator operative to mask from a probing device the energy expended by the cryptocomputer.
Still further in accordance with a preferred embodiment of the present invention, the energy dissipator is designed to dissipate in a time dependent mode, variable amounts of energy.
There is also provided in accordance with yet another preferred embodiment of the present invention a method for at least partially preventing leakage of secret information as a result of a probing operation on a cryptocomputer performing modular exponentiation, the method including the step of causing a balanced number of changes of status from one to zero and zero to one in an interacting shift register to shift register loading and unloading sequence.
Further in accordance with a preferred embodiment of the present invention, the method includes causing a binary change of value in a second, not valid circuit at each instance that the valid circuitry does not enact a change of binary value.
Still further in accordance with a preferred embodiment of the present invention, the method includes causing the combination of the not valid circuit together with the valid circuitry to expend an amount of energy to complement an approximate average maximum amount of energy that the valid circuitry could potentially draw.
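A minimal behavioral sketch of the complementary toggling described above follows, for illustration only. It models only the number of bit changes per clock cycle, not their direction nor any actual circuit energy, and the function names and the zero initial states are assumptions. The not valid (dummy) register changes value at exactly those clock cycles at which the valid register does not, so that the combined number of changes per cycle is constant.

```python
def transitions(stream, initial=0):
    """Return 1 for each clock at which the bit differs from the previous bit."""
    previous = initial
    changes = []
    for bit in stream:
        changes.append(bit ^ previous)
        previous = bit
    return changes


def dummy_stream(valid_stream, initial_valid=0, initial_dummy=0):
    """Bits for a dummy register that toggles exactly when the valid one does not."""
    dummy = initial_dummy
    out = []
    for toggled in transitions(valid_stream, initial_valid):
        if not toggled:
            dummy ^= 1          # mock change of value in the not valid circuit
        out.append(dummy)
    return out


if __name__ == "__main__":
    valid = [1, 1, 0, 1, 0, 0, 0, 1]
    dummy = dummy_stream(valid)
    combined = [v + d for v, d in zip(transitions(valid), transitions(dummy))]
    print(combined)  # one change per clock, whatever the valid data
```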
There is also provided in accordance with a preferred embodiment of the present invention a method for at least partially preventing leakage of secret information as a result of a probing operation on a cryptocomputer performing elliptic curve point addition and point doubling, the method including causing a balanced number of changes of status from one to zero and zero to one in an interacting shift register to shift register loading and unloading sequence.
Preferably, for at least partially preventing leakage of secret information as a result of a probing operation on a cryptocomputer, logic circuitry causes a binary change of value in a not valid circuit at each instance that the valid circuitry does not enact a change of binary value.
Further in accordance with a preferred embodiment of the present invention, the not valid circuitry is another shift register configured so that the two registers operate together to expend an amount of energy to complement an approximate average maximum amount of energy that the valid circuitry could potentially draw.
There is further provided in accordance with yet another preferred embodiment of the present invention, a method for at least partially preventing leakage of secret information as a result of an energy probing operation on a cryptocomputer performing modular exponentiation, the method including the step of causing a nearly constant current consumption when moving a data word from one data store to another, irrespective of the previous status of the data source and the data destination.
There is further provided in accordance with yet another preferred embodiment of the present invention, a method for at least partially preventing leakage of secret information as a result of a probing operation on a cryptocomputer performing modular exponentiation, the method including inserting mock square operations in difficult to detect positions in an exponentiation sequence.
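For illustration only, the effect of mock square operations on an observed operation sequence can be sketched as follows. The sketch assumes a left-to-right binary exponentiation and takes the mock positions as an explicit argument; how positions are chosen so as to be difficult to detect is not modeled, and all names are hypothetical.

```python
def modexp_with_mock_squares(base, exponent, modulus, mock_positions):
    """Binary exponentiation with discarded (mock) squarings at chosen bit indices.

    Returns the result and the externally visible 'S'/'M' operation string.
    """
    result = 1
    ops = []
    for index, bit in enumerate(bin(exponent)[2:]):
        result = (result * result) % modulus
        ops.append("S")
        if index in mock_positions:
            _trash = (result * result) % modulus   # mock squaring, result discarded
            ops.append("S")
        if bit == "1":
            result = (result * base) % modulus
            ops.append("M")
    return result, "".join(ops)


if __name__ == "__main__":
    value, ops = modexp_with_mock_squares(5, 0b1101, 97, {1, 2})
    print(value == pow(5, 0b1101, 97), ops)
```

The computed value is unchanged, but the number of observed squarings no longer equals the bit length of the exponent, so the operation string cannot be decoded bit for bit.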
There is also provided in accordance with a preferred embodiment of the present invention a method for accelerating and at least partially preventing leakage of secret information as a result of a probing operation on a cryptocomputer performing modular exponentiation, the method including the step of performing a multiplication procedure using addition chain procedures, wherein a plurality of single multiplication operations of the base value times the result of a previous squaring operation are replaced by single multiplications of small multiples of the base value times a previous squaring operation.
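One well-known addition chain procedure of this kind is fixed-window exponentiation, sketched below for illustration only. The sketch interprets the stored small multiples of the base value as a table of small precomputed powers of the base, which is an assumption of this sketch; the function name and the default window width are likewise hypothetical.

```python
def windowed_modexp(base, exponent, modulus, window_bits=2):
    """Fixed-window exponentiation; returns (result, number of multiplications)."""
    # Table of base**0 .. base**(2**window_bits - 1), computed once.
    table = [1 % modulus] * (1 << window_bits)
    for i in range(1, len(table)):
        table[i] = (table[i - 1] * base) % modulus

    # Split the exponent into window_bits-wide digits, least significant first.
    digits = []
    remaining = exponent
    while remaining:
        digits.append(remaining & ((1 << window_bits) - 1))
        remaining >>= window_bits

    result = 1 % modulus
    multiplications = 0
    for digit in reversed(digits):
        for _ in range(window_bits):
            result = (result * result) % modulus
        if digit:
            # One multiplication by a table entry replaces up to
            # window_bits separate multiplications by the base.
            result = (result * table[digit]) % modulus
            multiplications += 1
    return result, multiplications


if __name__ == "__main__":
    value, count = windowed_modexp(3, 0b11111111, 1009)
    print(value == pow(3, 255, 1009), count)
```

For the eight bit exponent in the usage example, a bit-by-bit procedure would perform eight multiplications by the base, whereas the two bit window performs four multiplications by table entries.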
Further in accordance with a preferred embodiment of the present invention, the step of masking the exponentiation sequence of squaring and multiplication operations includes causing mock squaring operations, normal squaring operations and multiplication operations to be identical in number of clock cycles and statistically similar in the amounts of energy consumed during each clock cycle of each operation.
There is further provided in accordance with yet another preferred embodiment of the present invention a method for at least partially preventing leakage of secret information as a result of a probing operation on a cryptocomputer performing scalar multiplication of a point on an elliptic curve, including storing precomputed values of consecutive small integer multiples of the initial point value and performing elliptic curve point additions using these multiples of the initial point value in the sequence to replace many single point addition operations.
Further in accordance with a preferred embodiment of the present invention, the method includes performing an addition type operation at regular intervals in the scalar point multiplication sequence, and enacting a mock addition operation when an addition operation is not necessary in the regular interval of the sequence.
Still further in accordance with a preferred embodiment of the present invention, the addition type operations and the mock point addition operation are masked to be almost identical in number of clock cycles and to dissipate statistically similar amounts of energy during each clock cycle of each operation.
There is also provided in accordance with a preferred embodiment of the present invention, a method for accelerating and masking a first iteration in a later modular squaring operation, B0·B+Y0·N, performed on an output, B¥0 and B¥0−N0, of the last iteration of an earlier modular multiplication operation, each operation including a plurality of iterations, wherein an output of the last iteration of the earlier operation comprises a partially unknown quantity whose least significant portion comprises a multiplicand for the first iteration of the later operation, the partially unknown quantity having two possible values, one of which is B0, the two possible values including a smaller multiplicand value and a larger multiplicand value which is one modulus value, N, greater than the smaller multiplicand value, the method including the steps of: during the last iteration of the earlier operation, on-the-fly extricating of the least significant portions of both possible values of the multiplicand for the later operation's first iteration, summing the least significant portion of the larger multiplicand value with a least significant portion of the modulus, thereby to obtain a least significant portion of a largest multiplicand value which is one modulus value greater than the larger multiplicand value, and from among the three least significant portions, selecting the least significant portions of the two positive multiplicand values as B0 and B0+N0, relating to the first iteration of the later modular squaring operation.
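The selection among the three least significant portions can be sketched in software as follows, for illustration only. The sketch assumes an unreduced result S with 0 <= S < 2N and represents the least significant k bit characters as low-order masks of Python integers; the function and variable names are hypothetical, and the serial on-the-fly extraction performed by the hardware is not modeled.

```python
def select_multiplicand_characters(s, n, k):
    """Return the LS k-bit characters of B and of B + N, where B = s mod n.

    Assumes 0 <= s < 2*n, so that B is either s or s - n.
    """
    mask = (1 << k) - 1
    s_minus_n_0 = (s - n) & mask              # LS character of the smaller candidate
    s_0 = s & mask                            # LS character of the larger candidate
    s_plus_n_0 = (s_0 + (n & mask)) & mask    # LS characters summed, carry dropped

    # Only the final comparison of s with n decides which pair is latched.
    if s >= n:
        return s_minus_n_0, s_0               # B = s - n, B + N = s
    return s_0, s_plus_n_0                    # B = s,     B + N = s + n


if __name__ == "__main__":
    n, k = 0xB3A7, 8
    for s in (3, n - 1, n, n + 5, 2 * n - 1):
        print(hex(s), [hex(c) for c in select_multiplicand_characters(s, n, k)])
```

All three candidate characters are available before the range of the result is known, so the selection itself requires no additional loading sequence.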
Further in accordance with a preferred embodiment of the present invention, the extricating and summing steps in preparation for a squaring process and the process of preparing for a multiplication process are performed simultaneously.
Still further in accordance with a preferred embodiment of the present invention, the method also includes making the extrication process and the preparation procedure for performing a multiplication almost identical in timed processing and energy consumption.
There is further provided in accordance with a preferred embodiment of the present invention, circuitry and method of utilizing a rotating shift register to generate programmable modulated random noise including tapped outputs of cells in the shift register each tap capable of generating fixed amounts of noise.
Further in accordance with a preferred embodiment of the present invention, the noise generated by each cell is conditioned by the binary data output of the cell wherein, the rotating data sequence in the shift register is computed to generate a predetermined range of random noise.
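A behavioral sketch of such a tapped rotating shift register is given below, for illustration only. It assumes that each tap contributes a fixed weight of noise when the cell it monitors holds a one, and that the register rotates by one cell per clock; the weights, the rotation direction and the function name are assumptions of this sketch rather than features taken from the embodiment.

```python
def noise_sequence(register_bits, tap_weights, cycles):
    """Noise level per clock for a rotating register with weighted taps."""
    bits = list(register_bits)
    levels = []
    for _ in range(cycles):
        # Each tap adds its fixed amount only when its cell outputs a one.
        levels.append(sum(w for b, w in zip(bits, tap_weights) if b))
        bits = bits[-1:] + bits[:-1]    # rotate the register by one cell
    return levels


if __name__ == "__main__":
    # The loaded data pattern determines the range of noise levels produced.
    print(noise_sequence([1, 0, 1, 0], [1, 2, 4, 8], 4))
```

Loading a different data sequence into the register changes the levels and their range without changing the taps.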
There is also provided in accordance with a preferred embodiment of the present invention, a method for at least partially preventing leakage of secret information as a result of a probing operation on a cryptocomputer performing modular exponentiation, the method including anticipating specific clock cycles in an iteration wherein the average current consumption is less than a maximum value, and partially masking this lowered average energy consumption with a random superfluous temporal consumption of energy whose average value is similar to the difference between the maximum value and the anticipated lowered average energy consumption.
There is further provided in accordance with a preferred embodiment of the present invention, a method for accelerated loading of data, from a plurality of memory addresses in a CPU having an accumulator, to a memory-mapped destination, the method includes the steps of: setting the memory-mapped destination to read said data, sending data which is desired to be loaded into the memory-mapped destination, from the memory address to the accumulator, and subsequent to such data having been snared by the memory-mapped destination, setting the memory-mapped destination to cease reading said data.
There is also provided in accordance with a preferred embodiment of the present invention, a method for accelerated loading of data from a memory-mapped source to a plurality of memory addresses associated with a CPU, the method including the steps of: sending a first command from the CPU to disable the connection of the CPU's accumulator to the CPU's data bus, thereby providing a cue to the memory-mapped source to unload its data onto the data bus to be read by the memory at addresses specified in, performing a series of subsequent move from accumulator to specific memory destination commands, wherein at each command data is moved from the source address to the specific memory destination address; and, once a data batch has been transferred, transmitting a command from the CPU to re-enable the accumulator's data connection with said data bus, and also to cause the memory-mapped source to cease unloading its data onto the data bus.