The present application is directed to a method and apparatus for performing encryption and decryption. The application discloses several inventions relating to an overall system for the use of exponentiation modulo N as a mechanism for carrying out the desired cryptological goals and functions in a rapid, efficient, accurate and reliable manner. A first part of the disclosure is related to the construction of a method and its associated apparatus for carrying out modular multiplication. A second part of the disclosure is directed to an improved apparatus for carrying out modular multiplication through the partitioning of the problem into more manageable pieces and thus results in the construction of individual identical (if so desired) Processing Elements. A third part of the disclosure is directed to the utilization of the resulting series of Processing Elements in a pipelined fashion for increased speed and throughput. A fourth part of the disclosure is directed to an apparatus and method for calculating a unique inverse operation that is desirable as an input step or stage to the modular multiplication operation. A fifth part of the disclosure is directed to the use of the modular multiplication system described herein in its originally intended function of performing an exponentiation operation. A sixth part of the disclosure is directed to the use of the Chinese Remainder Theorem in conjunction with the exponentiation operation. A seventh part of this disclosure is directed to the construction and utilization of checksum circuitry which is employed to ensure reliable and accurate operation of the entire system. The present application is particularly directed to the invention described in the second part of the disclosure.
More particularly, the present invention is directed to circuits, systems and methods for multiplying two binary numbers having up to n bits each with the multiplication being modulo N, an odd number. In particular, the present invention partitions one of the factors into m blocks with k bits in each block with the natural constraint that mk ≥ n+2. Even more particularly, the present invention is directed to multiplication modulo N when the factors being multiplied have a large number of bits. The present invention is also particularly directed to the use of the modular multiplication function hardware described herein in the calculation of a modular exponentiation function for use in cryptography. Ancillary functions, such as the calculation of a convenient inverse and a checksum mechanism for the entire apparatus, are also provided herein. The partitioning employed herein also results in the construction of Processing Elements which can be cascaded to provide significant expansion capabilities for larger values of N. This, in turn, leads to a modality of Processor Element use in a pipelined fashion. The cascade of Processor Elements is also advantageously controllable so as to effectively partition the Processor Element chain into separate pieces which independently work on distinct and separate factors of N.
Those wishing an optimal understanding from this disclosure should appreciate at the outset that the purpose of the methods and circuits shown herein is the performance of certain arithmetic functions needed in modern cryptography and that these operations are not standard multiplication, inversion and/or exponentiation, but rather are modulo N operations. The fact that the present application is directed to modular arithmetic circuits and methods, as opposed to standard arithmetic operations, is best kept firmly in mind, particularly since modular arithmetic, with its implied division operations, is much more difficult to perform and to calculate, particularly where exponentiation modulo N is involved.
In a preferred system for implementation which takes advantage of certain aspects of the present invention, this application is also directed to a circuit and method of practice in which an adder array and a multiplier array are effectively partitioned into a series of nearly identical processor elements, with each processor element (PE) in the series operating on a sub-block of data. Having recognized the ability to reconfigure the generic structure into a plurality of serially connected processor elements, the present invention is also directed to a method of operation in which each processor element operates as part of a pipeline over a plurality of operational cycles. The pipelining mode of operation is even further extended to the multiplication of a series of numbers in a fashion in which all of the processor elements are continuously and actively generating results.
The multiplication of binary numbers modulo N is an important operation in modern public-key cryptography. The security of any cryptographic system which is based upon the multiplication and subsequent factoring of large integers is directly related to the size of the numbers employed, that is, the number of bits or digits in the number. For example, each of the two multiplying factors may have up to 1,024 bits. However, for cryptographic purposes, it is necessary to carry out this multiplication modulo a number N. Accordingly, it should be understood that the multiplication considered herein multiplies two n bit numbers to produce a result with n bits or less rather than the usual 2n bits in conventional multiplication.
Although a large number of bits in each factor is desirable, the speed of calculation becomes significantly slower as the number of digits or bits increases. For real-time cryptographic purposes, however, the speed of encryption and decryption is an important concern. In particular, real-time cryptographic processing is a desirable result.
Different methods have been proposed for carrying out modular multiplication. In particular, in an article appearing in "The Mathematics of Computation," Vol. 44, No. 170, April 1985, pp. 519-521, Peter L. Montgomery describes an algorithm for "Modular Multiplication without Trial Division." However, this article describes operations that are impractical to implement in hardware for a large value of N. Furthermore, the method described by Montgomery operates only in a single phase. In contrast, the system and method presented herein partitions operational cycles into two phases. From a hardware perspective, the partitioning provides a mechanism for hardware sharing which provides significant advantages.
In accordance with a preferred embodiment of the present invention, an initial zero value is stored in a result register $Z_0$. The integers A and B which are to be multiplied using the present process are partitioned into m blocks with k bits in each block. The multiplication is carried out modulo N. Additionally, the value R is set equal to $2^k$. In this way, the integer A is representable as $A = A_{m-1}R^{m-1} + \cdots + A_2R^2 + A_1R + A_0$. This is the partitioning of the integer A into m blocks.
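The block decomposition just described can be illustrated with a short software sketch. The function names `split_blocks` and `join_blocks` are hypothetical and chosen only for this illustration; they do not appear in the disclosure. The sketch assumes a non-negative integer that fits in mk bits and returns the blocks with $A_0$ first.

```python
from typing import List


def split_blocks(a: int, m: int, k: int) -> List[int]:
    """Split a into m blocks of k bits each, least significant block (A_0) first."""
    if a < 0 or a >= 1 << (m * k):
        raise ValueError("a must be non-negative and fit in m*k bits")
    mask = (1 << k) - 1
    return [(a >> (i * k)) & mask for i in range(m)]


def join_blocks(blocks: List[int], k: int) -> int:
    """Recombine blocks as A = sum of A_i * R**i with R = 2**k."""
    R = 1 << k
    return sum(block * R**i for i, block in enumerate(blocks))


# Illustrative values only: a 32-bit integer split into m = 4 blocks of k = 8 bits.
example_blocks = split_blocks(0x12345678, m=4, k=8)
```

Recombining the blocks with `join_blocks` recovers the original integer, which is the identity expressed by the sum of the terms $A_i R^i$.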
In one embodiment of the present invention, a method and circuit are shown for computing a function $Z = f(A, B) = AB\,2^{-mk} \bmod N$. Later, it will be shown how this function is used to calculate AB mod N itself.
The system, methods, and circuits of the present invention are best understood in the context of the underlying algorithm employed. Furthermore, for purposes of understanding this algorithm, it is noted that modular computation is carried out modulo N, which is an odd number, and that n is the number of bits in the binary representation of N. Additionally, $N_0$ represents the least significant k bits of N. Also, a constant s is employed which is equal to $-1/N_0 \bmod R = 1/(R - N_0) \bmod R$. With this convention, the algorithm is expressed in pseudo code as follows:
Z_0 = 0
for i = 0 to m-1
    X_i = Z_i + A_i B
    y_i = s x_{i,0} mod R    (x_{i,0} is the least significant k bits of X_i)
    Z_{i+1} = (X_i + y_i N)/R
end
There are two items to note in particular about this method for carrying out modulo N multiplication. First, the multiplication is based upon a partitioning of one of the factors into sub-blocks with k bits in each block. This greatly reduces the size of the multiplier arrays which need to be constructed. It furthermore creates a significant degree of parallelism which permits the multiplication operation to be carried out in a much shorter period of time. Second, the partitioning splits the process not only into a plurality of m cycles but also into two phases that occur in each cycle. In the first phase (X-phase), the values $X_i$ and $y_i$ are computed. In the second phase (Z-phase), the intermediate result value $Z_{i+1}$ is calculated. It should be noted that the calculation of $X_i$ and the calculation of $Z_{i+1}$ each involve an addition operation and a multiplication operation. This fact allows the same hardware which performs the multiplication and addition in each of these steps to be shared rather than duplicated. With respect to the division by R in the formation of $Z_{i+1}$, it is noted that this is accomplishable by simply discarding the low order k bits, since $y_i$ is chosen so that the low order k bits of $X_i + y_i N$ are all zero. Other advantages of this structure will also become apparent.
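The two-phase loop can be sketched in software as follows. This is a minimal illustration under stated assumptions, not the disclosed hardware: the function name `mont_mul` is hypothetical, the constant s is obtained with Python's built-in `pow(x, -1, R)` (Python 3.8 or later) rather than with a dedicated circuit, and the operand values are arbitrary. The returned value is congruent to $AB\,2^{-mk}$ modulo N but is not necessarily reduced below N.

```python
def mont_mul(A: int, B: int, N: int, k: int, m: int) -> int:
    """Return Z with Z congruent to A * B * 2**(-m*k) modulo N (N odd).

    Assumes A fits in m*k bits and m*k >= n + 2, where n is the bit length of N.
    """
    if N % 2 == 0:
        raise ValueError("N must be odd")
    R = 1 << k
    mask = R - 1
    s = (-pow(N & mask, -1, R)) % R      # s = -1/N_0 mod R
    Z = 0                                # Z_0 = 0
    for i in range(m):
        A_i = (A >> (i * k)) & mask      # i-th k-bit block of A
        X = Z + A_i * B                  # X-phase: multiply and add
        y = (s * (X & mask)) & mask      # y_i = s * x_{i,0} mod R
        Z = (X + y * N) >> k             # Z-phase: low k bits are zero, so shift
    return Z


# Illustrative 64-bit odd modulus with k = 16 and m = 5, so that m*k = 80 >= 64 + 2.
N = 0xF123456789ABCDEF
Z = mont_mul(0x0123456789ABCDEF, 0x0FEDCBA987654321, N, k=16, m=5)
```

Both phases perform one multiplication and one addition, which is the property that lets a single multiplier and adder be shared between them. With both operands below N and mk ≥ n+2, the result stays below 2N, so one conditional subtraction suffices if a fully reduced value is wanted.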
The output of the above hardware and method produces the product $AB\,2^{-mk} \bmod N$. To produce the more desirable result AB mod N, the method and circuit employed above is used a second time. In particular, the original output from this circuit is supplied to one of its input registers with the other register containing the factor $2^{2mk} \bmod N$. This factor eliminates the first factor of $2^{-mk}$ added during the first calculation and also cancels the additional factor of $2^{-mk}$ included when the circuit is run the second time. This produces the result AB mod N.
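The second pass can be checked numerically with a short sketch. The names `mont_mul` and `mod_mul` are hypothetical, the constant $2^{2mk} \bmod N$ is computed here with Python's built-in `pow` purely for convenience, and a final `% N` is added because the loop output is only guaranteed to be congruent to the product, not fully reduced.

```python
def mont_mul(A: int, B: int, N: int, k: int, m: int) -> int:
    """Return Z congruent to A * B * 2**(-m*k) modulo N (N odd, m*k >= n + 2)."""
    R = 1 << k
    mask = R - 1
    s = (-pow(N & mask, -1, R)) % R          # s = -1/N_0 mod R
    Z = 0
    for i in range(m):
        X = Z + ((A >> (i * k)) & mask) * B  # X-phase
        y = (s * (X & mask)) & mask
        Z = (X + y * N) >> k                 # Z-phase
    return Z


def mod_mul(A: int, B: int, N: int, k: int, m: int) -> int:
    """Compute A * B mod N with two passes through mont_mul."""
    first = mont_mul(A, B, N, k, m)          # A * B * 2**(-m*k)
    correction = pow(2, 2 * m * k, N)        # 2**(2*m*k) mod N
    second = mont_mul(first, correction, N, k, m)
    return second % N                        # final reduction below N


# Illustrative values only.
product = mod_mul(0x0123456789ABCDEF, 0x0FEDCBA987654321,
                  0xF123456789ABCDEF, k=16, m=5)
```

The first pass leaves a factor of $2^{-mk}$; multiplying by $2^{2mk}$ in the second pass contributes $2^{2mk} \cdot 2^{-mk} = 2^{mk}$, which cancels it. The constraint mk ≥ n+2 matters here because the unreduced first result, which may be as large as 2N, must still fit in the m blocks of the input register.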
For those who wish to practice the processes of the present invention via software, it is noted that the algorithm for multiplication provided above is readily implementable in any standard procedure-based programming language with the resulting code, in either source or object form, being readily storable on any convenient storage medium, including, but certainly not limited to, magnetic or optical disks. This process is also eminently exploitable along with the use of the exponentiation processes described below, including processes for exponentiation based on the Chinese Remainder Theorem.
In the process described above it is noted that one of the process inputs is the variable "s". This value is calculated as a negative inverse modulo R. In order to facilitate the generation of this input signal, a special circuit for its generation is described herein. This circuit also takes advantage of existing hardware used in other parts of a processing element. In particular, it forms a part of the rightmost processor element in a chain.
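For readers working in software, the value of s can be obtained directly from its definition. The sketch below is an assumption-laden illustration and not the special circuit referred to above: the function name `negative_inverse` is hypothetical, and the inverse is computed with Python's built-in `pow(x, -1, R)` (Python 3.8 or later).

```python
def negative_inverse(N: int, k: int) -> int:
    """Return s = -1/N_0 mod R, where R = 2**k and N_0 is the low k bits of N."""
    R = 1 << k
    N0 = N & (R - 1)
    if N0 % 2 == 0:
        raise ValueError("N must be odd so that N_0 is invertible modulo R")
    return (-pow(N0, -1, R)) % R


# Small illustrative case: N = 197, k = 4, so R = 16 and N_0 = 5.
s = negative_inverse(197, 4)
```

The defining property is that $s N_0 + 1$ is a multiple of R. It is this property that forces the low order k bits of $X_i + y_i N$ to zero in each cycle of the multiplication.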
Note that, in the calculation shown above, $X_i$ and $Z_i$ are numbers having more than n bits. Accordingly, the multiplication and addition operations are carried out in relatively large circuits which are referred to herein as multiplier and adder arrays. In accordance with a preferred method of practicing the present invention, the adder array and multiplier array are split into sub-blocks. While this partitioning of hardware may be done using any convenient number of blocks, partitioning into blocks capable of processing k bits at a time is convenient. Thus, in the preferred embodiment, instead of employing one large multiplier array for processing two numbers having n+1 bits and k bits, with n being much greater than k, a plurality of separate k bit by k bit multipliers are employed. Additionally, it is noted that partitioning into processor element sub-blocks, while useful in and of itself, particularly for circuit layout efficiency, also ultimately makes it possible to operate the circuit in several pipelined modes.
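The replacement of one wide multiplier by several k bit by k bit multipliers can be illustrated arithmetically. The sketch below is only a numerical model under stated assumptions: the function name `wide_times_block` is hypothetical, and the carry handling between neighboring processor elements is represented simply by adding the shifted partial products.

```python
def wide_times_block(B: int, a: int, k: int, num_blocks: int) -> int:
    """Multiply a wide number B by a k-bit value a using only k-by-k products."""
    mask = (1 << k) - 1
    if a < 0 or a > mask:
        raise ValueError("a must be a k-bit value")
    if B < 0 or B >= 1 << (k * num_blocks):
        raise ValueError("B must fit in k * num_blocks bits")
    total = 0
    for j in range(num_blocks):
        B_j = (B >> (j * k)) & mask   # j-th k-bit block of B
        partial = B_j * a             # k-bit by k-bit product, at most 2k bits
        total += partial << (j * k)   # place the partial product at its weight
    return total


# Illustrative values: a 65-bit B, a 16-bit a, and five 16-bit blocks.
result = wide_times_block(0x1FFFFFFFFFFFFFFFF, 0xABCD, k=16, num_blocks=5)
```

Each loop iteration corresponds to the work that one k-bit slice would perform, and the sum of the shifted partial products equals the full product of B and a.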
In a first pipelined mode, the circuit is operated through a plurality of cycles, m, in which adjacent processor elements are operated in alternate phases. That is, in a first pipelined mode, if a processor element is in the X-phase, its immediate neighbors are operating in the Z-phase, and vice versa. In a second pipelined mode, the pipelined operation is continued but with new entries in the input registers (A and B) which now are also preferably partitioned in the same manner as the multiplier and adder arrays.
Since n is generally much greater than k (1,024 as compared to 32, for example) and since carry propagation through adder stages can contribute significantly to processing delays, the partitioning and pipelining together eliminate this source of circuit delay and the corresponding dependence of circuit operation times on the significant parameter n whose size, in cryptographic contexts, determines the difficulty of unwarranted code deciphering.
The pipelined circuit of the present invention is also particularly useful in carrying out exponentiation modulo N, an operation that is also very useful in cryptographic applications. Such an operation involves repeated multiplication operations. Accordingly, even though pipelining may introduce an initial delay, significant improvements in performance of exponentiation operations are produced.
In one embodiment found within the disclosure herein it has been noted that the chaining together of individually operating Processing Elements introduces an addition operation in a critical timing path, that is, into a path whose delayed execution delays the whole process. The present invention provides an improvement in the design of the individual Processing Elements through the placement of this addition operation in an earlier portion of the Processing Element's operation. In doing so, however, new control signals are also provided to make up for the fact that some signals in some of the Processing Elements are not yet available at this earlier stage and accordingly are, where convenient, provided from operations occurring or which have already occurred in adjacent Processing Elements.
The Processing Elements used herein are also specifically designed so that they may function in different capacities. In particular, it is noted that the rightmost Processing Element performs some operations that are unique to its position as the lower order Processing Element in the chain. Likewise the leftmost element has a unique role and can assume a simpler form. However, the Processing Elements employed herein are also specially designed and constructed so as to be able to adapt to different roles in the chain. In particular, the middle Processing Element is controllable so that it takes on the functional and operational characteristics of a rightmost Processing Element. In this way the entire chain is partitionable so that it forms two (or more, if needed) separate and independent chains operating (in preferred modalities) on factors of the large odd integer N.
While an intermediate object of the present invention is the construction of a modular multiplication engine, the ultimate goal is providing an apparatus for modular exponentiation. In the present invention this is carried out using the disclosed modular multiplier in a repeated fashion based on the binary representation of the exponent. A further improvement on this process involves use of the Chinese Remainder Theorem for those parts of the exponentiation operation in which the factors of N are known. The capability of the Processing Element chain of the present invention to be partitioned into two portions is particularly useful here since each portion of the controllably partitioned chain is able to work on one of the factors of N in an independent and parallel manner.
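A software sketch may help make the repeated use of the multiplier concrete. Everything named below (`blocks_needed`, `mont_mul`, `mont_exp`, `crt_exp`) is hypothetical and chosen for this illustration. The sketch assumes a left-to-right scan of the exponent bits, assumes that N is the product of two distinct odd primes p and q in the Chinese Remainder Theorem case, and uses Python's built-in `pow` (Python 3.8 or later) for the auxiliary constants. It runs the two half-size exponentiations one after the other, whereas the partitioned chain described above would run them in parallel.

```python
def blocks_needed(N: int, k: int) -> int:
    """Smallest m with m*k >= n + 2, where n is the bit length of N."""
    return -(-(N.bit_length() + 2) // k)


def mont_mul(A: int, B: int, N: int, k: int, m: int) -> int:
    """Return Z congruent to A * B * 2**(-m*k) modulo N (N odd)."""
    R = 1 << k
    mask = R - 1
    s = (-pow(N & mask, -1, R)) % R          # s = -1/N_0 mod R
    Z = 0
    for i in range(m):
        X = Z + ((A >> (i * k)) & mask) * B  # X-phase
        y = (s * (X & mask)) & mask
        Z = (X + y * N) >> k                 # Z-phase
    return Z


def mont_exp(base: int, exponent: int, N: int, k: int) -> int:
    """Compute base**exponent mod N by repeated calls to mont_mul."""
    m = blocks_needed(N, k)
    C = pow(2, 2 * m * k, N)                 # 2**(2*m*k) mod N
    x = mont_mul(base % N, C, N, k, m)       # base scaled by 2**(m*k)
    acc = mont_mul(1, C, N, k, m)            # the value 1 scaled by 2**(m*k)
    for bit in bin(exponent)[2:]:            # most significant bit first
        acc = mont_mul(acc, acc, N, k, m)    # square
        if bit == "1":
            acc = mont_mul(acc, x, N, k, m)  # multiply
    return mont_mul(acc, 1, N, k, m) % N     # remove the scaling and reduce


def crt_exp(base: int, exponent: int, p: int, q: int, k: int) -> int:
    """Compute base**exponent mod p*q from two half-size exponentiations."""
    r_p = mont_exp(base % p, exponent % (p - 1), p, k)
    r_q = mont_exp(base % q, exponent % (q - 1), q, k)
    q_inv = pow(q, -1, p)
    h = (q_inv * (r_p - r_q)) % p            # recombination step
    return r_q + h * q


# Illustrative values: small textbook primes with 4-bit blocks.
value = crt_exp(65, 17, 61, 53, k=4)
```

Keeping the running value in its scaled form throughout the loop means the correction by $2^{2mk} \bmod N$ is needed only when entering the loop, and a single multiplication by 1 removes the scaling at the end. The exponent reduction modulo p-1 and q-1 relies on the base not being a multiple of p or q, which is an assumption of this sketch.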
Since one wishes to operate computational circuits at as high a speed as possible and since this can sometimes lead to erroneous operations, there is a challenge in how to detect errors when the operations being performed are not based on standard arithmetic, but are rather based on modular arithmetic (addition, subtraction, inversion, multiplication and exponentiation). However, the present invention solves this problem through the use of circuits and methods which are not only consonant with the complicating requirements of modular arithmetic operations but which are also capable of being generated on the fly with the addition of only a very small amount of additional hardware and with no penalty in time of execution or throughput.
Accordingly, it is seen that it is an object of the present invention to produce a multiplier for multiplying two large integers modulo N.
It is yet another object of the present invention to improve the performance and capabilities of cryptographic circuits and systems.
It is a still further object of the present invention to create a multiplier circuit which operates at high speed.
It is yet another object of the present invention to create a multiplier circuit which performs multiplication modulo N without having to perform division operations.
It is also an object of the present invention to provide a multiplier which is scalable for various values of N and n.
It is also another object of the present invention to provide a method for computing a product of two integers modulo N in a multi-phase process which permits sharing of hardware circuitry across the two phases.
It is yet another object of the present invention to provide a system and method in which the factors are partitioned into a plurality of m sub-blocks with each sub-block having k bits, whereby values for m and k are selectable so as to provide additional flexibility in hardware structure.
It is also another object of the present invention to increase the speed of multiplication calculations in cryptographic processes.
It is also an object of the present invention to provide an implementation for a multiplier circuit which uses macro components as building blocks so as to avoid the costs associated with custom design.
It is also an object of the present invention to provide a design which is flexible and scalable.
It is also an object of the present invention to provide a word-oriented, as opposed to a bit-oriented, multiplication system and circuit.
It is a still further object of the present invention to construct a circuit for multiplication modulo N which comprises a plurality of nearly identical processor elements.
It is yet another object of the present invention to partition the multiplication of an n bit number into a plurality of pieces for quasi-independent calculation.
It is still another object of the present invention to operate the circuit herein in a pipelined mode.
It is an even further object of the present invention to operate the circuit herein so as to process sequences of distinct operands (factors) in a pipelined mode.
It is yet another object of the present invention to improve the performance of a sequence of chained Processing Elements by removing addition functions from critical paths.
It is a still further object of the present invention to operate the circuit herein so as to process sequences of identical or repeated operands in a pipelined mode, as for example, in the calculation of the exponential function modulo N.
It is yet another object of the present invention to increase the speed of exponentiation operations in cryptographic processes.
It is a still further object of the present invention to provide Processing Elements whose character as beginning, middle or end units in the chain may be controlled so as to enable the partitioning of the chain into a plurality of sub-chains each of which is capable of independent parallel processing based on a factor of N.
It is also an object of the present invention to provide a mechanism for calculating an inverse operation which is useful as an input to the method of modular multiplication employed herein.
It is yet another object of the present invention to provide an apparatus and method for generating checksums which are useful for indicating that the system has operated in a proper fashion and has produced no errors.
It is a still further object of the present invention to provide a checksum circuit and method which is consonant with modular arithmetic.
It is also an object of the present invention to provide an engine which is capable of data encryption through the use of exponentiation modulo N, a large prime or the product of two large primes.
It is a further object of the present invention to provide an engine which is capable of data decryption through the use of exponentiation modulo N.
It is yet another object of the present invention to employ the Chinese Remainder Theorem to facilitate the exponentiation operation modulo N when factors for N are known.
It is also an object of the present invention to provide an encryption/decryption engine which is capable of operating in the mode of public key cryptographic systems.
It is also an object of the present invention to provide an engine which is capable of generating and receiving documents having coded digital signatures.
It is also an object of the present invention to provide an engine which is capable of generating keys to be exchanged between any two users for data encryption and decryption.
It is also an object of the present invention to produce a high-speed, high-performance cryptographic engine.
Lastly, but not limited hereto, it is an object of the present invention to provide a cryptographic engine for encryption and for decryption which can be included as part of a larger processing system and therefore possesses communication capabilities for the transfer of data and command information from other parts of a larger scale data processing system with which the present engine is coupled.
The recitation herein of a list of desirable objects which are met by various embodiments of the present invention is not meant to imply or suggest that any or all of these objects are present as essential features, either individually or collectively, in the most general embodiment of the present invention or in any of its more specific embodiments.