1. Field of the Invention
This invention relates to high-radix finite field multiplication methods and architectures. Finite field multiplication constitutes an integral operation in cryptographic systems.
2. Field Multiplication in Cryptography
Due to the security requirements of data transfer and digitization, the need for cryptography as a data security mechanism is ever growing. The basic principle of cryptography is that a plaintext is converted into a ciphertext through encryption using a particular encryption key. When a receiver receives the ciphertext, decryption with a key that is related to the encryption key can recover the plaintext. Because the data is transferred or stored as ciphertext, data security is achieved since an adversary cannot interpret the ciphertext.
In particular, public-key cryptography enables secure communication between users that have not previously agreed upon any shared keys. This is most often done using a combination of symmetric and asymmetric cryptography: public-key techniques are used to establish user identity and a common symmetric key, and a symmetric encryption algorithm is used for the encryption and decryption of the actual messages. The former operation is called key agreement. Prior key establishment is necessary in symmetric cryptography, which uses algorithms for which the same key is used to encrypt and decrypt a message. Public-key cryptography, in contrast, is based on key pairs. A key pair consists of a private key and a public key. As the names imply, the private key is kept private by its owner, while the public key is made public (and typically associated with its owner in an authentic manner). In asymmetric encryption, the encryption step is performed using the public key, and decryption using the private key. Thus the encrypted message can be sent along an insecure channel with the assurance that only the intended recipient can decrypt it.
The use of cryptographic key pairs is disclosed in U.S. Pat. No. 4,200,770 of Hellman, which is incorporated herein in its entirety by reference. The Hellman patent also discloses the application of key pairs to the problem of key agreement over an insecure communication channel. The algorithms specified in the aforementioned patent rely for their security on the difficulty of the mathematical problem of finding a discrete logarithm.
In order to undermine the security of a discrete-logarithm based crypto-algorithm, an adversary must be able to perform the inverse of finite field exponentiation (i.e., solve a discrete logarithm problem). There are mathematical methods for finding a discrete logarithm (e.g., the Number Field Sieve), but these algorithms would take an unreasonably long time, even using sophisticated computers, when certain conditions are met in the specification of the crypto algorithm.
In particular, it is necessary for the numbers involved to be large enough. The larger the numbers used, the more time and computing power are required to find the discrete logarithm and break the cryptosystem. On the other hand, very large numbers lead to very long public keys and transmissions of cryptographic data. The use of very large numbers also requires large amounts of time and computational power in order to perform the crypto algorithm. Thus, cryptographers are always looking for ways to minimize the size of the numbers involved, and the time and power required, in performing the encryption and/or authentication algorithms. The payoff for finding such a method is that cryptography can be done faster, cheaper, and in devices that do not have large amounts of computational power (e.g., hand-held smart-cards).
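As an illustration of key agreement and of the discrete logarithm problem described above, the following Python sketch works in a deliberately tiny prime field. The parameters P = 23 and G = 5, the private keys, and all function names are assumptions chosen for illustration; they are not taken from the Hellman patent, and practical systems use numbers that are hundreds or thousands of bits long.

```python
# Toy key agreement over GF(P). The parameters are illustrative assumptions
# and are far too small to provide any security.
P = 23   # small prime modulus (assumed)
G = 5    # base element (assumed)


def public_value(private_key):
    """Field exponentiation: G^private_key mod P."""
    return pow(G, private_key, P)


def shared_secret(their_public, my_private):
    """Each party raises the other's public value to its own private key."""
    return pow(their_public, my_private, P)


def brute_force_dlog(target):
    """Find the smallest k with G^k = target (mod P) by exhaustive search.

    This is feasible only because P is tiny; the cost grows with the
    size of the group.
    """
    value = 1
    for k in range(P - 1):
        if value == target:
            return k
        value = (value * G) % P
    raise ValueError("target is not a power of G")


alice_private, bob_private = 6, 15          # hypothetical private keys
alice_public = public_value(alice_private)
bob_public = public_value(bob_private)
```

Both parties arrive at the same value by calling `shared_secret` with the other party's public value, while an eavesdropper who sees only the public values must resort to something like `brute_force_dlog`.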
A discrete-logarithm based crypto-algorithm can be performed in any mathematical setting in which certain algebraic rules hold true. In mathematical language, the setting must be a finite cyclic group. The choice of the group is critical in a cryptographic system. The discrete logarithm problem may be more difficult in one group than in another for which the numbers are of comparable size. The more difficult the discrete logarithm problem, the smaller the numbers that are required to implement the crypto-algorithm while maintaining the same level of security. Working with smaller numbers is easier, more efficient and requires less storage compared to working with larger numbers. So, by choosing the right group, a user may be able to work with smaller numbers, make a faster cryptographic system, and get the same, or better, cryptographic strength than from another cryptographic system that uses larger numbers.
A method of adapting discrete-logarithm based algorithms to the setting of elliptic curves was disclosed independently by V. Miller, in "Use of elliptic curves in cryptography," Advances in Cryptology - CRYPTO '85, LNCS 218, pp. 417-426 (1986), and by N. Koblitz, in "A Course in Number Theory and Cryptography" (1987). It appears that finding discrete logarithms in this kind of group is particularly difficult. Thus elliptic curve-based crypto algorithms can be implemented using much smaller numbers than in a finite-field setting of comparable cryptographic strength. Hence, the use of elliptic curve cryptography is an improvement over finite-field based public-key cryptography. The Elliptic Curve Cryptosystem relies upon the difficulty of the Elliptic Curve Discrete Logarithm Problem (ECDLP) to provide its effectiveness as a cryptosystem. Using multiplicative notation, the problem can be described as: given points B and Q in the group, find a number k such that $B^k = Q$, where k is called the discrete logarithm of Q to the base B. Using additive notation, the problem becomes: given two points B and Q in the group, find a number k such that $kB = Q$.
In an Elliptic Curve Cryptosystem, the integer k is kept private and is often referred to as the secret key. The point Q together with the base point B are made public and are referred to as the public key. The security of the system, thus, relies upon the difficulty of deriving the secret k, knowing the public points B and Q. The main factor that determines the security strength of such a system is the size of its underlying finite field. In a real cryptographic application, the underlying field is made so large that it is computationally infeasible to determine k in a straightforward way by computing all the multiples of B until Q is found.
The core of elliptic curve arithmetic is an operation called scalar multiplication, which computes kB by adding together k copies of the point B. Scalar multiplication over an elliptic curve is also referred to as exponentiation over an elliptic curve.
Another well-known public key cryptosystem is Rivest, Shamir, Adleman (RSA), which is also based on the discrete logarithm problem over the field GF(p). RSA requires field exponentiation over GF(p).
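Scalar multiplication is not normally carried out by k literal additions. A minimal sketch of the common double-and-add approach is given below in Python. The curve y^2 = x^3 + 2x + 2 over GF(17), the base point (5, 1), and the function names are assumptions chosen as a small textbook-style example and are not part of this disclosure. The sketch requires Python 3.8 or later for modular inversion via `pow`.

```python
# Double-and-add scalar multiplication on a toy curve y^2 = x^3 + A*x + 2
# over GF(P_MOD). Curve and base point are illustrative assumptions.
P_MOD = 17
A = 2


def inverse(x):
    """Multiplicative inverse in GF(P_MOD)."""
    return pow(x % P_MOD, -1, P_MOD)


def point_add(p1, p2):
    """Add two points; None represents the point at infinity."""
    if p1 is None:
        return p2
    if p2 is None:
        return p1
    x1, y1 = p1
    x2, y2 = p2
    if x1 == x2 and (y1 + y2) % P_MOD == 0:
        return None                      # p2 is the negative of p1
    if p1 == p2:
        slope = (3 * x1 * x1 + A) * inverse(2 * y1) % P_MOD   # doubling
    else:
        slope = (y2 - y1) * inverse(x2 - x1) % P_MOD          # addition
    x3 = (slope * slope - x1 - x2) % P_MOD
    y3 = (slope * (x1 - x3) - y1) % P_MOD
    return (x3, y3)


def scalar_mul(k, base):
    """Compute k*base by scanning the bits of k from least significant."""
    result = None
    addend = base
    while k:
        if k & 1:
            result = point_add(result, addend)
        addend = point_add(addend, addend)
        k >>= 1
    return result
```

Every call to `point_add` performs several multiplications and one inversion in GF(17), which is why the cost of scalar multiplication is dominated by field multiplication.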
In what follows, field exponentiation is used to refer to scalar multiplication in elliptic curve cryptography, as well as finite field exponentiation in RSA cryptography.
A drawback of such cryptographic systems is that calculation of field exponentiation remains a daunting mathematical task even to an authorized receiver using a high speed computer. With the prevalence of public computer networks used to transmit confidential data for personal, business and governmental purposes, it is anticipated that most computer users will want cryptographic systems to control access to their data. Despite the increased security, the difficulty of finite field exponentiation calculations will substantially drain computer resources and degrade data throughput rates, and thus represents a major impediment to the widespread adoption of commercial cryptographic systems.
Accordingly, a critical need exists for an efficient finite field exponentiation method and apparatus to provide a sufficient level of communication security while minimizing the impact to computer system performance and data throughput rates.
Field exponentiation over an elliptic curve is heavily dependent on field multiplications. Point addition and doubling, which are the basic operations for exponentiation over an elliptic curve, require many field multiplication operations.
Field multiplication is also the basic operation used to compute field exponentiation in RSA cryptography, which is used in smart cards or other secure devices. RSA requires the rapid finite field multiplication of large integers, for example 512 or 1024 bits in length, in order to carry out finite field exponentiations.
Therefore, field multiplication is widely used in encryption/decryption, authentication, key distribution and many other applications such as those described above.
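The dependence of field exponentiation on field multiplication can be sketched with the textbook square-and-multiply method, shown here in Python. The function name and the small operands are illustrative assumptions; the sketch simply counts how many modular multiplications one exponentiation consumes.

```python
def mod_exp(base, exponent, modulus):
    """Left-to-right square-and-multiply.

    Returns (base**exponent mod modulus, number of modular multiplications).
    Each squaring and each conditional multiply is one field multiplication.
    """
    result = 1
    multiplications = 0
    base %= modulus
    for bit in bin(exponent)[2:]:          # most significant bit first
        result = (result * result) % modulus
        multiplications += 1
        if bit == "1":
            result = (result * base) % modulus
            multiplications += 1
    return result, multiplications


value, count = mod_exp(7, 65537, 3233)     # toy operands (assumed)
```

For an exponent of n bits the loop performs n squarings and up to n further multiplications, so a 1024-bit exponentiation requires on the order of a thousand or more 1024-bit field multiplications.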
3. Prior Art of Field Multiplication
Finite field multiplication can be described as follows. Two elements X and Y that are members of a field can be represented as polynomials,
$$X = \sum_{i=0}^{d-1} x_i \mu^i \qquad (1)$$

$$Y = \sum_{i=0}^{d-1} y_i \mu^i \qquad (2)$$

where
(i) in the case of GF(p), $\mu = 2^r$ is the radix and $x_i$, $y_i$ are radix-$2^r$ digits which could also be represented in redundant form, and
(ii) in the case of GF($p^m$), $\mu = \alpha^r$, $\alpha$ is the root of the generator polynomial, and $x_i$ and $y_i$ are digit polynomials of X and Y respectively, whose coefficients are elements of $Z_p$.

The field multiplication of X and Y is given by

$$R = XY \bmod G(\mu) \qquad (3)$$

where

$$G(\mu) = \sum_{i=0}^{d} g_i \mu^i \qquad (4)$$

where
(i) in the case of GF(p), $G(\mu)$ is the modulus M and $g_i$ are radix-$2^r$ digits which could also be represented in redundant form, and
(ii) in the case of GF($p^m$), $G(\mu)$ is the generator polynomial, and $g_i$ is a digit polynomial of $G(\mu)$ with coefficients that are elements of $Z_p$.
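For the GF(p) case, the digit representation of equations (1), (2) and (4) and the product of equation (3) can be sketched in Python as follows. The helper names and the example values (r = 4, d = 4, modulus 65521) are assumptions for illustration only; redundant digit forms are not modelled.

```python
def to_digits(value, r, d):
    """Split an integer into d radix-2^r digits, least significant first."""
    mask = (1 << r) - 1
    return [(value >> (r * i)) & mask for i in range(d)]


def from_digits(digits, r):
    """Evaluate sum(digit_i * mu^i) with mu = 2^r, as in equations (1), (2), (4)."""
    return sum(digit << (r * i) for i, digit in enumerate(digits))


def field_mul(x_digits, y_digits, g_digits, r):
    """Reference result R = X*Y mod G(mu) of equation (3) for GF(p)."""
    x = from_digits(x_digits, r)
    y = from_digits(y_digits, r)
    modulus = from_digits(g_digits, r)
    return (x * y) % modulus


R_BITS, D = 4, 4                                  # radix 16, four digits (assumed)
x_digits = to_digits(0xBEEF, R_BITS, D)
y_digits = to_digits(0x1234, R_BITS, D)
g_digits = to_digits(65521, R_BITS, D + 1)        # G has digits g_0 .. g_d
product = field_mul(x_digits, y_digits, g_digits, R_BITS)
```

This reference version multiplies first and reduces afterwards; it serves only as a baseline against which digit-serial algorithms can be checked.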
Finite field multiplication consists of two stages: operand multiplication and field reduction. Several field multiplication algorithms have been proposed. It is well known that field reduction after operand multiplication is not as efficient as reduction during multiplication. In the latter, the final result of field multiplication is obtained by applying corrections to partial products as they are generated and accumulated.
Field reduction of the generated partial products can be carried out iteratively starting from the most significant digit or the least significant digit.
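Reduction during multiplication, starting from the most significant digit, can be sketched for GF(p) as the following Python loop. The function name and parameters are assumptions; the `% modulus` step stands in for whatever correction logic a hardware structure would apply to each partial result, and is not a description of any particular prior-art circuit.

```python
def msd_first_modmul(x, y, modulus, r, d):
    """Most-significant-digit-first multiplication with interleaved reduction.

    x is consumed one radix-2^r digit per iteration, starting with digit d-1.
    The running result is reduced in every iteration, so it never grows
    beyond the size of one partial product.
    """
    mu = 1 << r
    mask = mu - 1
    result = 0
    for i in reversed(range(d)):
        x_i = (x >> (r * i)) & mask
        result = (result * mu + x_i * y) % modulus   # shift, accumulate, reduce
    return result
```

Because each iteration multiplies the running result by the radix before adding the next partial product, the final value is X*Y mod M with no extra scaling factor.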
Montgomery multiplication is a famous reduction-during-multiplication algorithm, which starts from the least significant position. The original Montgomery algorithm was proposed for GF(p) and is based on binary representation of the elements of the field. However, it has been generalized for other fields as well as for higher radix representation of the elements of the field.
In Montgomery multiplication, each digit of the operand X, starting with the least significant digit, is multiplied with the digits of the operand Y. The field reduction is carried out from the least significant digit, since the least significant digits are generated first. The algorithm can be described using the following iteration,

$$R_{l-1} = (x_{l-1} Y + \mu^{-1} R_{l-2}) \bmod G(\mu) \qquad (5)$$

The final field multiplication result of the Montgomery algorithm has the form

$$R = XY\mu^{-d} \bmod G(\mu) \qquad (6)$$
The factor $\mu^{-d}$ is inherent in all least-significant-digit-first algorithms, because field reduction starting from the least significant digit is equivalent to a division by $\mu$ (that is, a multiplication by $\mu^{-1}$) in each iteration, as shown in equation (5).
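A minimal Python sketch of high-radix Montgomery multiplication for GF(p) is given below. It follows the commonly published digit-serial formulation (accumulate a partial product, add a multiple of the modulus that clears the lowest digit, then shift right by one digit) rather than the exact ordering of any particular prior-art structure. The function name, the parameter names and the helper constant `m_prime` are assumptions, the modulus is assumed odd with both operands smaller than the modulus, and Python 3.8 or later is assumed for modular inversion via `pow`.

```python
def montgomery_mul(x, y, modulus, r, d):
    """Least-significant-digit-first multiplication with interleaved reduction.

    Returns x * y * mu^(-d) mod modulus, where mu = 2^r.
    Assumes modulus is odd and 0 <= x, y < modulus <= mu^d.
    """
    mu = 1 << r
    mask = mu - 1
    m_prime = (-pow(modulus, -1, mu)) % mu     # -modulus^(-1) mod mu
    result = 0
    for i in range(d):
        x_i = (x >> (r * i)) & mask
        result += x_i * y                       # accumulate partial product
        q = ((result & mask) * m_prime) & mask  # digit that clears the low digit
        result = (result + q * modulus) >> r    # exact division by mu
    if result >= modulus:                       # single final correction
        result -= modulus
    return result
```

The right shift in each of the d iterations is the division by the radix discussed above, which is where the overall factor of $\mu^{-d}$ in equation (6) comes from.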
Least-significant-digit-first algorithms are more natural for GF(p) due to the carry propagation from the least to the most significant digits. However, these algorithms are not usually used for GF($p^m$) since there is no carry propagation between digits of different significance. For GF($p^m$), most-significant-digit-first algorithms are more efficient since they do not result in any additional scaling factor such as that in equation (6), and produce the final un-scaled result at the end of the multiplication algorithm.
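The absence of carries in the extension-field case can be illustrated for p = 2 with a most-significant-bit-first multiplier, sketched below in Python. Field elements are held as integers whose bits are the polynomial coefficients, addition is a bitwise XOR, and the function and parameter names are assumptions. The check value uses the GF(2^8) field of the AES standard (generator polynomial 0x11B), in which {57} times {83} equals {C1}.

```python
def gf2m_mul(x, y, generator, m):
    """Most-significant-bit-first multiplication in GF(2^m), radix 2.

    x, y      -- field elements as bit masks of degree < m
    generator -- generator polynomial including its x^m term
    Addition is XOR, so no carries propagate between bit positions.
    """
    result = 0
    for i in reversed(range(m)):
        result <<= 1                     # multiply running result by alpha
        if (result >> m) & 1:
            result ^= generator          # reduce: cancel the x^m term
        if (x >> i) & 1:
            result ^= y                  # add the partial product x_i * y
    return result


AES_POLY = 0x11B                         # x^8 + x^4 + x^3 + x + 1
product = gf2m_mul(0x57, 0x83, AES_POLY, 8)
```

The loop ends with the unscaled product, in contrast to the least-significant-digit-first case, which leaves a factor of $\mu^{-d}$.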
Least-significant-digit-first and most-significant-digit-first algorithms can be implemented using a variety of structures. Most of the existing structures are effectively based on different time-space mappings of the dataflow diagrams. A dataflow diagram consists of nodes and edges. The nodes represent functional units and the edges represent flow of data between nodes.
It should be noted that for GF($p^m$), the carry edges do not exist since there is no carry propagation between digits of different significance.
On one end of the spectrum, an array of processing elements is used, where each processing element performs the computation of a single node in graphs such as those shown in FIG. 1. Such realizations are usually referred to as parallel implementations.
Parallel finite-field multipliers have been reported in the U.S. Patent of Geong, U.S. Pat. No. 6,151,393 and in the published applications of Chen et al., Number 2002/0184281 and Glaser et al., Number 2003/0009503 each of which is incorporated herein in its entirety.
Parallel implementations are not efficient for large word-lengths such as those encountered in cryptography due to the huge area and high power requirements.
The other end of the spectrum is to use one processing element to implement the operation of all the nodes in the graphs such as those in FIG. 1. Such realizations are usually referred to as serial implementations. Serial implementations are not efficient for large word lengths such as those encountered in cryptography due to their long execution time.
Other variations of the basic serial finite field multiplication algorithms have also been proposed where a digit of one operand is multiplied by a block of another operand as shown in the published patent applications of Chin-Long Chen et al., Numbers 2002/0116429, 2002/0116430 and 2002/0110240 each of which is incorporated herein in its entirety. A single digit-block multiplier is used to carry out the complete multiplication in a serial fashion. Usually, the block size is larger than the digit size to speed up the computations of the basic serial multiplier. As a consequence, such multiplication algorithms require different bus-widths between the memory module and the digit-block multiplier module.
Another implementation style is mapping dataflow graphs such as those in FIG. 1 into a linear array of processing elements. The mapping can be done in one of two ways: (i) by projecting along the vertical axis to obtain the serial-parallel implementation as disclosed in an article by M. C. Mekhallalati, M. K. Ibrahim, and A. S. Ashur, entitled "Radix Modular Multiplication Algorithm" published in the Journal of Circuits, Systems, and Computers, Vol. 6, No. 5, pages 547-567, 1996, or (ii) by projecting along the horizontal axis to obtain the serial-serial implementation as disclosed by A. F. Tenca and C. K. Koc, in "A Scalable Architecture for Montgomery Multiplication" in Cryptographic Hardware and Embedded Systems, Lecture Notes in Computer Science No. 1717, pages 94-108, Springer Verlag, Berlin, Germany, 1999, each of which is incorporated herein in its entirety by reference.
It is well known that using more than one processing element gives a better trade-off between area and time than the parallel or serial realization.
In serial-parallel structures, the digits of one of the input operands are fed serially while the digits of the other input operand must be fed in parallel. The reason is that the digits of one of the operands are all required at every cycle.
Serial-parallel realizations have been reported in the above mentioned patents and articles.
In addition, serial-parallel structures have been reported in the U.S. Pat. No. 5,414,651 of Kassels entitled “Arithmetic Unit for Multiplying Long Integers Modulo M and R.S.A. Converter Provided With Such Multiplication Device” and U.S. Pat. No. 6,377,969 of Orlando et al. entitled “Method for Multiplication in Galois Fields Using Programmable Circuits” both of which are incorporated herein in their entirety by reference.
The advantage of the serial-parallel realization is that the first digit of the product needed to perform field reduction is obtained after an initial delay of one cycle. Defining the initial delay as the number of cycles required before the first digit of the result needed for field reduction is obtained, it is clear that the serial-parallel realization inherently has an initial delay which is independent of the word length. The drawbacks of the serial-parallel realizations are as follows:
(i) It requires parallel loading of the digits of one of the operands, since the digits of one of the operands are all required at every cycle. It is significant to note that the use of parallel transfer of data back and forth between memory and the multiplication structures is a major drawback for large word lengths such as those encountered in cryptography. The reason is that parallel loading for large word lengths requires large bus widths, which are costly in area and have a significant impact on the execution time.
(ii) It also has to support serial loading of the other operand, and hence the hardware must support buses with different bus widths between the memory module and the multiplier module, one being for parallel communication and one for serial communication.
The major advantage of serial-serial implementations based on dataflow such as those in FIG. 1 is that all the operand communications are serial in nature and hence no parallel loading is needed.
It should be noted that both serial multiplication structures and serial-serial multiplication realizations require only serial communication for all operands and hence require buses with the same bus width. Serial-serial multipliers have the advantage of requiring fewer memory accesses than their serial counterparts. Also, to achieve the same execution time, the bus width of the serial multiplier needs to be higher than that needed in the serial-serial multiplier. Both of these features make the serial-serial multiplier ideal for low power devices such as smart cards.
It should be noted that the serial-parallel multipliers in the aforementioned patents can be modified by loading the operand whose digits are needed at every cycle serially, one digit at a time, into a serial-in-parallel-out register. The drawback is that this will incur an initial delay prior to the commencement of the multiplication operation, which is dependent on the word length. The structures disclosed in U.S. Pat. No. 5,742,534 of Monier and U.S. Pat. No. 4,797,848 of Walby use a serial-in-parallel-out register and therefore suffer from the same drawback.
Serial-serial structures that do not require all the digits of all the operands in every cycle have been proposed. This can be achieved by folding the structure in FIG. 1 along the horizontal axis. One such structure was reported in the aforementioned article by A. F. Tenca and C. K. Koc. However, such structures have a major drawback in that the initial delay and clock cycle are dependent on the word length of the operands. For large word lengths such as those in cryptography, this long initial delay and clock cycle represent a significant drawback.
The authors of the structure in the aforementioned A. F. Tenca and C. K. Koc article use extra pipelining registers to make the initial delay and the clock cycle independent of the word length. It is significant to note that the use of extra pipelining registers will significantly increase the hardware and will require additional clock cycles, which are proportional to the word length.
A serial-serial finite field reduction structure is disclosed in U.S. Pat. No. 5,101,431 of Even for "Systolic Array for Modular Multiplication". However, as disclosed therein, the initial delay and clock cycle are dependent on the word length of the operands. For large word lengths such as those in cryptography, the long initial delay and clock cycle are a drawback. The patent also discloses the use of extra pipelining registers to make the initial delay and the clock cycle independent of the word length. It is significant to note that the use of extra pipelining registers will significantly increase the hardware and will require additional clock cycles, which are proportional to the word length.
Another serial-serial structure is disclosed in a published patent application of Mellott et al. Number 2002/0161810 entitled “Method and Apparatus for Multiplication and/or Modular Reduction Processing.” In this structure the initial delay and clock-cycle are also dependent on the word-length. However, to reduce the effect of word length dependency, the inventors use a multi-port adder, which may be implemented for example as a tree adder. This will reduce the effect of word length dependency, but it will not eliminate it completely.
It is also significant to note that the use of a multi-port adder will lead to an irregular structure, and hence the structure is not systolic, i.e., it is not modular and requires irregular communication.
Therefore there is a need for a realization which (i) allows serial loading of all needed operands, and (ii) has an initial delay and clock cycle which are inherently independent of the word length.
It should also be noted that any new realization must allow for scalability, which is an additional requirement for the application of finite field multiplication in cryptography. Scalable multiplication structures are those that can be reused or replicated in order to generate long-precision results independently of the data path precision for which the unit was originally designed. Scalability is needed because the key lengths could be increased for higher security. Any cryptographic hardware such as smart cards should not become obsolete with a new key length.