1. Field of the Invention
The present invention generally relates to an apparatus and method for implementing encryption and decryption using the AES (Advanced Encryption Standard) algorithm, and more particularly to a technique for reducing the hardware used for encrypting and decrypting with the AES algorithm.
2. Description of the Related Art
Inexpensive high-speed Internet access technologies, including optical fiber networks, ADSL (asymmetric digital subscriber line) networks, and cable television networks, promote the use of VPN (virtual private network) technologies, which provide secure communications through public communication networks. The use of a VPN eliminates the necessity of expensive private links, and thereby reduces communications cost.
Typical VPN technologies adopt the US government's data encryption standard (DES) algorithm, which uses 56-bit common keys, or the Triple-DES algorithm, which uses three passes of the DES algorithm. Nevertheless, these algorithms do not satisfy recent requirements; the DES algorithm no longer seems sufficient to provide the necessary security, while the Triple-DES algorithm requires a large amount of processing.
The AES algorithm, which is based on the Rijndael algorithm, is a promising candidate for the next-generation encryption algorithm for VPN. The AES algorithm is at least as secure as the Triple-DES algorithm, and is superior to it in efficiency. This situation necessitates encryption and decryption platforms adapted to the AES algorithm, including AES-dedicated hardware and software.
Federal Information Processing Standards Publication 197, hereinafter referred to as FIPS 197, the entire disclosure of which is incorporated herein by reference, presents the procedure of implementing the AES algorithm.
The input for the AES algorithm consists of sequences of 128 bits, which are referred to as blocks. The AES algorithm divides the 128-bit input into 16 bytes (each consisting of 8 bits), and arranges the 16 bytes to generate a two-dimensional array of bytes called the state. The state consists of four rows and four columns of bytes. The AES algorithm's operations are performed on states. The input, which is the array of bytes in0, in1 . . . in15, is copied into the state array as illustrated in FIG. 1. The (i, j) element of the state is hereinafter denoted by Si,j for 0≦i, j≦3.
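By way of illustration only, the copying of the input bytes into the state may be sketched in Python as follows. The sketch assumes the column-by-column mapping specified in FIPS 197, in which the element in row r and column c of the state receives the input byte in(r + 4c); the function names are hypothetical and are not taken from any of the cited documents.

```python
def bytes_to_state(block):
    """Copy a 16-byte block into a 4x4 state: state[r][c] = in[r + 4*c]."""
    if len(block) != 16:
        raise ValueError("an AES block is exactly 16 bytes")
    return [[block[r + 4 * c] for c in range(4)] for r in range(4)]


def state_to_bytes(state):
    """Read the state back out column by column (inverse of bytes_to_state)."""
    return bytes(state[r][c] for c in range(4) for r in range(4))


block = bytes(range(16))        # in0 .. in15 hold the values 0 .. 15
state = bytes_to_state(block)
for row in state:
    print(row)                  # first row printed is [0, 4, 8, 12]
```

In this layout each column of the state holds four consecutive input bytes, which is why the column-oriented transformations described below operate on 32-bit groups of the input.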
FIG. 2 is a flowchart illustrating the encrypting procedure using the AES algorithm, which is described in the pseudo code in FIPS 197.
For encryption, the AES algorithm involves repeatedly implementing a set of transformations called a “round”. The number of rounds, which is denoted by Nr, depends on the key length. The numbers of rounds for the key lengths of 128, 192, and 256 bits are 10, 12, and 14, respectively.
Each round is composed of four transformations called “SubBytes”, “ShiftRows”, “MixColumns”, and “AddRoundKey”, which are denoted by numerals 1404, 1405, 1406, and 1407, respectively, with the exception of the final round 1408, which does not include the MixColumns transformation.
Encryption using the AES algorithm begins with an initial AddRoundKey transformation 1402. After the initial AddRoundKey transformation 1402, the first Nr−1 rounds 1403 are implemented repeatedly, which is followed by the final round 1408.
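The sequence of transformations described above may be summarized, purely as an illustrative sketch, by the following Python fragment. It lists only the order of the transformations and performs no cryptographic processing; the names `ROUNDS_BY_KEY_BITS` and `encryption_schedule` are hypothetical.

```python
# Number of rounds Nr for each key length in bits, as stated above.
ROUNDS_BY_KEY_BITS = {128: 10, 192: 12, 256: 14}


def encryption_schedule(key_bits):
    """Return the ordered list of transformation names for one block."""
    nr = ROUNDS_BY_KEY_BITS[key_bits]
    steps = ["AddRoundKey"]                      # initial AddRoundKey
    for _ in range(nr - 1):                      # first Nr-1 rounds
        steps += ["SubBytes", "ShiftRows", "MixColumns", "AddRoundKey"]
    steps += ["SubBytes", "ShiftRows", "AddRoundKey"]   # final round
    return steps


schedule = encryption_schedule(128)
print(len(schedule), schedule.count("MixColumns"), schedule.count("AddRoundKey"))
```

For a 128-bit key the list contains 40 entries, of which 9 are MixColumns and 11 are AddRoundKey, so that Nr + 1 round keys are consumed in total.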
The following is a brief explanation of the aforementioned four transformations “SubBytes”, “ShiftRows”, “MixColumns”, and “AddRoundKey”. Details of these transformations are given in FIPS 197.
The SubBytes transformation 1404 is a byte substitution on each byte of the state using a substitution table called “S-Box”. The S-Box, whose contents are disclosed in FIG. 7 of FIPS 197, is constructed by composing two transformations: taking the multiplicative inverse in the Galois field GF(2^8), and applying an affine transformation over GF(2^8).
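The two-step construction of the S-Box may be sketched in Python as follows. The sketch assumes the reduction polynomial x^8 + x^4 + x^3 + x + 1 and the affine constant {63} given in FIPS 197; the helper names are hypothetical, and the inverse is computed by repeated multiplication for clarity rather than speed.

```python
def gf_mul(a, b):
    """Multiply two bytes in GF(2^8) modulo x^8 + x^4 + x^3 + x + 1."""
    result = 0
    for _ in range(8):
        if b & 1:
            result ^= a
        b >>= 1
        a <<= 1
        if a & 0x100:
            a ^= 0x11B          # reduce when the degree reaches 8
    return result


def gf_inverse(a):
    """Multiplicative inverse in GF(2^8); zero is mapped to zero."""
    if a == 0:
        return 0
    result = 1
    for _ in range(254):        # a^254 equals a^-1 in a group of order 255
        result = gf_mul(result, a)
    return result


def rotl8(b, k):
    """Rotate a byte left by k bits."""
    return ((b << k) | (b >> (8 - k))) & 0xFF


def affine(b):
    """Affine transformation of FIPS 197 written with byte rotations."""
    return b ^ rotl8(b, 1) ^ rotl8(b, 2) ^ rotl8(b, 3) ^ rotl8(b, 4) ^ 0x63


SBOX = [affine(gf_inverse(x)) for x in range(256)]
print(hex(SBOX[0x53]))          # FIPS 197 gives {ed} for the input {53}
```

The resulting table is a permutation of the 256 byte values, which is what allows an inverse S-Box to be defined for decryption.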
In the ShiftRows transformation 1405, the bytes of the last three rows of the state are shifted over different numbers of bytes.
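As a minimal sketch, and assuming the shift amounts defined in FIPS 197 (row r of the state is cyclically shifted to the left by r bytes, so that the first row is left unchanged), the ShiftRows transformation may be written in Python as follows. The function name and the sample state are illustrative only.

```python
def shift_rows(state):
    """Cyclically shift row r of the 4x4 state left by r bytes."""
    return [row[r:] + row[:r] for r, row in enumerate(state)]


# Sample state whose element in row r, column c has the value 10*r + c.
state = [[10 * r + c for c in range(4)] for r in range(4)]
shifted = shift_rows(state)
for row in shifted:
    print(row)
```

After the shift, column 0 of the result holds the elements that were at positions (0,0), (1,1), (2,2), and (3,3) of the original state.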
The MixColumns transformation 1406 operates on the state column-by-column, treating each column as a four-term polynomial. The columns are considered as polynomials over GF(2^8) and multiplied with a fixed polynomial.
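The effect of the MixColumns transformation on a single column may be sketched in Python as follows. The sketch assumes the coefficients {02}, {03}, {01}, {01} of formula (5.6) of FIPS 197, and expresses the multiplication by {03} as a multiplication by {02} followed by an addition; the function names and the sample column are illustrative.

```python
def xtime(a):
    """Multiply a byte by {02} in GF(2^8)."""
    a <<= 1
    return (a ^ 0x11B) if a & 0x100 else a


def mul3(a):
    """Multiply a byte by {03}, i.e. {02}*a + a."""
    return xtime(a) ^ a


def mix_column(col):
    """Apply MixColumns to one column [s0, s1, s2, s3]."""
    s0, s1, s2, s3 = col
    return [
        xtime(s0) ^ mul3(s1) ^ s2 ^ s3,
        s0 ^ xtime(s1) ^ mul3(s2) ^ s3,
        s0 ^ s1 ^ xtime(s2) ^ mul3(s3),
        mul3(s0) ^ s1 ^ s2 ^ xtime(s3),
    ]


mixed = mix_column([0xDB, 0x13, 0x53, 0x45])
print([hex(b) for b in mixed])   # ['0x8e', '0x4d', '0xa1', '0xbc']
```

A column whose four bytes are equal is left unchanged, because the coefficients in each row sum to {01} in GF(2^8).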
In the AddRoundKey transformation, a Round Key, which is generated through a key expansion of a common key, is added to the state by a simple bitwise XOR operation.
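A minimal Python sketch of the AddRoundKey transformation is given below. The state and the Round Key are both assumed to be held as 4x4 arrays of bytes, and the sample values are arbitrary illustrations rather than keys produced by an actual key expansion.

```python
def add_round_key(state, round_key):
    """XOR each byte of the state with the corresponding Round Key byte."""
    return [
        [s ^ k for s, k in zip(state_row, key_row)]
        for state_row, key_row in zip(state, round_key)
    ]


state = [[16 * r + c for c in range(4)] for r in range(4)]
round_key = [[0xA5 ^ (r + 4 * c) for c in range(4)] for r in range(4)]  # arbitrary

once = add_round_key(state, round_key)
twice = add_round_key(once, round_key)
print(twice == state)            # True: XORing the same key twice cancels out
```

Because the operation is a bitwise XOR, applying the same Round Key a second time restores the original state.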
It should be noted that addition and multiplication in the MixColumns and AddRoundKey transformations are implemented over the Galois field GF(2^8). An adder for implementing addition over GF(2^8), which computes the sum of two field elements by XORing the corresponding bits, requires only a small amount of hardware, typically several logic gates. In contrast, a multiplier for implementing multiplication over GF(2^8) requires considerably more hardware, typically several tens of logic gates.
The SubBytes, ShiftRows, and MixColumns transformations 1404, 1405, and 1406, which are the components of the round, are often performed collectively to improve the throughput. There is a need for a method of performing these transformations efficiently, for the following reasons. The SubBytes and MixColumns transformations require a large amount of processing, because the SubBytes transformation includes many table lookups, and the MixColumns transformation includes multiplication over the Galois field GF(2^8). In addition, the ShiftRows and SubBytes transformations can be performed collectively, which desirably improves efficiency. It should be noted that the AddRoundKey transformation is usually performed independently, because of its simplicity and independence.
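The contrast between the two operations may be illustrated by the following Python sketch, which assumes the reduction polynomial x^8 + x^4 + x^3 + x + 1 of FIPS 197. The addition is a single bytewise XOR, whereas the multiplication is written here as a shift-and-add loop with a conditional reduction at each step; the function names are hypothetical.

```python
def gf_add(a, b):
    """Addition in GF(2^8): one XOR per bit, no carries."""
    return a ^ b


def gf_mul(a, b):
    """Multiplication in GF(2^8) by shift-and-add with modular reduction."""
    result = 0
    for _ in range(8):
        if b & 1:               # add the current multiple of a if this bit of b is set
            result ^= a
        b >>= 1
        a <<= 1                 # multiply a by x
        if a & 0x100:
            a ^= 0x11B          # subtract the reduction polynomial
    return result


print(hex(gf_add(0x57, 0x83)))  # 0xd4
print(hex(gf_mul(0x57, 0x83)))  # 0xc1
```

The two printed values correspond to the worked examples {57} + {83} = {d4} and {57} · {83} = {c1} given in FIPS 197.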
FIG. 3 is a signal flow diagram illustrating a conventional method of implementing the SubBytes, ShiftRows, and MixColumns transformations collectively for obtaining the first column of the transformation result, the first column including four elements. The first column of the transformation result is obtained from the state elements S0,0, S1,1, S2,2, and S3,3.
Although FIG. 2 illustrates that a round begins with the SubBytes transformation, the method begins with the ShiftRows transformation. It should be noted that the SubBytes and ShiftRows transformations commute, and thus the same result is obtained regardless of the order in which the SubBytes and ShiftRows transformations are performed.
The ShiftRows transformation is implemented by obtaining the associated elements using table lookups from the state. The MixColumns transformation is achieved by multiplication of the substitution values obtained from the S-box with corresponding coefficients followed by addition 1503, the coefficients being defined as disclosed in FIPS 197 (see formula (5.6)). In FIG. 3, the multiplication over GF(2^8) in hexadecimal notation is denoted by symbols “·{xy}” where xy is a hexadecimal value.
The same goes for the remaining columns of the transformation result. The second column is obtained from the elements S0,1, S1,2, S2,3, and S3,0, the third column is obtained from the elements S0,2, S1,3, S2,0, and S3,1, and the fourth column is obtained from the elements S0,3, S1,0, S2,1, and S3,2.
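The commutation follows from the fact that the SubBytes transformation acts on each byte independently while the ShiftRows transformation only moves bytes without altering them. This may be checked with the following Python sketch, in which `toy_substitution` is a hypothetical stand-in for the S-Box and is not the AES table; any bytewise substitution behaves in the same way.

```python
def toy_substitution(byte):
    # Hypothetical stand-in for the S-Box (NOT the AES table).
    return (byte * 7 + 3) % 256


def sub_bytes(state, substitute=toy_substitution):
    """Substitute every byte of the state independently."""
    return [[substitute(b) for b in row] for row in state]


def shift_rows(state):
    """Cyclically shift row r of the state left by r bytes."""
    return [row[r:] + row[:r] for r, row in enumerate(state)]


state = [[16 * r + c for c in range(4)] for r in range(4)]
sub_then_shift = shift_rows(sub_bytes(state))
shift_then_sub = sub_bytes(shift_rows(state))
print(sub_then_shift == shift_then_sub)   # True
```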
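The selection of state elements for each column of the result follows a single rule: the element taken from row i for column j of the result is the one located at column (i + j) mod 4 of the state. A short Python sketch of this rule, with a hypothetical function name, is as follows.

```python
def source_elements(column):
    """Return the (row, column) indices of the state elements that
    contribute to the given column of the transformation result."""
    return [(row, (row + column) % 4) for row in range(4)]


for column in range(4):
    print(column, source_elements(column))
```

The four printed lists reproduce the element groups enumerated above for the first to fourth columns.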
In order to improve the throughput, conventional hardware for implementing AES rounds is often provided with a plurality of S-boxes to perform parallel processing. For example, Seike et al. disclose a Rijndael processor performing parallel processing of all the 16 elements of the state using 16 S-Boxes having the same content in “Trial produce of the AES cryptography using FPGA,” p. 13, Technical Report of IEICE, VLD2001-91, ICD2001-136, PTS2001-38, November 2001. Schaumont et al. disclose a similar Rijndael processor having 16 S-Boxes in “Unlocking the design secrets of a 2.29 Gb/s Rijndael processor,” Design Automation Conference, 2002. Proceedings. 39th, 2002, pp. 634-639.
McLoone et al. disclose a look-up table based Rijndael design to achieve an improved speed in Signal Processing Systems, 2001 IEEE Workshop on, 2001, pp. 349-360. The design implements not only the SubBytes transformation but also the ShiftRows and MixColumns transformations as look-up tables (LUTs). FIG. 4 is a signal flow diagram illustrating the procedure of implementing the SubBytes, ShiftRows, and MixColumns transformations for the first column of the transformation result. The design includes two additional LUTs in place of the Galois field multipliers 1502; one containing the values of the SubBytes LUT multiplied in GF(2^8) by the hexadecimal number “02”, the other containing the values of the SubBytes LUT multiplied in GF(2^8) by the hexadecimal number “03”. These additional LUTs are used to perform parallel processing.
The round of the AES algorithm may be implemented by software. Gladman discloses source code for implementing the AES algorithm in “Implementations of AES (Rijndael) in C/C++ and Assembler,” http://fp.gladman.plus.com/cryptography_technology/rijndael.
FIG. 5 illustrates Gladman's method for implementing the round of the AES algorithm. The method involves preparing “expanded” S-Boxes #0 to #3, each of which consists of 256 32-bit words, in a main memory. Each 32-bit word of the expanded S-Box #0 contains bits #0 to #31, wherein the bits #0 to #7 contain the corresponding value of the SubBytes S-box multiplied by the hexadecimal number “02”, the bits #8 to #15 contain the corresponding value of the SubBytes S-box multiplied by the hexadecimal number “01”, the bits #16 to #23 contain the corresponding value of the SubBytes S-box multiplied by the hexadecimal number “01”, and the bits #24 to #31 contain the corresponding value of the SubBytes S-box multiplied by the hexadecimal number “03”. Correspondingly, each 32-bit word of the expanded S-Box #1 contains sequences of bits #0 to #7, #8 to #15, #16 to #23, and #24 to #31, which respectively contain the corresponding values of the SubBytes S-box multiplied by the hexadecimal numbers “03”, “02”, “01”, and “01”; each 32-bit word of the expanded S-Box #2 contains the same sequences of bits, which respectively contain the corresponding values of the SubBytes S-box multiplied by the hexadecimal numbers “01”, “03”, “02”, and “01”; and each 32-bit word of the expanded S-Box #3 contains the same sequences of bits, which respectively contain the corresponding values of the SubBytes S-box multiplied by the hexadecimal numbers “01”, “01”, “03”, and “02”.
The “expanded” S-Boxes #0 to #3 enable SIMD (single instruction multiple data) processing for implementing the SubBytes, ShiftRows, and MixColumns transformations with a reduced amount of processing, which includes only four table lookups to the expanded S-Boxes and four additions 1703 in GF(2^8). This allows Gladman's method to achieve an improved speed.
The aforementioned cipher transformations can be inverted and then implemented in reverse order to achieve decryption for the AES algorithm.
FIG. 6 is a flowchart illustrating decryption according to the AES algorithm. The decryption begins with an initial AddRoundKey transformation 1802. After the initial AddRoundKey transformation 1802, the first to (Nr−1)-th rounds 1803 are implemented, followed by a final round 1808. Each round includes InvSubBytes, InvShiftRows, InvMixColumns, and AddRoundKey′ transformations 1804, 1805, 1806, and 1807, with the exception of the final round 1808, which does not include the InvMixColumns transformation. The InvSubBytes, InvShiftRows, InvMixColumns, and AddRoundKey′ transformations 1804, 1805, 1806, and 1807 are the inverses of the SubBytes, ShiftRows, MixColumns, and AddRoundKey transformations 1404, 1405, 1406, and 1407, respectively. It should be noted that the AddRoundKey transformation is its own inverse; the prime is attached merely to distinguish the inverse from the AddRoundKey transformation.
It should be noted that the inverse transformations are not implemented in exactly the reverse order of the cipher transformations; the order of the inverse transformations is optimized to improve efficiency.
First, the InvShiftRows and InvSubBytes transformations 1805 and 1804 are interchanged. This interchange is effective for improving the processing speed while leaving the transformation result unchanged. It should be noted that the InvShiftRows and InvSubBytes transformations commute.
Second, the AddRoundKey′ and InvMixColumns transformations 1807 and 1806 are interchanged. In the reverse order of the cipher transformations, the InvMixColumns transformation 1806 would operate on the result of the AddRoundKey′ transformation 1807; however, the order is modified to improve efficiency. It should be noted that the interchange of the AddRoundKey′ and InvMixColumns transformations requires that expanded keys that have gone through the InvMixColumns transformation 1806 be used for the AddRoundKey′ transformation 1807 in place of the original expanded keys.
As illustrated in FIG. 4 and FIG. 6, rounds for AES-based encryption are similar to those for decryption; the differences are the contents of the S-boxes used for the SubBytes and InvSubBytes transformations 1404 and 1804, and the coefficients used for the MixColumns and InvMixColumns transformations 1406 and 1806. The InvSubBytes transformation 1804 uses the inverse S-box, which is defined as illustrated in FIG. 14 of FIPS 197, while the InvMixColumns transformation 1806 uses the coefficients described in formula (5.10) of FIPS 197.
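The requirement concerning the expanded keys follows from the linearity of the InvMixColumns transformation over GF(2^8): applying it to the XOR of a column and a key column gives the same result as XORing the separately transformed column and key column. The following Python sketch checks this property for one column, assuming the coefficients {0e}, {0b}, {0d}, {09} of formula (5.10) of FIPS 197; the function names and the sample key column are illustrative only.

```python
def gf_mul(a, b):
    """Multiply two bytes in GF(2^8) modulo x^8 + x^4 + x^3 + x + 1."""
    result = 0
    for _ in range(8):
        if b & 1:
            result ^= a
        b >>= 1
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
    return result


# First row of the InvMixColumns matrix; row i is this row rotated right by i.
INV_COEFFS = [0x0E, 0x0B, 0x0D, 0x09]


def inv_mix_column(col):
    """Apply InvMixColumns to one column of four bytes."""
    out = []
    for i in range(4):
        acc = 0
        for j in range(4):
            acc ^= gf_mul(INV_COEFFS[(j - i) % 4], col[j])
        out.append(acc)
    return out


def xor_columns(a, b):
    """Bytewise XOR of two columns (AddRoundKey on a single column)."""
    return [x ^ y for x, y in zip(a, b)]


data = [0x8E, 0x4D, 0xA1, 0xBC]
key = [0x01, 0x23, 0x45, 0x67]          # arbitrary illustrative key column

key_then_mix = inv_mix_column(xor_columns(data, key))
mix_then_key = xor_columns(inv_mix_column(data), inv_mix_column(key))
print(key_then_mix == mix_then_key)     # True
```

In the second ordering the key column has itself passed through the InvMixColumns transformation, which is why modified expanded keys are needed once the two transformations are interchanged.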
One of the drawbacks of the conventional encryption and decryption architectures is that they require increased hardware for improving processing speed. The conventional architectures use a plurality of lookup tables to achieve parallel processing; however, the increase in the number of lookup tables causes an undesirable increase in hardware. Even a single S-box containing 256 8-bit words requires several thousand logic gates.
Therefore, there is a need for providing apparatus and method for implementing the AES algorithm with reduced hardware and sufficient throughput.
3. List of Other Prior Art Documents
The following is a description of conventional Galois field processors. An m-bit multiplier module for multiplication over a Galois field GF(2^m) is disclosed in Japanese Open Laid Patent Application No. Jp-A 2002-23999. Galois field processors for computing the multiplicative inverse in GF(2^8) are disclosed in Japanese Open Laid Patent Applications No. Jp-A 2000-322280A, and Jp-A-Heisei 11-249921. An error correction circuit including Galois field processors, each of which has a Galois field adder and multiplier, is disclosed in Japanese Open Laid Patent Application No. Jp-A-Showa 63-186338.