The present invention relates in general to security processing systems and, more specifically, to an apparatus and method for hash processing using multiple hash storage areas for reading and writing data during hash processing.
Hash functions have been widely used in modern cryptography to produce compressed data, message digests, fingerprints, and checksums, among other things. A hash function is a mathematical function that takes a variable-length input string, and converts it to a fixed-length output string. The output string is called a hash value, which typically is smaller than the input string. A “one-way” hash function is a hash function that works in one direction, meaning that it is easy to compute a hash value from an input string, but it is difficult to generate a second input string that hashes to the same value. Bruce Schneier, Applied Cryptography, at 429–59 (1996) includes a detailed discussion of various one-way hash algorithms.
In most modern security applications that implement hashing, the hash algorithms used are the SHA1 algorithm as defined in FIPS PUB 180-1, the MD5 algorithm as defined in RFC 1321, and HMAC-SHA1 and HMAC-MD5 as defined in RFC 2104, all of which are incorporated herein by reference in full. These algorithms compute a signature or message digest of a sequence of bytes.
The MD5 and SHA1 hashing algorithms each require a temporary working memory of at least sixteen 32-bit words. The algorithms operate on an input data stream in blocks of 64 bytes. If the input data stream is not a multiple of 64 bytes, such as may occur when processing the last portion of data for a data packet, the algorithms define a procedure for implicit padding.
Typically, the temporary working memory is filled with 64-byte blocks of the input data stream. If the last block of input data for a data packet is less than 64 bytes, then the temporary working memory is filled with implicit padding as defined by the algorithms.
SHA1 Algorithm
As mentioned above, a commonly used, one-way hash algorithm is the “Secure Hash Algorithm,” or “SHA1,” which was developed by the National Institute of Standards and Technology (NIST) and the National Security Agency (NSA). SHA1 is described in detail in the Federal Information Processing Standards Publication 180-1 (Apr. 17, 1995) (FIPS PUB 180-1), issued by NIST.
The federal government requires SHA1 to be used with its standardized “Digital Signature Algorithm” (DSA), which computes a signature for the message from a message digest. In addition, the federal government requires SHA1 to be used whenever a secure hash algorithm is required for a federal application, and encourages its use by private and commercial organizations. Accordingly, the use of SHA1 has become extremely common for applications that need a one-way hash algorithm.
When an input message of any length less than 2^64 bits is input into SHA1, the algorithm produces a 160-bit output called a “message digest.” SHA1 sequentially processes message blocks of 512 bits when computing a message digest. If a message is not a multiple of 512 bits, then SHA1 first pads the message to make the message a multiple of 512 bits. The padded message is then processed by SHA1 as n 512-bit blocks, M1, . . . , Mn, where each block is composed of sixteen 32-bit words, L0, L1, . . . , L15.
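The padding and block division described above can be sketched as follows. This is a minimal illustration, not the circuit described in this document: the function names `sha1_pad` and `split_blocks` are hypothetical, and the padding layout (a single 1 bit, zero bits, then the 64-bit big-endian message length) is taken from FIPS PUB 180-1 rather than from the text above. Note that the standard appends this padding to every message, including one whose length is already a multiple of 512 bits.

```python
import struct

BLOCK_BYTES = 64  # one 512-bit message block


def sha1_pad(message: bytes) -> bytes:
    """Pad a message per FIPS PUB 180-1 (assumed layout, see lead-in)."""
    bit_length = len(message) * 8
    padded = message + b"\x80"  # a 1 bit followed by seven 0 bits
    # Zero-fill so that 8 bytes remain in the final block for the length field.
    padded += b"\x00" * ((56 - len(padded) % BLOCK_BYTES) % BLOCK_BYTES)
    padded += struct.pack(">Q", bit_length)  # 64-bit big-endian bit length
    return padded


def split_blocks(padded: bytes) -> list:
    """Split a padded message into blocks of sixteen 32-bit words L0..L15."""
    return [
        list(struct.unpack(">16I", padded[i:i + BLOCK_BYTES]))
        for i in range(0, len(padded), BLOCK_BYTES)
    ]
```

For example, the 3-byte message `abc` pads to a single block whose first word L0 is 0x61626380 and whose last word L15 is 24, the message length in bits.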
The message digest computation uses two buffers, each consisting of five 32-bit registers, and a sequence of eighty 32-bit words. The registers of the first 5-word buffer are labeled ha, hb, hc, hd, and he, and the registers of the second 5-word buffer are labeled h0, h1, h2, h3, and h4. The words of the 80-word sequence are derived from the sixteen 32-bit words in the message block, and are labeled W0, W1, . . . , W79. A single word register, TEMP, is also employed.
One “round,” t, is performed during each iteration of SHA1, where a round is defined as a calculation that operates on one word, Wt, of the 80-word sequence, referred to as the “input sequence.” Accordingly, the processing of each block involves eighty iterations. Because each iteration takes one clock cycle, the processing of each block uses eighty clock cycles.
During the eighty iterations, SHA1 uses a sequence of eighty non-linear functions (NLF), f0, f1, . . . , f79. Each function, ft, 0<=t<=79, operates on three 32-bit words, and produces a 32-bit word as output. SHA1 also uses a sequence of constant words, K0, . . . , K79, during the eighty iterations. ft (X, Y, Z) is defined as follows:
ft(X,Y,Z)=(X AND Y) OR ((NOT X) AND Z) (0<=t<=19)
ft(X,Y,Z)=X XOR Y XOR Z (20<=t<=39)
ft(X,Y,Z)=(X AND Y) OR (X AND Z) OR (Y AND Z) (40<=t<=59)
ft(X,Y,Z)=X XOR Y XOR Z (60<=t<=79).
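As an illustration, the four definitions of ft above can be written as a single function selected by the round number t. This is a sketch only; the names `f` and `MASK32` are hypothetical, and the mask is an artifact of Python integers being unbounded rather than 32 bits wide.

```python
MASK32 = 0xFFFFFFFF  # keep results within one 32-bit word


def f(t: int, x: int, y: int, z: int) -> int:
    """Non-linear function ft(X, Y, Z) for round t, 0 <= t <= 79."""
    if 0 <= t <= 19:
        return ((x & y) | (~x & z)) & MASK32          # select y or z by x
    if 20 <= t <= 39 or 60 <= t <= 79:
        return (x ^ y ^ z) & MASK32                   # bitwise parity
    if 40 <= t <= 59:
        return ((x & y) | (x & z) | (y & z)) & MASK32  # bitwise majority
    raise ValueError("t must be in the range 0..79")
```

In rounds 0 to 19 each bit of X selects the corresponding bit of Y or Z, and in rounds 40 to 59 each output bit is the majority of the three input bits.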
In hex, the constant words Kt are given by:
Kt=5A827999 (0<=t<=19)
Kt=6ED9EBA1 (20<=t<=39)
Kt=8F1BBCDC (40<=t<=59)
Kt=CA62C1D6 (60<=t<=79)
To generate the message digest, first the h0, h1, h2, h3, h4 registers are initialized to a predetermined set of initialization values. Specifically, registers h0, h1, h2, h3, and h4 are initialized to the following values, in hex:
h0=67452301
h1=EFCDAB89
h2=98BADCFE
h3=10325476
h4=C3D2E1F0.
The creation of the message digest then involves the following operations, where each of the blocks, M1, M2, . . . , Mn, is processed in order and “+” denotes addition modulo 2^32:
1) Divide Mx into sixteen 32-bit words, L0, L1, . . . , L15, where L0 is the leftmost word, and Mx is the next message block to be processed.
2) Let register ha=h0, hb=h1, hc=h2, hd=h3, and he=h4.
3) For t=0 to 15, let Wt=Lt; and
    for t=16 to 79, let Wt=S1(W(t-3) XOR W(t-8) XOR W(t-14) XOR W(t-16)),
    where SX indicates a left circular shift by X bits.
4) For t=0 to 79,
    TEMP=S5(ha)+ft(hb,hc,hd)+he+Wt+Kt;
    he=hd; hd=hc; hc=S30(hb); hb=ha; ha=TEMP.
5) Let h0=h0+ha; h1=h1+hb; h2=h2+hc; h3=h3+hd; h4=h4+he.
Repeat steps 1–5 for the next block. After processing the last block, Mn, the message digest is the 160-bit string represented by the five words h0, h1, h2, h3, h4.
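The steps above can be sketched in software as follows. This is a minimal, unoptimized illustration under stated assumptions, not the hardware implementation discussed in this document: the names `sha1`, `sha1_block`, `rotl`, and `f` are hypothetical, the register assignments of step 4 are performed simultaneously (each register receives the previous value of its neighbor), and the padding layout is taken from FIPS PUB 180-1.

```python
import struct

MASK32 = 0xFFFFFFFF
H_INIT = (0x67452301, 0xEFCDAB89, 0x98BADCFE, 0x10325476, 0xC3D2E1F0)
K = (0x5A827999, 0x6ED9EBA1, 0x8F1BBCDC, 0xCA62C1D6)  # one constant per 20 rounds


def rotl(x: int, n: int) -> int:
    """Left circular shift of a 32-bit word by n bits (S^n in the text)."""
    return ((x << n) | (x >> (32 - n))) & MASK32


def f(t: int, x: int, y: int, z: int) -> int:
    """Non-linear function ft for round t."""
    if t < 20:
        return (x & y) | (~x & z)
    if 40 <= t < 60:
        return (x & y) | (x & z) | (y & z)
    return x ^ y ^ z


def sha1_block(h: tuple, block: bytes) -> tuple:
    """Steps 1 to 5 for one 64-byte block; returns the updated h0..h4."""
    w = list(struct.unpack(">16I", block))                 # step 1: L0..L15
    for t in range(16, 80):                                # step 3: W16..W79
        w.append(rotl(w[t - 3] ^ w[t - 8] ^ w[t - 14] ^ w[t - 16], 1))
    ha, hb, hc, hd, he = h                                 # step 2
    for t in range(80):                                    # step 4: 80 rounds
        temp = (rotl(ha, 5) + f(t, hb, hc, hd) + he + w[t] + K[t // 20]) & MASK32
        ha, hb, hc, hd, he = temp, ha, rotl(hb, 30), hc, hd
    # step 5: add the working registers back into the digest registers
    return tuple((x + y) & MASK32 for x, y in zip(h, (ha, hb, hc, hd, he)))


def sha1(message: bytes) -> bytes:
    """Pad the message, process every block in order, return the 160-bit digest."""
    padded = message + b"\x80"
    padded += b"\x00" * ((56 - len(padded) % 64) % 64)
    padded += struct.pack(">Q", len(message) * 8)
    h = H_INIT
    for i in range(0, len(padded), 64):
        h = sha1_block(h, padded[i:i + 64])
    return struct.pack(">5I", *h)
```

Each pass through the loop in step 4 corresponds to one round, so a design that performs one round per clock spends eighty clocks in `sha1_block` for every 64 bytes of input.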
In many cases, the SHA1 algorithm is performed within an application specific integrated circuit (ASIC), where the operations are performed using hardware-implemented logic gates. A hardware implementation of the SHA1 algorithm requires five registers for the 32-bit digest variables h0, h1, h2, h3, h4, which are initialized at start to constant values. It also uses registers for temporary cycle variables ha, hb, hc, hd, he, which have their initial values loaded from the five registers for h0, h1, h2, h3, h4, respectively. Eighty rounds of the hashing operation then change the ha, hb, hc, hd, he register values. Finally, after the 80 rounds, the h0, h1, h2, h3, h4 variables are incremented by ha, hb, hc, hd, he, respectively. In each round of SHA1 operation, data is read from and written to the temporary working memory. Typically, in prior implementations, each hash operation over 64 bytes takes 80 clocks, one for each round of SHA1.
MD5 Algorithm
Another commonly used, one-way hash algorithm is “MD5”, where MD stands for “message digest.” MD5 was developed by Ronald L. Rivest, and described in his paper entitled “The MD5 Message-Digest Algorithm,” RFC 1321 (April 1992).
When an arbitrarily large input message is input into MD5, the algorithm produces a 128-bit output called a “fingerprint” or “message digest” of the input message. MD5 sequentially processes message blocks of 512 bits when computing a message digest. If a message is not a multiple of 512 bits, then MD5 first pads the message to make the message a multiple of 512 bits. The padded message is then processed by MD5 as n 512-bit blocks, M1, . . . , Mn, where each block is composed of sixteen 32-bit sub-blocks, Wj, 0<=j<=15. The main loop of MD5 processes each 512-bit block one at a time, and continues for as many 512-bit blocks as are in the message. The output of the algorithm is a set of four 32-bit words, which concatenate to form a single 128-bit message digest. A four-word temporary buffer (ha, hb, hc, hd) is used to compute the message digest in four so-called rounds of computation, where each of ha, hb, hc, and hd is a 32-bit register. A four-word digest buffer (h0, h1, h2, h3) is used to accumulate the results from each round, and registers h0, h1, h2, and h3 are initialized to particular values as defined in the MD5 algorithm.
The main loop of MD5 has four “rounds,” where each round includes sixteen operations. Accordingly, sixty-four operations, i (0<=i<=63), are performed for each message block.
During each operation, a non-linear function (NLF) is performed on three of four 32-bit variables stored in ha, hb, hc, and hd. Then, the operation adds the NLF output to the fourth variable, a sub-block, Wj, of the message, and a constant word, ti. The operation then performs a left circular shift of a variable number of bits, si, and adds the result to the contents of one of ha, hb, hc or hd. Finally, that sum replaces the contents of one of ha, hb, hc or hd, and the next operation is performed. The NLF used for the operations in each round (i.e., each set of 16 sequential operations) is different from the NLF used in the previous round.
After the fourth round, ha, hb, hc, and hd are added to h0, h1, h2, and h3, respectively, and the main loop repeats for the next message block, until the last block, Mn, has been processed. After processing the last block, the message digest is the 128 bit string represented by the concatenated words stored in h0, h1, h2, and h3.
MD5 can be performed by software, or within an application specific integrated circuit (ASIC), where the operations are performed using hardware-implemented logic gates. During one operation, a non-linear function (NLFi) is applied to three of the variables stored in registers ha, hb, hc, and hd. The three variables input into the NLF are the variables stored in hb, hc, and hd, although the input variables could differ for other rounds. The result is added, by a first full adder, to the contents of register ha. A second full adder adds the output of the first full adder to the appropriate sub-block, Wj, for the round and operation being performed. A third full adder then adds the output of the second full adder to the appropriate constant word, ti, for the round and operation being performed.
A shifter then circularly left shifts the output of the third full adder by the appropriate number of bits, si, for the round and operation being performed. Finally, the contents of register hb are added, by a fourth full adder, to the output of the shifter. The output of the fourth full adder is then placed in register ha, for use during the next operation. The next operation will then use a different message sub-block, Wj, constant word, ti, and number of shifts, si, in the left circular shift operation, as well as a different set of three variables to be operated on by the NLF. In addition, the next operation may (or may not) use a different NLF.
During the four rounds associated with one message block, the logic blocks are cycled through sixty-four times. Further, the total number of cycles through the logic is 64n, where n is the number of 512-bit blocks in the message. Each cycle through the logic corresponds to one clock cycle. The clock frequency is limited by the various delays associated with the gates and other logical components. The logic depth of the operation is rather substantial, because the logic includes computationally complex full adders, among other elements. The cumulative delay associated with this design is long, and consequently the clock frequency must be fairly low.
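As a worked example of the 64n cycle count, the sketch below estimates the number of blocks n and the resulting cycles for a message of a given length. It assumes the RFC 1321 padding overhead of one marker byte plus an 8-byte length field, and one operation per clock as described above. The function names are hypothetical.

```python
def block_count(message_bytes: int) -> int:
    """Number of 512-bit blocks n after padding (1 marker byte + 8 length bytes)."""
    return (message_bytes + 8) // 64 + 1


def md5_cycles(message_bytes: int, cycles_per_block: int = 64) -> int:
    """Total cycles through the logic, 64n, assuming one operation per clock."""
    return cycles_per_block * block_count(message_bytes)
```

Under these assumptions a 1500-byte packet pads to 24 blocks and therefore needs 1536 cycles through the logic, excluding any cycles spent loading the working memory.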
Now describing the MD5 algorithm in more detail and as mentioned above, the four-word digest buffer (h0, h1, h2, h3) is used to compute the message digest, where each of h0, h1, h2, and h3 is a 32-bit register. These registers are initialized to particular values, which are the same initialization values as are used in the standard MD5 implementation.
As described previously, the main loop of MD5 has four rounds, t (0<=t<=3), where each round includes sixteen operations. Accordingly, sixty-four operations, i (0<=i<=63), are performed for each message block.
During each operation, a non-linear function (NLF) is performed on three of four 32-bit variables stored in ha, hb, hc, and hd. Then, the operation adds the NLF output to the fourth variable, a sub-block, Wj, of the message, and a constant word, ti. The operation then performs a left circular shift of a variable number of bits, si, and adds the result to the contents of one of ha, hb, hc or hd. Finally, that sum replaces the contents of hb. The other registers are updated as ha=hd; hd=hc; hc=hb; and hb=sum.
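A single operation with the register update described above might be sketched as follows. The function name `md5_operation` and the tuple representation of the registers are hypothetical, and the round 1 function is used as the default NLF purely for illustration.

```python
MASK32 = 0xFFFFFFFF


def rotl(x: int, s: int) -> int:
    """Left circular shift of a 32-bit word by s bits."""
    return ((x << s) | (x >> (32 - s))) & MASK32


def nlf_round1(x: int, y: int, z: int) -> int:
    """Round 1 non-linear function: (X AND Y) OR ((NOT X) AND Z)."""
    return (x & y) | (~x & z)


def md5_operation(regs, w_j, t_i, s_i, nlf=nlf_round1):
    """Perform one MD5 operation and return the updated (ha, hb, hc, hd)."""
    ha, hb, hc, hd = regs
    total = (ha + nlf(hb, hc, hd) + w_j + t_i) & MASK32  # NLF + ha + Wj + ti
    new_hb = (hb + rotl(total, s_i)) & MASK32            # shift, then add hb
    # Register update: ha=hd; hd=hc; hc=hb; hb=sum
    return (hd, new_hb, hb, hc)
```

Because the registers are rotated after every operation, the same expression can be applied sixty-four times without renaming its arguments.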
The NLF used for the operations in each round (i.e., each set of 16 sequential operations) is different from the NLF used in the previous round. Each NLF takes as input three 32-bit words and produces as output one 32-bit word. The four NLFs are defined as follows, and are the same as the NLFs used in the standard MD5 implementation:
F(X,Y,Z)=(X AND Y) OR ((NOT X) AND Z) (for round 1: 0<=i<=15)
G(X,Y,Z)=(X AND Z) OR (Y AND (NOT Z)) (for round 2: 16<=i<=31)
H(X,Y,Z)=X XOR Y XOR Z (for round 3: 32<=i<=47)
I(X,Y,Z)=Y XOR (X OR (NOT Z)) (for round 4: 48<=i<=63).
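For illustration, the four NLFs can be expressed directly as bitwise operations on 32-bit words, with the function chosen by the operation index i. The helper `nlf_for_operation` and the constant `MASK32` are hypothetical names. The mask is needed only because Python integers are not limited to 32 bits.

```python
MASK32 = 0xFFFFFFFF


def F(x: int, y: int, z: int) -> int:
    return ((x & y) | (~x & z)) & MASK32


def G(x: int, y: int, z: int) -> int:
    return ((x & z) | (y & ~z)) & MASK32


def H(x: int, y: int, z: int) -> int:
    return (x ^ y ^ z) & MASK32


def I(x: int, y: int, z: int) -> int:
    return (y ^ (x | (~z & MASK32))) & MASK32


def nlf_for_operation(i: int):
    """Return the NLF used by operation i (0 <= i <= 63), one per round of 16."""
    return (F, G, H, I)[i // 16]
```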
The main loop of the MD5 algorithm is performed as described below. First, the values in the four registers of the buffer (h0, h1, h2, h3) are copied into four 32-bit variables ha, hb, hc, and hd, so that ha=h0, hb=h1, hc=h2, and hd=h3.
Each of the four rounds is then performed by applying the following logic, which is the same logic as is used in the standard MD5 implementation. In the functions below, Wj represents the jth sub-block of the message (0<=j<=15), <<<s represents a left circular shift of s bits, and “+” denotes the addition of words.
Round 1: For i=0 to 15, FF(ha,hb,hc,hd,Wj,s,ti) denotes the operation
    ha=hb+((ha+F(hb,hc,hd)+Wj+ti)<<<s).
Round 2: For i=16 to 31, GG(ha,hb,hc,hd,Wj,s,ti) denotes the operation
    ha=hb+((ha+G(hb,hc,hd)+Wj+ti)<<<s).
Round 3: For i=32 to 47, HH(ha,hb,hc,hd,Wj,s,ti) denotes the operation
    ha=hb+((ha+H(hb,hc,hd)+Wj+ti)<<<s).
Round 4: For i=48 to 63, II(ha,hb,hc,hd,Wj,s,ti) denotes the operation
    ha=hb+((ha+I(hb,hc,hd)+Wj+ti)<<<s).
During each round, the three variables operated upon by the NLF, the message sub-block, Wj, the constant word, ti, and the number of shifts, si, in the left circular shift operation change from operation to operation. The 64 operations are performed sequentially, in the order defined in the standard MD5 implementation.
After Round 4, ha, hb, hc, and hd are added to the then current contents of h0, h1, h2, and h3, respectively. The main loop then repeats for the next message block, until the last block, Mn, has been processed. After processing the last block, the message digest is the 128-bit string represented by the concatenated words stored in h0, h1, h2, and h3.
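The main loop described above can be sketched in software as follows. This is an illustrative, unoptimized sketch rather than the implementation discussed in this document. The initialization values, the constants ti (derived from the sine function), the shift amounts si, the sub-block ordering j, and the little-endian padding layout are not given in the text above and are taken from RFC 1321. The names `md5`, `md5_block`, `T`, and `S` are hypothetical.

```python
import math
import struct

MASK32 = 0xFFFFFFFF
H_INIT = (0x67452301, 0xEFCDAB89, 0x98BADCFE, 0x10325476)  # per RFC 1321
T = [int(abs(math.sin(i + 1)) * 2**32) & MASK32 for i in range(64)]  # constants ti
S = ([7, 12, 17, 22] * 4 + [5, 9, 14, 20] * 4 +
     [4, 11, 16, 23] * 4 + [6, 10, 15, 21] * 4)                      # shifts si


def rotl(x: int, s: int) -> int:
    """Left circular shift of a 32-bit word by s bits."""
    return ((x << s) | (x >> (32 - s))) & MASK32


def md5_block(h: tuple, block: bytes) -> tuple:
    """Run the 64 operations on one 64-byte block; return the updated h0..h3."""
    w = struct.unpack("<16I", block)        # sixteen 32-bit sub-blocks Wj
    ha, hb, hc, hd = h
    for i in range(64):
        if i < 16:                          # round 1: F
            nlf, j = (hb & hc) | (~hb & hd), i
        elif i < 32:                        # round 2: G
            nlf, j = (hb & hd) | (hc & ~hd), (5 * i + 1) % 16
        elif i < 48:                        # round 3: H
            nlf, j = hb ^ hc ^ hd, (3 * i + 5) % 16
        else:                               # round 4: I
            nlf, j = hc ^ (hb | (~hd & MASK32)), (7 * i) % 16
        total = (ha + nlf + w[j] + T[i]) & MASK32
        # Register update: ha=hd; hd=hc; hc=hb; hb=sum
        ha, hb, hc, hd = hd, (hb + rotl(total, S[i])) & MASK32, hb, hc
    return tuple((x + y) & MASK32 for x, y in zip(h, (ha, hb, hc, hd)))


def md5(message: bytes) -> bytes:
    """Pad the message, process every block in order, return the 128-bit digest."""
    padded = message + b"\x80"
    padded += b"\x00" * ((56 - len(padded) % 64) % 64)
    padded += struct.pack("<Q", (len(message) * 8) & 0xFFFFFFFFFFFFFFFF)
    h = H_INIT
    for k in range(0, len(padded), 64):
        h = md5_block(h, padded[k:k + 64])
    return struct.pack("<4I", *h)
```

Unlike SHA1, the loop only reads the sixteen sub-blocks and never writes derived words back, which matches the read-only use of the temporary working memory during MD5 processing.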
As mentioned above, the MD5 algorithm requires four 32-bit digest variables h0, h1, h2, h3, which are initialized at start to constant values. It also uses temporary cycle variables ha, hb, hc, hd, which load their initial values from h0, h1, h2, h3, respectively. The 64 hashing operations (four rounds of sixteen operations each) change the ha, hb, hc, hd values during processing. Finally, after the 64 operations, the h0, h1, h2, h3 variables are incremented by ha, hb, hc, hd, respectively. In each MD5 operation, only read operations are performed on the temporary working memory. Typically, each hash operation over 64 bytes takes 64 clocks, one for each operation of MD5.
Temporary Working Memory Usage
Typically, for each hash operation over 64 bytes, including the SHA1 or MD5 operations described above, the hash blocks are idle while the temporary working memory is filled again with the next 64 bytes of input data. This filling operation may typically take 16 clocks to write 16 words of 32-bit data. Since the hash operation takes several clock cycles to complete, the next 64 bytes of the input data stream are generally accumulated in a buffer so that they can be loaded into the temporary working memory as soon as the hash operation is complete. The idle cycles during this load reduce the achievable bandwidth of a hash circuit or block below its ideal value. It would be desirable to have a hash block that substantially eliminates this loss of bandwidth.
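The cost of these idle cycles can be estimated with simple arithmetic. The sketch below assumes, as in the prior implementations described above, that the 16-clock load does not overlap with hashing, and it uses the per-block clock counts stated earlier (80 for SHA1, 64 for MD5). The function name `hash_utilization` is hypothetical.

```python
def hash_utilization(hash_clocks: int, load_clocks: int = 16) -> float:
    """Fraction of clocks spent hashing when loading and hashing do not overlap."""
    return hash_clocks / (hash_clocks + load_clocks)


sha1_util = hash_utilization(80)  # 80 hashing clocks out of 96 per block
md5_util = hash_utilization(64)   # 64 hashing clocks out of 80 per block
```

Under these assumptions the hash logic is busy for roughly 83% of the clocks for SHA1 and 80% for MD5, so about one sixth to one fifth of the ideal bandwidth is lost to loading the working memory.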
In standard IPSec and SSL/TLS applications, both encryption and hashing operations are performed. For shorter data packets, the hashing operations typically require an order of magnitude greater number of clock cycles than the corresponding ciphering operations. If several hash blocks are used for the same ciphering channel, then more buffering and working memory resources are required for the hashing operations because of the use of separate buffers and working memory. It would be desirable to have a hash block that implements two (or more) hash channels for each ciphering channel while reducing the size of the required buffering and working memory resources.
Implicit Padding
As mentioned above, implicit padding is defined for both the SHA1 and MD5 algorithms. Prior hash circuits typically perform this padding after a block of data has been loaded into the temporary working memory. This padding adds additional clock cycles of processing time for each 64-byte block of data to be hashed. It would be desirable to avoid the idle clock cycles required for loading the temporary working memory with padding bytes so that bandwidth through the hash block could be increased.
Need for Improved Hash Processing System
As the volume of data to be processed increases, communication systems place increasing demands on the computation speed of cryptographic algorithms. Thus, there is a need for an improved hash processing system that handles SHA1 and MD5 hash operations in fewer effective clock cycles, that makes improved usage of buffering and working memory resources, and that reduces the time dedicated to performing padding prior to hashing of data.