The present invention relates generally to the field of cryptography, and more specifically to an architecture and method for cryptography acceleration. In particular, the invention is directed to a hardware implementation to increase the speed at which authentication procedures may be performed on data packets transmitted over a computer network.
Many methods to perform cryptography are well known in the art and are discussed, for example, in Applied Cryptography, Bruce Schneier, John Wiley & Sons, Inc. (1996, 2nd Edition), herein incorporated by reference. In order to improve the speed of cryptography processing, specialized cryptography accelerator chips have been developed. Cryptography accelerator chips may be included in routers or gateways, for example, in order to provide automatic IP packet encryption/decryption. By embedding cryptography functionality in network hardware, both system performance and data security are enhanced. 
Cryptography protocols typically incorporate both encryption/decryption and authentication functionalities. Encryption/decryption relates to enciphering and deciphering data, whereas authentication is concerned with data integrity, including confirming the identity of the transmitting party and ensuring that a data packet has not been tampered with en route to the recipient. It is known that by incorporating both encryption and authentication functionalities in a single accelerator chip, overall system performance can be enhanced.
Examples of cryptography protocols which incorporate encryption/decryption and authentication functionalities include SSL (Netscape Communications Corporation), commonly used in electronic commerce transactions, and the more recently promulgated industry security standard known as “IPSec.” These protocols and their associated algorithms are well known in the cryptography art and are described in detail in National Institute of Standards and Technology (NIST), IETF and other specifications, some of which are identified (for example, by IETF RFC#) below for convenience. These specifications are incorporated herein by reference for all purposes.
SSL (v3) uses a variant of HMAC (RFC2104) for authentication. The underlying hash algorithm can be either MD5 (RFC1321) or SHA1 (NIST). In addition, the key generation algorithm in SSL also relies on a sequence of MD5 and SHA1 operations. SSL deploys algorithms such as RC4, DES, and triple DES for encryption/decryption operations.
The IP layer security standard protocol, IPSec (RFC2406), specifies two standard algorithms for performing authentication operations, HMAC-MD5-96 (RFC2403) and HMAC-SHA1-96 (RFC2404). These algorithms are based on the underlying MD5 and SHA1 algorithms, respectively. The goal of the authentication computation is to generate a unique digital representation, called a digest, for the input data.
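By way of illustration only, the two IPSec authentication algorithms may be sketched in software using the Python standard library. The sketch assumes, per RFC2403 and RFC2404, that the "-96" suffix denotes truncation of the HMAC output to its first 96 bits (12 bytes); the function names hmac_md5_96 and hmac_sha1_96 are hypothetical and are not taken from any specification.

```python
import hashlib
import hmac


def hmac_md5_96(key: bytes, data: bytes) -> bytes:
    """HMAC-MD5 digest truncated to the first 96 bits (12 bytes)."""
    return hmac.new(key, data, hashlib.md5).digest()[:12]


def hmac_sha1_96(key: bytes, data: bytes) -> bytes:
    """HMAC-SHA1 digest truncated to the first 96 bits (12 bytes)."""
    return hmac.new(key, data, hashlib.sha1).digest()[:12]
```

In either case the full 128-bit or 160-bit digest is still computed; only the transmitted value is shortened.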
Both MD5 and SHA1 specify that data is to be processed in 512-bit blocks. If the data in a packet to be processed is not a multiple of 512 bits, padding is applied to round up the data length to a multiple of 512 bits. Thus, if a data packet that is received by a chip for authentication is larger than 512 bits, the packet is broken into 512-bit data blocks for authentication processing. If the packet is not a multiple of 512 bits, the data left over following splitting of the packet into complete 512-bit blocks must be padded in order to reach the 512-bit block processing size. The same is true if a packet contains fewer than 512 bits of data. For reference, a typical Ethernet packet is up to 1,500 bytes. When such a packet is split into 512-bit blocks, only the last block is padded, so that overall a relatively small percentage of padding overhead is required. However, for shorter packets, the padding overhead can be much higher. For example, if a packet has just over 512 bits, it will need to be divided into two 512-bit blocks, the second of which is mostly padding, so that padding overhead approaches 50% of the processed data. The authentication of such short data packets is particularly burdensome and time consuming using the conventionally implemented MD5 and SHA1 authentication algorithms.
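By way of illustration only, the padding arithmetic described above may be sketched in Python. The sketch assumes the padding rule of RFC1321 and the SHA1 specification, under which a single "1" bit and a 64-bit length field are always appended before the total is rounded up to a multiple of 512 bits. The function names padded_blocks and padding_overhead are hypothetical and chosen for this example.

```python
BLOCK_BITS = 512
MIN_PAD_BITS = 1 + 64  # mandatory "1" bit plus 64-bit length field (assumed per RFC1321)


def padded_blocks(data_bits: int) -> int:
    """Number of 512-bit blocks processed for a message of data_bits bits."""
    return (data_bits + MIN_PAD_BITS + BLOCK_BITS - 1) // BLOCK_BITS


def padding_overhead(data_bits: int) -> float:
    """Fraction of the processed bits that are padding rather than data."""
    total_bits = padded_blocks(data_bits) * BLOCK_BITS
    return (total_bits - data_bits) / total_bits
```

Under these assumptions, a 1,500-byte packet (12,000 bits) occupies 24 blocks with roughly 2% padding, whereas a 520-bit packet occupies two blocks with roughly 49% padding.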
For each 512-bit data block, a set of operations including non-linear functions, shift functions and additions, called a “round,” is applied to the block repeatedly. MD5 and SHA1 specify 64 rounds and 80 rounds, respectively, based on different non-linear and shift functions, as well as different operating sequences. In every round, the operation starts with a set of hash states (referred to as the “context”) held by hash state registers (in hardware) or variables (in software), and ends with a new set of hash states. A set comprises four hash states for MD5 and five for SHA1, corresponding to the number of registers used. MD5 and SHA1 each specify a set of constants as the initial hash states for the first 512-bit block. Each subsequent block uses initial hash states obtained by adding the initial hash states and the ending hash states of the previous block.
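By way of illustration only, the round structure and the chaining of hash states may be sketched for MD5 in Python, following the constants and round functions of RFC1321. This is a software sketch and not the hardware implementation discussed herein; the names process_block and md5_digest are hypothetical. The final line checks the sketch against the MD5 implementation of the Python standard library.

```python
import hashlib
import math
import struct

MASK = 0xFFFFFFFF
# Initial hash states: the four constants specified by RFC1321.
INIT_STATE = (0x67452301, 0xEFCDAB89, 0x98BADCFE, 0x10325476)
# Per-round shift amounts and additive constants from RFC1321.
SHIFTS = [7, 12, 17, 22] * 4 + [5, 9, 14, 20] * 4 + [4, 11, 16, 23] * 4 + [6, 10, 15, 21] * 4
CONSTS = [int(abs(math.sin(i + 1)) * 2**32) & MASK for i in range(64)]


def rotl(x, n):
    """Rotate a 32-bit value left by n bits."""
    return ((x << n) | (x >> (32 - n))) & MASK


def process_block(state, block):
    """Apply the 64 MD5 rounds to one 512-bit block, then add the hash states."""
    m = struct.unpack("<16I", block)
    a, b, c, d = state
    for i in range(64):
        if i < 16:
            f, g = (b & c) | (~b & MASK & d), i
        elif i < 32:
            f, g = (d & b) | (~d & MASK & c), (5 * i + 1) % 16
        elif i < 48:
            f, g = b ^ c ^ d, (3 * i + 5) % 16
        else:
            f, g = c ^ (b | (~d & MASK)), (7 * i) % 16
        f = (f + a + CONSTS[i] + m[g]) & MASK
        a, d, c = d, c, b
        b = (b + rotl(f, SHIFTS[i])) & MASK
    # Addition of the starting and ending hash states; the sums become the
    # initial hash states for the next block.
    return tuple((s + w) & MASK for s, w in zip(state, (a, b, c, d)))


def md5_digest(data):
    """Pad the data, process it block by block, and return the 128-bit digest."""
    padded = data + b"\x80"
    padded += b"\x00" * ((56 - len(padded)) % 64)
    padded += struct.pack("<Q", (8 * len(data)) & 0xFFFFFFFFFFFFFFFF)
    state = INIT_STATE
    for offset in range(0, len(padded), 64):
        state = process_block(state, padded[offset:offset + 64])
    return struct.pack("<4I", *state)


# Self-check against the standard library implementation.
assert md5_digest(b"abc") == hashlib.md5(b"abc").digest()
```

The tuple returned by process_block corresponds to the hash state addition described above, which is the step that cannot begin until the last round of the block has completed.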
Typically, MD5 and SHA1 rounds are translated into clock cycles in hardware implementations. The addition of the hash states, to the extent that it cannot be performed in parallel with other round operations, requires overhead clock cycles in the whole computation. The computation of the padded portion of the data is also generally considered performance overhead because it is not part of the true data. Accordingly, the performance of MD5 and SHA1 degrades the most when the length of the padding is about the same as the length of the data (e.g., as described above, when a packet has just fewer than 512 bits of data and the padding logic requires an extra 512-bit block to be added to hold the pad values).
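By way of illustration only, the effect of these overheads may be estimated with a simple cycle model in Python. The model rests on assumptions not stated in the text above: one round per clock cycle, a hypothetical parameter add_cycles for the overhead cycles spent on the hash state addition after each block, and the RFC1321 padding rule (at least 65 bits appended). The function names are likewise hypothetical.

```python
ROUNDS = {"md5": 64, "sha1": 80}  # rounds per 512-bit block


def padded_blocks(data_bits):
    """Blocks processed after padding (assumes at least 65 bits are appended)."""
    return (data_bits + 65 + 511) // 512


def hash_cycles(alg, data_bits, add_cycles=1):
    """Estimated clock cycles: one cycle per round plus add_cycles per block."""
    return padded_blocks(data_bits) * (ROUNDS[alg] + add_cycles)


def cycles_per_data_bit(alg, data_bits, add_cycles=1):
    """Estimated cycles spent per bit of true (unpadded) data."""
    return hash_cycles(alg, data_bits, add_cycles) / data_bits
```

Under these assumptions, 440 bits of data take 65 cycles with MD5, whereas 504 bits take 130 cycles because the padding spills into a second block.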
Moreover, the HMAC-MD5-96 and HMAC-SHA1-96 algorithms used in IPSec expand MD5 and SHA1, respectively, by performing two loops of operations. The HMAC algorithm for either MD5 or SHA1 (HMAC-x algorithm) is depicted in FIG. 1. The inner hash (inner loop) and the outer hash (outer loop) use different initial hash states. The outer hash is used to compute a digest based on the result of the inner hash. Since the result of the inner hash is 128 bits long for MD5 and 160 bits long for SHA1, the result must always be padded up to 512 bits, and the outer hash only processes the one 512-bit block of data. HMAC-MD5-96 and HMAC-SHA1-96 provide a higher level of security; however, additional time is needed to perform the outer hash operation. This additional time becomes significant when the length of the data to be processed is short, in which case the time required to perform the outer hash operation is comparable to the time required to perform the inner hash operation.
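By way of illustration only, the two loops of the HMAC-x algorithm may be sketched in Python following RFC2104. Because the standard hashlib module does not expose a way to set the initial hash states directly, the sketch instead hashes the key block combined with the ipad or opad value as the first 512-bit block of each loop, which is assumed here to be what yields the different initial hash states for the inner and outer hashes. The function name hmac_two_loops is hypothetical.

```python
import hashlib
import hmac

BLOCK_BYTES = 64  # 512-bit block size shared by MD5 and SHA1


def hmac_two_loops(key: bytes, data: bytes, hash_name: str = "sha1") -> bytes:
    """Compute HMAC as an inner hash followed by an outer hash (RFC2104)."""
    # Keys longer than one block are hashed first; shorter keys are zero-padded.
    if len(key) > BLOCK_BYTES:
        key = hashlib.new(hash_name, key).digest()
    key = key.ljust(BLOCK_BYTES, b"\x00")
    ipad_block = bytes(b ^ 0x36 for b in key)
    opad_block = bytes(b ^ 0x5C for b in key)
    # Inner loop: processes the ipad block followed by all of the data.
    inner = hashlib.new(hash_name, ipad_block + data).digest()
    # Outer loop: processes the opad block followed by the short inner result,
    # which is padded up to a single 512-bit block.
    return hashlib.new(hash_name, opad_block + inner).digest()


# Self-check against the standard library HMAC implementation.
assert hmac_two_loops(b"key", b"data") == hmac.new(b"key", b"data", hashlib.sha1).digest()
```

For a short packet, the inner loop may itself process only one data block, so the outer loop accounts for a comparable share of the total work.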
Authentication represents a significant proportion of the time required to complete cryptography operations in the application of cryptography protocols incorporating both encryption/decryption and MD5 and/or SHA1 authentication functionalities. In the case of IPSec, authentication is often the time-limiting step, particularly for the processing of short packets, and thus creates a data processing bottleneck. Accordingly, techniques to accelerate authentication and relieve this bottleneck would be desirable. Further, accelerated implementations of multi-round authentication algorithms would benefit any application of these authentication algorithms.