Before the advent of the Internet, corporate data networks typically consisted of dedicated telecommunications lines leased from a public telephone company. Since the hardware implementation of the data networks was the exclusive property of the telephone company, a regulated utility having an absolute monopoly on the medium, security was not much of a problem; the single provider was contractually obligated to be secure, and the lack of access to the switching network from outside made it more or less resistant to external hacking and tampering.
Today, more and more enterprises are discovering the value of the Internet, which is currently more widely deployed than any other single computer network in the world and is therefore readily available for use by a multinational corporate network. Since it is also a consumer-level product, Internet access can usually be provided at much lower cost than the same service provided by a dedicated telephone company network. Finally, the availability of the Internet to the end user makes it possible for individuals to easily access the corporate network from home or other remote locations.
The Internet, however, is run by public companies, using open protocols and in-band routing and control that are open to scrutiny. This environment makes it a fertile proving ground for hackers. Industrial espionage is a lucrative business today, and companies that do business on the Internet leave themselves open to attack unless they take precautions.
Several standards exist today for privacy and strong authentication on the Internet. Privacy is accomplished through encryption/decryption. Typically, encryption/decryption is performed based on algorithms which are intended to allow data transfer over an open channel between parties while maintaining the privacy of the message contents. This is accomplished by encrypting the data using an encryption key by the sender and decrypting it using a decryption key by the receiver (sometimes, the encryption and decryption keys are the same).
Types of Encryption Algorithms
Encryption algorithms can be classified into public-key and secret-key algorithms. In secret-key algorithms, both keys are secret, whereas in public-key algorithms, one of the keys is known to the general public. Block ciphers are representative of the secret-key cryptosystems in use today. Usually, for block ciphers, the encryption key is the same as the decryption key. A block cipher takes a block of data, typically 32 to 128 bits, as input and produces the same number of bits as output. The encryption and decryption are performed using a key, anywhere from 56 to 128 bits in length. The encryption algorithm is designed such that it is very difficult to decrypt a message without knowing the key.
In addition to block ciphers, Internet security protocols also make heavy use of public-key algorithms. A public-key cryptosystem such as the Rivest, Shamir, Adleman (RSA) cryptosystem described in U.S. Pat. No. 5,144,667 issued to Pogue and Rivest uses two keys, only one of which is made public. Once someone publishes a key, anyone may send that person a secret message using that key. However, decryption of the message can only be accomplished by use of the secret key. The advantage of such public-key encryption is that secret keys do not need to be distributed to all parties of a conversation beforehand. In contrast, if only secret-key encryption were used, multiple secret keys would have to be generated, one for each party intended to receive the message, and each secret key would have to be privately communicated. Attempting to communicate the secret key privately results in the same problem as in sending the message itself using only secret-key encryption; this is called the key distribution problem.
Key exchange is another application of public-key techniques. In a key exchange protocol, two parties can agree on a secret key even if their conversation is intercepted by a third party. The Diffie-Hellman exponential key exchange, described in U.S. Pat. No. 4,200,770, is an example of such a protocol.
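The exchange described above can be sketched in a few lines of Python. This is a minimal illustration only: the modulus P, the base G, and the helper names dh_keypair and dh_shared are assumptions chosen for the example, and the 127-bit modulus is far too small for real use.

```python
import secrets

def dh_keypair(p, g):
    """Pick a private exponent x and return (x, g^x mod p)."""
    x = secrets.randbelow(p - 3) + 2   # private exponent in [2, p - 2]
    return x, pow(g, x, p)

def dh_shared(their_public, my_private, p):
    """Raise the other party's public value to our private exponent."""
    return pow(their_public, my_private, p)

# Illustrative parameters (assumed for this sketch, not from the source).
P = 2**127 - 1   # a Mersenne prime
G = 3

a_priv, a_pub = dh_keypair(P, G)   # first party
b_priv, b_pub = dh_keypair(P, G)   # second party

# Only a_pub and b_pub cross the open channel; both sides derive the same key.
key_a = dh_shared(b_pub, a_priv, P)
key_b = dh_shared(a_pub, b_priv, P)
```

An eavesdropper sees P, G, a_pub and b_pub, but recovering the shared value from those requires solving a discrete logarithm problem.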
Most public-key algorithms, such as RSA and Diffie-Hellman key exchange, are based on modular exponentiation, which is the computation of α^x mod p. This expression means “multiply α by itself x times, divide the answer by p, and take the remainder.” This computation is very expensive to perform, for the following reason. In order to perform this operation, many repeated multiplications and divisions are required, although techniques such as Montgomery's method, described in “Modular Multiplication Without Trial Division,” from Mathematics of Computation, Vol. 44, No. 170 of April 1985, can reduce the number of divisions required. In addition, the numbers used are very large (typically 1024 bits or more), so the multiply and divide instructions found in common CPUs cannot be used directly. Instead, special algorithms that break down the large multiplications and divisions into operations small enough to be performed on a CPU must be used. These algorithms usually have a run time proportional to the square of the number of machine words involved. These factors result in multiplication of large numbers being a very slow operation. For example, a Pentium® can perform a 32×32-bit multiply in 10 clock cycles. A 2048-bit number can be represented in 64 32-bit words. A 2048×2048-bit multiply requires 64×64 separate 32×32-bit multiplications, which takes 40960 clocks on the Pentium. An exponentiation with a 2048-bit exponent requires up to 4096 multiplications if done in the normal way, which requires about 167 million clock cycles. If the Pentium is running at 166 MHz, the entire operation requires roughly one second. This example does not consider the time required to perform the divisions, either! Clearly, a common CPU such as a Pentium cannot expect to do key generation and exchange at any great rate.
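The cost figures above can be reproduced with a short Python sketch. It assumes that exponentiation "done in the normal way" means binary square-and-multiply, which performs one squaring per exponent bit plus one extra multiplication per 1 bit; the function name modexp and the constant names are chosen for this illustration.

```python
def modexp(base, exponent, modulus):
    """Left-to-right square-and-multiply.

    Returns (base**exponent % modulus, number of modular multiplications).
    """
    result = 1
    multiplications = 0
    for bit in bin(exponent)[2:]:                 # most significant bit first
        result = (result * result) % modulus      # square for every bit
        multiplications += 1
        if bit == "1":
            result = (result * base) % modulus    # extra multiply for a 1 bit
            multiplications += 1
    return result, multiplications


# Cost estimate using the figures quoted in the text.
WORD_BITS = 32
OPERAND_BITS = 2048
CLOCKS_PER_WORD_MULTIPLY = 10
CLOCK_HZ = 166_000_000

words = OPERAND_BITS // WORD_BITS                                # 64 words
clocks_per_multiply = words * words * CLOCKS_PER_WORD_MULTIPLY   # 40960 clocks
worst_case_multiplies = 2 * OPERAND_BITS                         # 4096
total_clocks = worst_case_multiplies * clocks_per_multiply       # 167,772,160
seconds = total_clocks / CLOCK_HZ                                # about 1.01
```

The worst case of 4096 multiplications arises when every bit of the 2048-bit exponent is 1; a typical exponent has about half its bits set and needs roughly three quarters as many.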
Because public-key algorithms are so computationally intensive, they are typically not used to encrypt entire messages. Instead, private-key cryptosystems are used for message transfer. The private key used to encrypt the message, called the session key, is chosen at random and encrypted using a public key. The encrypted session key, as well as the encrypted message, are then sent to the other party. The other party uses its secret key to decrypt the session key, at which point the message may be decrypted using the session key. A different session key is used for each communication, so that if one session key is ever broken, only the one message encrypted with it may be read. This public-key/private-key method can also be used to protect continuous communications, such as interactive terminal sessions that never terminate in normal operation. In this case, the session key is periodically changed (e.g. once an hour) by repeating the public-key generation technique. Again, frequent changing of the session key limits the amount of data compromised if the encryption is broken.
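The session-key scheme described above can be sketched as follows. Everything here is a toy stand-in chosen for the illustration: the RSA key is built from two small known Mersenne primes and used without padding, and the message cipher is a SHA-256 counter keystream rather than a block cipher such as DES. The names send, receive and keystream_xor are hypothetical. The sketch requires Python 3.8 or later for the modular inverse via pow.

```python
import hashlib
import secrets

# Toy RSA key pair (assumed for illustration; far too small for real use).
P, Q = 2**61 - 1, 2**89 - 1            # two known Mersenne primes
N = P * Q                               # public modulus
E = 65537                               # public exponent
D = pow(E, -1, (P - 1) * (Q - 1))       # secret exponent

def keystream_xor(key: bytes, data: bytes) -> bytes:
    """Stand-in secret-key cipher: XOR data with a SHA-256 counter keystream."""
    out = bytearray()
    for offset in range(0, len(data), 32):
        pad = hashlib.sha256(key + offset.to_bytes(8, "big")).digest()
        out.extend(b ^ k for b, k in zip(data[offset:offset + 32], pad))
    return bytes(out)

def send(message: bytes, public_key):
    """Encrypt the message under a fresh session key; wrap that key with RSA."""
    n, e = public_key
    session_key = secrets.token_bytes(16)             # random per message
    wrapped = pow(int.from_bytes(session_key, "big"), e, n)
    return wrapped, keystream_xor(session_key, message)

def receive(wrapped: int, ciphertext: bytes, secret_key):
    """Unwrap the session key with the secret exponent, then decrypt."""
    n, d = secret_key
    session_key = pow(wrapped, d, n).to_bytes(16, "big")
    return keystream_xor(session_key, ciphertext)

wrapped, ciphertext = send(b"quarterly results attached", (N, E))
plaintext = receive(wrapped, ciphertext, (N, D))
```

Only one modular exponentiation is needed per message in each direction, regardless of message length, which is why the expensive public-key step is confined to the session key.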
Prior Art
Network-level encryption devices that allow access to corporate networks using a software-based solution are experiencing widespread usage. Products such as Raptor Eagle Remote and others perform encryption entirely in software. The software limits the encryptor's throughput. Session key generation using public-key techniques may take several minutes. For this reason, session keys are not re-generated as often as some people would like. However, software does have the advantage that the encryption algorithms are easily changed in response to advances in the field.
Other devices use a combination of hardware and software. For example, the Northern Telecom (now Entrust) Sentinel X.25 encryption product uses a DES chip produced by AMD to perform the DES secret-key encryption. Hardware implementations of DES are much faster, since DES was designed for efficient implementation in hardware. A transposition that takes many CPU instructions in software can be done using parallel special-purpose lookup tables and wiring.
The Sentinel also makes use of a Motorola DSP56000 processor to perform the public-key operations. At the time, the single-cycle multiplication ability of the DSP made this approach significantly faster than implementing the public-key algorithms on regular CISC microprocessors.
Most hardware encryption devices are severely limited in the number of algorithms that they can implement. For example, the AMD chip used in the Sentinel performs only DES. More recent devices from Hi/Fn can perform DES and RC4. However, implementing either RC5 or IDEA requires another product.
A preferred high-performance programmable network encryption device, integrated into a single chip, is a parallel-pipelined processor system whose instruction set is optimized for common encryption algorithms. The present invention realizes the advantages of both hardware and software approaches. Since the processor is a programmable processor, any encryption algorithm may be implemented, in contrast to a hardware-implemented encryption processor, which is dedicated to executing only one algorithm. However, the processor's architecture permits parallel computations of a nature useful for encryption, so its performance more closely approximates that of a dedicated hardware device.
In accordance with a preferred implementation of the invention, an electronic encryption device comprises an array of processing elements. Each processing element comprises an instruction memory for storing a round of an encryption algorithm, the round comprising a sequence of instructions. The processing element also includes a processor for implementing the round from the instruction memory and data storage for storing encryption data operands and encrypted data resulting from implementing the round. Each processing element of the array implements one of the rounds and transfers results to successive processing elements such that the array of processing elements implements successive rounds of the encryption algorithm in a processing element pipeline.
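The processing element pipeline described above can be modeled in software to show how blocks flow through it. The following Python sketch is only a behavioral illustration under stated assumptions: each processing element's instruction memory is abstracted as a single function call, the round function toy_round is a made-up Feistel round rather than any real cipher, and the names run_pipeline and encrypt_sequential are hypothetical.

```python
MASK32 = 0xFFFFFFFF

def toy_round(block, round_key):
    """One Feistel round on a 64-bit block (hypothetical round function)."""
    left, right = block >> 32, block & MASK32
    f = ((right * 0x9E3779B1) ^ round_key) & MASK32
    return (right << 32) | (left ^ f)

def encrypt_sequential(block, round_keys):
    """Reference: apply every round to one block, one after another."""
    for key in round_keys:
        block = toy_round(block, key)
    return block

def run_pipeline(blocks, round_keys):
    """Simulate a linear array of processing elements, one round per element.

    stages[i] models the storage holding element i's most recent result,
    which element i + 1 reads on the following tick.
    """
    stages = [None] * len(round_keys)
    pending = list(blocks)
    results = []
    ticks = 0
    while pending or any(s is not None for s in stages):
        if stages[-1] is not None:
            results.append(stages[-1])        # finished block leaves the array
        # Update back to front so each element consumes the value its
        # predecessor produced on the previous tick.
        for i in range(len(stages) - 1, 0, -1):
            prev = stages[i - 1]
            stages[i] = None if prev is None else toy_round(prev, round_keys[i])
        stages[0] = toy_round(pending.pop(0), round_keys[0]) if pending else None
        ticks += 1
    return results, ticks

keys = [0x11111111, 0x22222222, 0x33333333, 0x44444444]
blocks = [0x0123456789ABCDEF, 0, 0xFFFFFFFFFFFFFFFF]
outputs, ticks = run_pipeline(blocks, keys)
```

In this model, once the pipeline has filled, one finished block emerges per tick, so the total tick count grows with the number of blocks plus the number of rounds rather than their product.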
In a preferred embodiment, the data storage has a portion thereof which is shared between adjacent processing elements of the linear array for transfer of data between adjacent processing elements of the linear array. The shared data storage is preferably comprised of dual port memories but may also comprise shared registers.
The preferred processing element comprises a control unit and an ALU. The control unit, ALU, instruction memory and data storage, including local data memory and shared data memory, are connected to a local processing element bus. The local bus is segmented by a switch into a local instruction bus segment, connecting the instruction memory and the control unit, and a local data bus segment connecting the ALU, local data memory and shared data memory. The switch permits either independent simultaneous operation on the two local bus segments or a communication between the two bus segments. Each processing element further comprises a multiplier for performing multiplication operations within the processing element.
The preferred encryption device further comprises a global random access memory and a global bus through which data is transferred between the global random access memory and the processing element data storage. A central processor is coupled to the global bus for processing data words which are wider than data words processed by the processing elements. The multipliers of the plural processing elements may be adapted for concatenation as segments of a wider multiplier used by the central processor. Preferably, each multiplier comprises partial product adders having input selection circuitry for selecting a first set of inputs when operating as an individual multiplier and a second set of inputs, including inputs from adjacent processing elements, when concatenated.
Preferably, the central processor comprises a novel adder. In the adder, each of plural adder segments has a carry output and a sum output and each of the adder segments processes a segment of each of two operands. Selectors select the carry outputs as carry inputs to successive adder segments for successive clock cycles so long as any carry results in an adder cycle. Selectors also select each sum output as an operand input to the same adder segment. Accordingly, so long as any carry results in an adder cycle, the sum output of an adder is fed back to its input, and the adder segment receives a carry input generated as a carry output from a preceding segment in a preceding cycle.
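The behavior of such an adder can be modeled in Python as a rough sketch. The segment count, segment width, and the function name segmented_add are assumptions for the illustration, as is the choice to feed zero as the second operand once the sum is being recirculated; the source does not spell out those details.

```python
def segmented_add(a, b, segments=8, seg_bits=32):
    """Add two wide operands using independent segments with deferred carries.

    Returns (sum modulo 2**(segments * seg_bits), overflow bit, cycles used).
    """
    mask = (1 << seg_bits) - 1
    sums = [(a >> (i * seg_bits)) & mask for i in range(segments)]
    addends = [(b >> (i * seg_bits)) & mask for i in range(segments)]
    carries_in = [0] * segments
    overflow = 0
    cycles = 0
    while True:
        carries_out = []
        for i in range(segments):
            total = sums[i] + addends[i] + carries_in[i]
            sums[i] = total & mask               # sum fed back to same segment
            carries_out.append(total >> seg_bits)
        cycles += 1
        overflow += carries_out[-1]              # carry out of the top segment
        addends = [0] * segments                 # assumption: operand consumed
        # Each carry output becomes the next segment's carry input
        # on the following cycle.
        carries_in = [0] + carries_out[:-1]
        if not any(carries_in):
            break                                # no carry left to resolve
    result = sum(s << (i * seg_bits) for i, s in enumerate(sums))
    return result, overflow, cycles

value, overflow, cycles = segmented_add(2**256 - 1, 1)
```

The example call is the worst case for this model: a carry ripples through every segment, one segment per cycle. Operands whose segments do not produce chained carries finish in one or two cycles, which is the attraction of deferring carry propagation instead of waiting for a full-width ripple on every addition.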
Preferably, each processing element performs a modular adjust operation to compute M mod N without using a divide circuit. Each processing element also performs a modulo add/subtract operation to compute A±B mod N. Further, each processing element performs a modulo multiply operation to compute A×B mod N.
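The source does not say here how these operations are carried out. As one conventional possibility, offered only as an assumption, all three can be built from shifts, comparisons, additions and subtractions, with no division. The Python function names below are hypothetical.

```python
def mod_adjust(m, n):
    """Compute m mod n (m >= 0, n > 0) by shift-and-subtract, with no divide."""
    if n <= 0 or m < 0:
        raise ValueError("expects m >= 0 and n > 0")
    shifted = n
    while (shifted << 1) <= m:       # align n with the top of m
        shifted <<= 1
    while shifted >= n:
        if m >= shifted:
            m -= shifted             # conditional subtract at this alignment
        shifted >>= 1
    return m

def mod_add(a, b, n):
    """(a + b) mod n for 0 <= a, b < n: one add, one conditional subtract."""
    s = a + b
    return s - n if s >= n else s

def mod_sub(a, b, n):
    """(a - b) mod n for 0 <= a, b < n: one subtract, one conditional add."""
    d = a - b
    return d + n if d < 0 else d

def mod_mul(a, b, n):
    """(a * b) mod n by double-and-add, reducing at every step."""
    a = mod_adjust(a, n)
    result = 0
    for bit in bin(b)[2:]:                       # most significant bit first
        result = mod_add(result, result, n)      # double
        if bit == "1":
            result = mod_add(result, a, n)       # add the multiplicand
    return result
```

Because every intermediate value stays below 2N, the working registers in such a scheme never need to be much wider than the modulus itself.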