Some complex operations used in encryption require more processing than can be performed during a single processing cycle, such as a single clock cycle of the hardware on which the procedure is executing. These procedures are sometimes implemented in multiple stages, each of which can be completed during a single processing cycle. When there are separate processors available to perform each of the stages, as in an array processor, then throughput is enhanced. Throughput is typically measured in bytes of output per second, where a byte is eight binary digits (bits).
Hardware implementations with a separate processing block, module or element for each stage, each specially designed to perform the arithmetic and logic operations required at its stage, often offer the highest throughput. These implementations allow a data block to be processed in any stage during a single clock cycle. The partially processed block is then passed to the next stage, where it is processed during the next clock cycle. In ideal circumstances, the first stage can process another block of data while the next stage is processing the block produced by the first stage during the previous cycle.
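The lockstep advance described above can be sketched as a short simulation (an illustrative model only, not part of any described device; the function name and block representation are assumed for illustration):

```python
# Illustrative model (assumed) of a feedback-free multi-stage pipeline:
# on every clock cycle each stage passes its block to the next stage, so
# once the pipeline is full, one fully processed block emerges per cycle.

def run_pipeline(blocks, num_stages):
    """Simulate a feedback-free pipeline; return (outputs, total_cycles)."""
    stages = [None] * num_stages          # block held by each stage register
    inputs = list(blocks)
    outputs = []
    cycles = 0
    while inputs or any(s is not None for s in stages):
        if stages[-1] is not None:        # last stage emits a finished block
            outputs.append(stages[-1])
        stages[1:] = stages[:-1]          # all stages advance in lockstep
        stages[0] = inputs.pop(0) if inputs else None
        cycles += 1
    return outputs, cycles

outputs, cycles = run_pipeline(range(100), 16)
# 100 blocks through 16 stages finish in 116 cycles: a 16-cycle fill
# latency, then one block per cycle thereafter.
```

The model shows why a full pipeline without feedback sustains one output block per clock cycle regardless of the number of stages.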
However, when the procedure also calls for feedback, in which a block of output data from a later stage is combined with a block of input data at an earlier stage, loss of throughput may result, because the next data block cannot be processed until processing of the current block is complete. This loss of throughput is significant in some circumstances. For example, it is a limiting factor in networks in which encryption and decryption are performed by a common device serving as a network interface for a plurality of network servers.
A server is a process executing on a computing device to provide a computer resource in response to a request message received from a client process executing on another computing device. Computer resources include files of data, files of instructions for particular devices, and the functionality of devices such as printers and scanners. The terms server and client also refer to the computing device on which the server process and client process, respectively, execute.
FIG. 1A is a block diagram of a network encryption system for purposes of illustrating a context in which throughput decline caused by feedback in a multi-stage processing system may occur.
In this example, a bank of server devices 140 is connected to a network 125, such as the Internet, through a gateway 130 and links 141a–141e, which use the Transmission Control Protocol (“TCP”) and the Internet Protocol (“IP”). One or more clients 120a–120d are coupled to network 125 and can request resources from servers 140. Assume, for purposes of illustrating an example, that the system is designed to handle 200 sessions per second per server, where a session is a data stream from a client such as client 120a, carried in one or more packets traveling over the network 125. A typical session carries an amount of data expected to range from hundreds of bytes to tens of kilobytes. For five servers, the gateway 130 must therefore have the capacity to handle 1,000 sessions per second. A practical system may involve hundreds of servers. To satisfy such loads, a high-performance encryption/decryption engine is needed that processes data streams at several billions of bits per second (gigabits per second, “Gbps”). For example, an encryption/decryption engine with a throughput of 5 Gbps is desirable.
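The sizing above can be checked with simple arithmetic (the 64-kilobyte per-session figure below is an assumed upper value within the "tens of kilobytes" range stated in the example):

```python
# Back-of-the-envelope load estimate for the gateway 130 of FIG. 1A.
# The per-session size is an assumption within the range stated in the text.

SESSIONS_PER_SERVER_PER_SEC = 200
SESSION_BYTES = 64 * 1024            # assumed "tens of kilobytes" per session

def offered_load_bps(num_servers):
    """Aggregate encryption/decryption load presented to the gateway."""
    sessions_per_sec = SESSIONS_PER_SERVER_PER_SEC * num_servers
    return sessions_per_sec * SESSION_BYTES * 8   # bits per second

print(SESSIONS_PER_SERVER_PER_SEC * 5)   # 1000 sessions/s for five servers
print(offered_load_bps(100) / 1e9)       # ~10.5 Gbps for 100 servers
```

Under these assumptions, five servers offer roughly 0.5 Gbps of traffic, and a practical system with hundreds of servers quickly reaches the multi-gigabit range, consistent with the 5 Gbps target.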
Block-based symmetric encryption/decryption algorithms have been implemented in integrated circuit hardware devices having clock speeds of 125 million cycles per second (“MHz”) that use 16 stages to process 64-bit blocks of data. These procedures also employ feedback, in which the 64-bit block of data output by the 16th stage is input to the first stage along with the next 64-bit block of data. Commercial integrated circuit chips that implement such an engine are available from companies such as HiFn, VLSI and Broadcom.
FIG. 1B is a block diagram illustrating data undergoing encryption processing using such devices at a particular clock cycle. In FIG. 1B, an input data stream 180 comprising a series of 64-bit blocks travels from left to right. Data blocks on an input queue pass through a first stage 101, a second stage 102 and intervening stages to a last stage 116, advancing one stage on each successive clock cycle. In one such device, there are 16 stages. The number of bits carried on the paths between stages and queues is known as the channel width 120 and is indicated by a number positioned adjacent to a slash intersecting the path.
Each data block undergoes additional processing that changes its contents as it progresses from one stage to the next. From the last stage 116, the fully processed output block joins an output data stream 190 on an output queue. The fully processed output block is also passed back to the first stage 101 over a feedback channel 119. In the clock cycle illustrated in FIG. 1B, the first, second and third input blocks have passed through the 16 processing stages to become the first, second and third output blocks 191, 192, 193, respectively. At the clock cycle illustrated, the partially processed fourth block 174 is in the second stage 102, and no block of data is being processed in any of the other stages. The fifth input block 185 waits on the input queue, for an additional 14 clock cycles, until the fourth block completes its transit of the 16 stages and becomes an output block at the last stage 116.
Because of this feedback requirement, overall data throughput is reduced. Without feedback, a new 64-bit block of output would be produced on each clock cycle, 125 million times per second, for a throughput of 8 Gbps (computed as the product 125×10⁶×64). Because of the feedback requirement, each following block must wait 16 clock cycles for the preceding block to be output before it can enter the first stage along with the feedback data. For example, the fifth input block 185 waits 16 clock cycles for the fourth block to be processed through the last stage 116 before beginning its transit. This reduces throughput by a factor of 16, to 0.5 Gbps. As a result, the architecture of FIG. 1B is unsuitable for a gateway with a large number of servers, such as in the example system of FIG. 1A, which are simultaneously serving an even larger number of clients.
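The throughput figures above follow directly from the stated constants; a minimal check (the constant names are chosen here for illustration):

```python
# Throughput of the 16-stage, 64-bit, 125 MHz engine of FIG. 1B,
# with and without the feedback requirement (figures from the text).

CLOCK_HZ = 125e6      # 125 million cycles per second
BLOCK_BITS = 64       # bits per data block
STAGES = 16           # pipeline stages; also the feedback penalty in cycles

# Without feedback, the full pipeline emits one block per cycle.
throughput_no_feedback = CLOCK_HZ * BLOCK_BITS             # 8 Gbps

# With feedback, only one block is in flight: one block per 16 cycles.
throughput_with_feedback = CLOCK_HZ * BLOCK_BITS / STAGES  # 0.5 Gbps

print(throughput_no_feedback / 1e9)    # 8.0
print(throughput_with_feedback / 1e9)  # 0.5
```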
Some throughput can be recovered if the clock cycle time is shortened, or if more computations can be performed in each cycle at most stages so that one or more stages can be removed. Hardware solutions are typically faster than software solutions; however, hardware speed improvements generally occur only as a result of improvements in foundational technologies. Consequently, only marginal gains in throughput appear obtainable in the near term by decreasing the number of stages.
Based on the foregoing, there is a clear need for increasing throughput in multi-stage processing systems with feedback in circumstances involving multiple data streams using readily attainable technology.