1. Field of the Invention
This invention relates generally to parallel coprocessors and more specifically to the load balancing of parallel coprocessors.
2. Background Information
A computer network is a geographically distributed collection of interconnected communication links for transporting data between nodes, such as computers. Many types of computer networks are available, with the types ranging from local area networks (LANs) to wide area networks (WANs). The nodes typically communicate by exchanging discrete frames or packets of data according to pre-defined protocols, such as the Transmission Control Protocol (TCP).
A computer network often comprises one or more intermediate nodes, such as switches or routers. These intermediate nodes typically comprise a central processor that enables the intermediate node to, inter alia, route or switch the packets of data along the interconnected links from a source node that originates the data to a destination node that is designated to receive the data.
To secure data that is transmitted over the interconnected links, e.g., in the case of a virtual private network (VPN), intermediate nodes often incorporate a technique for encrypting and decrypting data contained in the packets. Often this technique employs an encryption standard, such as the conventional Data Encryption Standard (DES) or the triple-DES (3DES), as described in ANSI X9.52-1998, available from the American National Standards Institute, Washington, D.C., to perform the actual encryption of the data. These encryption standards typically encrypt and decrypt data by applying a mathematical transform to the data. Often the processing necessary to apply this mathematical transform is quite intensive, particularly for intermediate nodes configured to encrypt/decrypt VPN traffic over the secure connections. To avoid overburdening the central processor, these nodes often employ one or more coprocessors that are specifically dedicated to offloading the computational burden associated with encryption from the central processor.
A coprocessor is a highly specialized processing unit that is typically dedicated to performing a single function, such as encryption. Coprocessors typically comprise processing elements and logic implemented as, e.g., application-specific integrated circuits (ASICs) that are often tailored to enable the coprocessor to perform its dedicated function at a very high rate of speed. Moreover, each coprocessor is typically associated with its own private first-in-first-out (FIFO) queue that is configured to receive packets for processing by the coprocessor.
In a typical intermediate node that contains a central processor and more than one coprocessor, packets are processed by the coprocessors as follows. First, the central processor selects a coprocessor that is to process the packet. Next, the central processor places the packet on the selected coprocessor's FIFO queue. When the coprocessor completes its processing of the packet, it notifies the central processor that the processing has completed. The central processor then performs whatever additional processing may be required, such as routing or switching the packet.
Intermediate devices often employ a scheduling algorithm to schedule the processing of packets on the various coprocessors. One such scheduling algorithm is a conventional round-robin algorithm. In a typical round-robin implementation, coprocessors are selected in a fixed cyclic order. When a packet is ready to be processed, the next coprocessor in the order is selected to process the packet. For example, assume an intermediate device has two identical coprocessors (CP1 and CP2) and the central processor is configured to place packets on the queues using the round-robin algorithm. The processor begins by placing the first packet on CP1's queue. The next packet is then placed on CP2's queue. The cycle then repeats and the next packet is placed on CP1's queue and so on.
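The round-robin selection described above can be illustrated with a minimal sketch. The function name, the packet labels, and the use of Python lists as stand-ins for the FIFO queues are assumptions made for illustration only and do not represent any particular implementation.

```python
from itertools import cycle

def round_robin_assign(packets, coprocessors):
    """Place each packet on the queue of the next coprocessor in a fixed cyclic order."""
    # One FIFO queue per coprocessor, modeled here as a plain list (assumption).
    queues = {cp: [] for cp in coprocessors}
    order = cycle(coprocessors)
    for packet in packets:
        queues[next(order)].append(packet)
    return queues

# Two identical coprocessors, as in the example above; packet labels are hypothetical.
queues = round_robin_assign(["p1", "p2", "p3", "p4", "p5"], ["CP1", "CP2"])
print(queues)
```

In this sketch the first, third, and fifth packets land on CP1's queue and the second and fourth on CP2's queue, regardless of what each packet contains.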
One problem associated with the typical round-robin implementation is that, depending on the type of packets and the order in which they are received, it is possible for the load among the coprocessors to become unbalanced. Using the example above, assume every packet CP1 receives is a large packet that requires triple-encryption (e.g., 3DES) processing and every packet assigned to CP2 is half the size and only requires single-encryption (e.g., DES) processing. As the scheduling cycle continues, the load on CP1 will become much greater than the load on CP2; thus, the overall load becomes unbalanced as CP1 bears a greater share of the overall load.
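The imbalance can be made concrete with a rough calculation. The sketch below assumes a simplified cost model in which the work for a packet is its size in bytes multiplied by the number of encryption passes (one for DES, three for 3DES); the packet sizes of 1400 and 700 bytes are likewise assumed values chosen only to match the "large" and "half the size" packets in the example.

```python
# Assumed cost model: work = packet size in bytes x number of encryption passes.
PASSES = {"DES": 1, "3DES": 3}

def cost(size_bytes, cipher):
    """Estimated work for one packet under the assumed cost model."""
    return size_bytes * PASSES[cipher]

def round_robin_loads(packets, n_coprocessors):
    """Accumulate the estimated load per coprocessor under round-robin scheduling."""
    loads = [0] * n_coprocessors
    for i, (size, cipher) in enumerate(packets):
        loads[i % n_coprocessors] += cost(size, cipher)
    return loads

# Alternating arrivals: a large 3DES packet, then a half-size DES packet (hypothetical sizes).
packets = [(1400, "3DES"), (700, "DES")] * 10
loads = round_robin_loads(packets, 2)
print(loads)  # index 0 is CP1, index 1 is CP2
```

Under these assumptions, each coprocessor receives ten packets, yet CP1 accumulates six times the estimated work of CP2.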
Another commonly used scheduling algorithm is the Shortest-Queue-First (SQF) algorithm. The SQF algorithm uses the number of entries in a queue as the criterion for selecting a coprocessor that is to process a packet. The coprocessor with the fewest entries in its FIFO queue is the coprocessor that is selected. Using the example above, assume the central processor uses the SQF algorithm to schedule packet processing on CP1 and CP2, and that CP1 has 2 entries on its queue and CP2 has 3 entries on its queue. Further assume the central processor has a packet that needs to be processed by one of the coprocessors. To select a coprocessor, the processor looks at the number of entries on the queues for both CP1 and CP2 and chooses the coprocessor whose queue has fewer entries. Since CP1 has fewer entries on its queue, it will be selected to process the packet.
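A minimal sketch of SQF selection follows, using the two-coprocessor example with queues holding two and three entries. The function name, the queue representation, and the entry labels are hypothetical; ties are resolved in favor of the first coprocessor listed, which is an assumption and not something the algorithm itself prescribes.

```python
def select_sqf(queues):
    """Return the coprocessor whose FIFO queue currently holds the fewest entries."""
    # min() returns the first minimal key, so ties go to the earliest-listed coprocessor.
    return min(queues, key=lambda cp: len(queues[cp]))

# Queue contents are placeholder labels; only the entry counts matter to SQF.
queues = {"CP1": ["a", "b"], "CP2": ["c", "d", "e"]}
chosen = select_sqf(queues)
queues[chosen].append("new_packet")
print(chosen, len(queues["CP1"]), len(queues["CP2"]))
```

Here CP1 is selected because its queue holds two entries against three, and after the new packet is queued both coprocessors hold three entries.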
One problem with the SQF algorithm is that it does not take into consideration the amount of resources that may be required to process a particular packet. Thus, like the round-robin algorithm, an imbalance in the load between coprocessors may be introduced depending on the packets being processed. For example, assume CP1 has three 100-byte packets on its queue requiring DES processing and CP2 has two 1400-byte packets on its queue requiring 3DES processing. Further assume a 50-byte packet requiring DES processing is to be scheduled for processing. The central processor will place the 50-byte packet on CP2's queue rather than CP1's queue simply because CP2's queue has fewer entries, despite the fact that those entries may require much more processing than the entries on CP1's queue. CP2 will incur a greater share of the load, and the overall load among the coprocessors becomes unbalanced.
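The figures in this example can be worked through in a short sketch. As an assumption for illustration, the work for a packet is again estimated as its size in bytes multiplied by the number of encryption passes (one for DES, three for 3DES); the function names are hypothetical.

```python
# Assumed cost model: work = packet size in bytes x number of encryption passes.
PASSES = {"DES": 1, "3DES": 3}

def queue_load(queue):
    """Estimated total work queued on one coprocessor under the assumed cost model."""
    return sum(size * PASSES[cipher] for size, cipher in queue)

def select_sqf(queues):
    """Shortest-Queue-First: pick the queue with the fewest entries, ignoring their cost."""
    return min(queues, key=lambda cp: len(queues[cp]))

queues = {
    "CP1": [(100, "DES")] * 3,    # three 100-byte DES packets
    "CP2": [(1400, "3DES")] * 2,  # two 1400-byte 3DES packets
}
loads_before = {cp: queue_load(q) for cp, q in queues.items()}

chosen = select_sqf(queues)
queues[chosen].append((50, "DES"))  # the 50-byte DES packet to be scheduled
loads_after = {cp: queue_load(q) for cp, q in queues.items()}
print(chosen, loads_before, loads_after)
```

Under the assumed cost model, CP2 already carries an estimated 8400 units of work against 300 for CP1, yet SQF still selects CP2 because it counts entries and not the work they represent.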
Neither the round-robin technique nor the SQF technique selects a coprocessor on the basis of the load incurred by the coprocessor. Rather, these techniques select a coprocessor using some other metric, such as queue size or the number of packets received. Thus, it is quite possible for the load among the coprocessors to become significantly unbalanced, where some coprocessors are heavily loaded while others are not. It would be desirable to have a technique that optimally allocates the processing of packets among a series of coprocessors to ensure that the allocation will not inordinately unbalance the load among the coprocessors.