Field of the Invention
The present invention generally relates to computer hardware and more specifically to a method and system for distributing work batches to processing units.
Description of the Related Art
A modern computer system may include one or more processing units that operate in parallel to perform a variety of processing tasks. FIG. 1 illustrates one such computer system. As shown, computer system 10 includes processing units 12-1, 12-2, and 12-n. A work distribution unit 14 is coupled to each of the processing units 12 and distributes work batches 16-1, 16-2, and 16-n to processing units 12-1, 12-2, and 12-n, respectively. As referred to herein, a “work batch” includes a set of processing tasks to be performed by a particular processing unit 12. Processing units 12-1, 12-2, and 12-n then perform the processing tasks specified by work batches 16-1, 16-2, and 16-n using processors 18-1, 18-2, and 18-n, respectively.
Work distribution unit 14 may distribute work batches 16 to processing units 12-1 to 12-n based on a variety of well-known distribution policies. One example of a distribution policy is referred to in the art as a “round-robin” policy. According to the round-robin policy, work distribution unit 14 transmits a work batch 16 to each processing unit 12 in the sequence of processing units 12-1 to 12-n. When work distribution unit 14 reaches the end of the sequence (processing unit 12-n), work distribution unit 14 returns to the beginning of the sequence (processing unit 12-1) and continues to distribute additional work batches 16 to the sequence of processing units 12-1 to 12-n. When work distribution unit 14 reaches a processing unit 12 that has not yet finished processing a work batch 16 that was previously distributed to that processing unit 12, work distribution unit 14 stalls until that processing unit 12 has finished processing the previously distributed work batch 16.
One problem with this approach is that processing units 12 may not all have equivalent processing capabilities. For example, processing unit 12-1 may include just one processor 18-1, while processing units 12-2 and 12-n may each include more than one processor 18. Thus, processing unit 12-1 may require a disproportionate amount of time to finish processing work batch 16-1 compared to the amount of time required by processing units 12-2 and 12-n to finish processing work batches 16-2 and 16-n, respectively. Consequently, work distribution unit 14 may repeatedly become stalled when attempting to distribute additional work batches to processing unit 12-1, thereby reducing the processing throughput of the computer system 10.
In addition, certain work batches 16 may require significantly more processing time to complete than others due to variance in the complexity of the processing tasks required to complete each batch 16. Accordingly, the work distribution unit 14 may become stalled while waiting for a processing unit 12 to finish processing a batch 16 of increased complexity, thus further reducing the throughput of the computer system 10. When a processing unit 12 that has diminished processing capabilities receives a batch 16 of increased complexity relative to other batches 16, the processing throughput of the computer system 10 may be reduced dramatically.
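By way of illustration only, the stalling behavior of the round-robin policy described above may be modeled in a few lines of Python. The sketch below is a simplified model and is not part of any system described herein: the function name `round_robin_with_stall`, the per-unit `unit_speeds` values, and the assumption that transmitting a work batch takes no time are all hypothetical.

```python
def round_robin_with_stall(batch_costs, unit_speeds):
    """Model strict round-robin distribution with stalling.

    batch_costs: abstract cost of each work batch, in distribution order.
    unit_speeds: hypothetical relative speed of each processing unit.
    Returns (finish_time, stalled_time) in abstract time units.
    """
    n = len(unit_speeds)
    busy_until = [0.0] * n  # time at which each unit finishes its current batch
    now = 0.0               # the distributor's clock
    stalled = 0.0           # total time the distributor spent waiting
    for i, cost in enumerate(batch_costs):
        unit = i % n  # strict sequence: 0, 1, ..., n-1, 0, 1, ...
        if busy_until[unit] > now:
            # The next unit in the sequence is still busy, so the
            # distributor stalls until that unit finishes.
            stalled += busy_until[unit] - now
            now = busy_until[unit]
        busy_until[unit] = now + cost / unit_speeds[unit]
    return max(busy_until), stalled


if __name__ == "__main__":
    # One slow unit (speed 1) and two fast units (speed 4).
    print(round_robin_with_stall([4] * 6, [1, 4, 4]))
```

Under these assumptions, six batches of cost 4 distributed to units with relative speeds 1, 4, and 4 finish at time 8, and the distributor spends 4 time units stalled on the slow unit while both fast units sit idle.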
A common solution to this problem is to cause each processing unit 12 to assert a “load” signal to work distribution unit 14 when the processing of a work batch 16 is complete. For example, processing unit 12-1 could assert a load signal 20-1 when the processing of work batch 16-1 is complete. Likewise, processing unit 12-2 could assert a load signal 20-2 when the processing of work batch 16-2 is complete, and processing unit 12-n could assert a load signal 20-n when the processing of work batch 16-n is complete. When a given processing unit 12 has not asserted the load signal 20, work distribution unit 14 skips that processing unit 12 when distributing work batches 16 to the sequence of processing units 12-1 to 12-n. Through this technique, work distribution unit 14 cannot be stalled by a processing unit 12 that has not yet finished processing a batch 16 because such a processing unit 12 is simply skipped.
However, work distribution unit 14 may have to wait for the transmission of load signal 20 to complete. Once transmission of load signal 20 is complete, work distribution unit 14 then requires additional time to transmit a batch 16 to the processing unit 12 that transmitted load signal 20. These latencies correspond to idle cycles on the processors 18, which inhibit the performance of the computer system 10.
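The load-signal technique and its latency cost may likewise be sketched as a small Python model. Everything named here is an assumption made for illustration: the functions `next_ready_unit` and `distribute_with_load_signal`, the `signal_latency` and `transmit_latency` parameters, the choice to treat every unit as ready at time zero, and the simplification that the distributor can transmit to several units at once.

```python
def next_ready_unit(ready, start):
    """Return the first unit at or after `start` (wrapping around) whose
    load signal is asserted, or None if no unit is ready."""
    n = len(ready)
    for offset in range(n):
        unit = (start + offset) % n
        if ready[unit]:
            return unit
    return None


def distribute_with_load_signal(batch_costs, unit_speeds,
                                signal_latency=0.0, transmit_latency=0.0):
    """Model round-robin distribution that skips units that are not ready.

    Returns (finish_time, idle_time), where idle_time sums the gaps
    between a unit finishing one batch and starting its next one
    (for a unit's first batch, the gap is measured from time zero).
    """
    n = len(unit_speeds)
    done_at = [0.0] * n    # when each unit finishes its current batch
    signal_at = [0.0] * n  # when its load signal reaches the distributor
    now, pointer, idle = 0.0, 0, 0.0
    for cost in batch_costs:
        unit = next_ready_unit([t <= now for t in signal_at], pointer)
        if unit is None:
            # No load signal has arrived yet: wait for the earliest one.
            now = min(signal_at)
            unit = next_ready_unit([t <= now for t in signal_at], pointer)
        start = now + transmit_latency  # batch arrives after transmission
        idle += start - done_at[unit]
        done_at[unit] = start + cost / unit_speeds[unit]
        signal_at[unit] = done_at[unit] + signal_latency
        pointer = (unit + 1) % n
    return max(done_at), idle


if __name__ == "__main__":
    print(distribute_with_load_signal([4] * 6, [1, 4, 4]))
    print(distribute_with_load_signal([4] * 6, [1, 4, 4], 1.0, 1.0))
```

In this model, with both latencies set to zero, six batches of cost 4 on units with relative speeds 1, 4, and 4 finish at time 4 with no idle time, because the slow unit is skipped. With one time unit of latency for each of the load signal and the transmission, the same work finishes at time 11 and the processors accumulate 9 idle time units.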
One approach to solving this problem is to include a work FIFO within each processing unit 12. As shown, processing units 12-1, 12-2, and 12-n include work FIFOs 22-1, 22-2, and 22-n, respectively. Each work FIFO 22 stores a plurality of work batches 16 received from work distribution unit 14. When space becomes available within a given FIFO 22, work distribution unit 14 distributes an additional work batch 16 to that FIFO 22. Although this approach may reduce the idle cycles on the processors 18, it allows work batches 16 to complete far out of order, and it may leave one processor 18 with a long queue of lengthy work batches 16 while the computer system 10 is waiting for all processing units 12 to become idle.
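The drawbacks of the work FIFO approach can also be seen in a simplified Python model. The function name `distribute_with_fifos`, the `depth` parameter (assumed to be at least 1), the per-unit `unit_speeds` values, and the assumption that a batch occupies its FIFO slot until it finishes processing are hypothetical choices made for this sketch only.

```python
from collections import deque


def distribute_with_fifos(batch_costs, unit_speeds, depth):
    """Model distribution into per-unit work FIFOs of a fixed depth.

    Returns (last_finish, completion_order): the time at which each unit
    finishes its final batch, and the batch indices in completion order.
    """
    n = len(unit_speeds)
    queues = [deque() for _ in range(n)]  # finish times of batches held per unit
    last_finish = [0.0] * n
    finished = []  # (finish_time, batch_index) pairs
    now, pointer = 0.0, 0
    for i, cost in enumerate(batch_costs):
        while True:
            # Free the FIFO slots of batches that have already completed.
            for q in queues:
                while q and q[0] <= now:
                    q.popleft()
            order = [(pointer + k) % n for k in range(n)]
            unit = next((u for u in order if len(queues[u]) < depth), None)
            if unit is not None:
                break
            # Every FIFO is full: wait until the earliest slot frees up.
            now = min(q[0] for q in queues)
        # A queued batch starts only after the unit's earlier batches finish.
        start = max(now, last_finish[unit])
        finish = start + cost / unit_speeds[unit]
        queues[unit].append(finish)
        last_finish[unit] = finish
        finished.append((finish, i))
        pointer = (unit + 1) % n
    return last_finish, [i for _, i in sorted(finished)]


if __name__ == "__main__":
    print(distribute_with_fifos([4] * 6, [1, 4, 4], depth=2))
```

With six batches of cost 4, relative speeds of 1, 4, and 4, and a FIFO depth of 2, the two fast units drain their FIFOs by time 2 while the slow unit remains busy until time 8. The batches complete in the order 1, 2, 4, 5, 0, 3 rather than in the order they were distributed.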
As the foregoing illustrates, what is needed in the art is a more effective technique for distributing work batches to processing units that have different processing capabilities.