Computer processors function by processing data elements through various registers in accordance with instructions provided by a computer program. The processor executes instructions in the form of machine language, that is, low-level instructions specifying which data elements are processed through which registers. Most software, however, is written in higher-level programming code, such as C++, which has the advantages of being human readable and of embodying relatively complex processing operations in comparatively short, quickly written commands. A compiler receives the high-level programming code and, based upon the programming of the compiler itself, generates the machine language that is readable by a processor.
Workload partitioning focuses on distributing work known to be parallel among the multiple processing elements of a system. Processing elements can be threads executing on a single processor, multiple processors in a single core, multiple processors in different cores, or any combination of the above.
When partitioning computations among processing elements, prior work typically takes into consideration maximum job size, load balancing, and latency hiding. For example, when a memory subsystem works best with a given working set size of K bytes, the partitioning algorithm typically chunks the work into subsets of units, each of which has a working set size smaller than or equal to K bytes. A second consideration is load balancing, where the partitioning algorithm attempts to divide the work among processing elements as evenly as possible. A third consideration is latency hiding, in which the partitioning algorithm splits the work into chunks smaller than the maximum working set size so that several tasks are in flight at a given time on a given processing element. For example, one task in flight may be transmitting its input data to that processing element, either with DMA or prefetching, while a second task in flight may be computing its output data, and a third task in flight may be transmitting its output back to the memory subsystem or to another processing element.
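The three considerations above can be illustrated with a minimal sketch. It is not taken from any particular prior system; the function name `partition`, its parameters, and the round-robin assignment policy are all assumptions made for illustration. The sketch divides the K-byte budget evenly among the tasks in flight (latency hiding), cuts the work into chunks that fit that reduced budget (maximum job size), and deals the chunks out to processing elements in turn (load balancing).

```python
def partition(total_units, bytes_per_unit, k_bytes, num_elements, in_flight=1):
    """Hypothetical partitioner: returns one list of (start, stop) unit
    ranges per processing element. All names here are illustrative."""
    # Latency hiding: several tasks share the K-byte working set, so each
    # in-flight task gets an equal share of the budget.
    budget = k_bytes // in_flight
    units_per_chunk = budget // bytes_per_unit
    if units_per_chunk < 1:
        raise ValueError("a single unit does not fit in the per-task budget")

    # Maximum job size: cut the work into chunks of at most units_per_chunk.
    chunks = [(start, min(start + units_per_chunk, total_units))
              for start in range(0, total_units, units_per_chunk)]

    # Load balancing: deal chunks round-robin, so the number of chunks per
    # element differs by at most one.
    return [chunks[i::num_elements] for i in range(num_elements)]


# Example: 100 units of 64 bytes, K = 1024 bytes, 4 elements, 2 tasks in flight.
plan = partition(100, 64, 1024, 4, in_flight=2)
```

Round-robin assignment balances chunk counts rather than exact work, since the final chunk may be shorter than the others; a scheme that accounts for per-element characteristics would need a different assignment step.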
Workload partitioning is generally well understood for homogeneous systems, where all the processing elements have similar characteristics, and for heterogeneous systems where similar processing elements are clustered at various levels in a clustered architecture. However, a new class of processing elements, such as those present in the Cell Broadband Engine (CBE), introduces a new set of heterogeneous constraints. These constraints can significantly impact overall system performance when they are not taken into account while partitioning the work among processing elements in the parallel system.