1. Field of the Invention
The present invention relates to the field of workload partitioning in a parallel system with processing units having alignment constraints and more particularly to parallelizing loops for execution in a target computing architecture implementing a shared memory model.
2. Description of the Related Art
Workload partitioning focuses on distributing work known to be parallel among the multiple processing elements of a computing system. Processing elements can include threads executing on a single processor, multiple processors in a single core, multiple processors in different cores, or any combination thereof. When partitioning work among processing elements, maximum job size, load balancing, and latency hiding are the primary considerations.
In this regard, when a memory subsystem is known to function optimally with a given working set size, a partitioning algorithm targeting that memory subsystem will chunk work into subsets of units, each of which has a working set size smaller than or equal to the given working set size. Also, with respect to load balancing, a typical partitioning algorithm attempts to partition work among the different processing elements as evenly as possible. Finally, with respect to latency hiding, a typical partitioning algorithm aims to split work into chunks smaller than an established maximum working set size so that several tasks may be “in flight” concurrently for a given set of processing elements.
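By way of illustration only, the three considerations above can be sketched as a simple chunking routine. The function name partition, its parameters, and the round-robin assignment of chunks to processing elements are assumptions made for this sketch and are not taken from any particular system.

```python
def partition(n_items, n_elements, max_chunk, in_flight=2):
    """Split the index range [0, n_items) into (start, end, element) chunks.

    All names here are hypothetical. The sketch assumes n_elements >= 1
    and max_chunk >= 1.
    """
    if n_items <= 0:
        return []
    # Maximum job size: enough chunks that none exceeds max_chunk items.
    by_size = (n_items + max_chunk - 1) // max_chunk
    # Latency hiding: aim for several chunks "in flight" per element.
    by_latency = n_elements * in_flight
    n_chunks = max(by_size, by_latency)
    # Load balancing: give every element the same number of chunks.
    n_chunks = ((n_chunks + n_elements - 1) // n_elements) * n_elements
    # Never create empty chunks when there is very little work.
    n_chunks = min(n_chunks, n_items)

    base, extra = divmod(n_items, n_chunks)
    chunks = []
    start = 0
    for i in range(n_chunks):
        # Chunk sizes differ by at most one item.
        size = base + (1 if i < extra else 0)
        chunks.append((start, start + size, i % n_elements))
        start += size
    return chunks
```

For example, partition(1000, 4, 100) yields contiguous chunks of at most 100 items each, dealt out in turn to four processing elements. Note that this sketch pays no attention to the memory alignment of chunk boundaries.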
Workload partitioning is generally well understood for homogeneous systems where all the processing elements have similar characteristics. For heterogeneous systems, however, workload partitioning is less well understood. Specifically, in the context of heterogeneous systems, Cell Broadband Engine (CBE) technologies have introduced a new class of processing element demonstrating a new set of heterogeneous constraints. These constraints can significantly impact overall system performance when not taken into account while partitioning work among the processing elements of the CBE.
As is well known, the CBE is a heterogeneous multicore processor chip that incorporates Synergistic Processor Elements (SPEs) as high-performance processing cores and a PowerPC Processor Element (PPE) as a general-purpose processor core, connected to a high-speed input/output (I/O) and a high-speed memory system by a high-bandwidth internal bus referred to as the Element Interconnect Bus (EIB). Of note, the CBE enjoys a scalable architecture that is optimized for parallel and distributed broadband computing environments.
The SPE processing units of the CBE incorporate a SIMD architecture with alignment constraints. That is, when a SIMD memory operation is performed in the SPE, the lower four bits of the address are silently discarded and the SIMD units load or store sixteen bytes of data at the truncated address. The silent discarding of the lower four bits of the address has significant consequences when parallelizing code among SIMD-only SPEs, as this truncation of the address and the mandatory sixteen-byte reading and writing of memory can generate false sharing conditions at the boundaries of tasks between two SPEs. More precisely, unless this behavior is properly accounted for, each SPE must take additional steps to track which values within a sixteen-byte boundary are written so that, when a particular SPE is responsible for generating only a subset of these sixteen bytes, it does not accidentally overwrite the values for which the particular SPE is not responsible.
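The address truncation and the resulting false sharing can be modeled with a short sketch. The sketch is a software model only and uses no actual SPE instructions; the names truncated and shared_quadwords, and the example base address and element size, are assumptions chosen for illustration.

```python
QUADWORD = 16  # bytes moved by every SIMD load or store in this model


def truncated(addr):
    """Address actually accessed: the lower four bits are discarded."""
    return addr & ~0xF


def shared_quadwords(base, elem_size, boundaries):
    """Return the truncated addresses of quadwords split between two tasks.

    base       -- byte address of element 0 (hypothetical)
    elem_size  -- size of one array element in bytes
    boundaries -- element indices at which one task ends and the next begins
    """
    shared = []
    for index in boundaries:
        first_byte_of_next_task = base + index * elem_size
        # If the boundary is not sixteen-byte aligned, the quadword that
        # holds it also holds bytes owned by the preceding task.
        if first_byte_of_next_task != truncated(first_byte_of_next_task):
            shared.append(truncated(first_byte_of_next_task))
    return shared
```

Assuming an array of four-byte elements at address 0x1000, splitting the work at element 50 places the boundary at 0x10C8, so the quadword at 0x10C0 is written by both neighboring tasks. Splitting at element 48 places the boundary at 0x10C0, which is aligned, and no quadword is shared.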
To adapt program code for operation in connection with a CBE, a process of SIMD vectorization must be applied to the program code at compile time. SIMD vectorization, also known as “simdization”, is performed by the compiler producing code for the CBE. Successful simdization involves several tasks that closely interact with each other. First, SIMD parallelism must be extracted from the program code, which may involve extraction from a basic block, a loop, or a combination of both. Once that parallelism is extracted, various additional SIMD hardware constraints must be satisfied, such as memory alignment requirements, physical vector length, and the hardware instruction set.
The alignment constraint of SIMD memory units is a hardware feature that can significantly impact the effectiveness of simdization for a computing architecture. For example, memory operations in one popular type of SIMD memory unit can only access sixteen contiguous bytes of memory at sixteen-byte aligned addresses. As such, in order to satisfy the alignment constraints imposed by the hardware, data reorganization code must be inserted to explicitly realign data during the simdization process. Additional alignment handling overhead may be added if the alignments of some memory accesses in the code are known only at runtime. It is to be recognized, however, that the overhead associated with simdization can severely impact performance when the chunk size is not carefully chosen, since a poorly chosen chunk size causes additional alignment handling overhead.
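The kind of data reorganization described above can be illustrated by a software model in which a misaligned sixteen-byte access is assembled from two aligned accesses. This is a minimal sketch under stated assumptions: memory is modeled as a byte string whose index 0 is sixteen-byte aligned, and the names aligned_load and unaligned_load are hypothetical rather than actual hardware operations.

```python
def aligned_load(memory, addr):
    """Model of a SIMD load: sixteen bytes from the truncated address."""
    start = addr & ~0xF
    return memory[start:start + 16]


def unaligned_load(memory, addr):
    """Assemble sixteen bytes starting at addr from aligned loads only."""
    offset = addr & 0xF          # misalignment, possibly known only at runtime
    low = aligned_load(memory, addr)
    if offset == 0:
        return low               # already aligned: one load, no reorganization
    high = aligned_load(memory, addr + 16)
    # Reorganization step: select the wanted bytes from the two quadwords.
    return (low + high)[offset:offset + 16]
```

In this model an aligned access costs one load, while a misaligned access costs two loads plus a byte selection, which is the additional alignment handling overhead referred to above.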