Current programming models for general-purpose computation on graphics processing units (GPGPU) define a grouping concept that leads to poorly scoped local/shared memory regions and an inconvenient approach to projecting complicated iteration spaces. These models have inherited CUDA™'s group-based programming model, which stems largely from the way graphics architectures were designed. They explicitly launch groups of “work items” (sometimes known as “threads”) that define a fairly static mapping to the architecture. The work items within a single group can communicate with each other using synchronization primitives and memory structures that are scoped to logical group constructs.
Mapping complex iteration spaces directly to this sort of grouped launch imposes a loss of information. For example, a mapping of a 15×15×15 region onto 64 work items can work well on specific hardware, but remapping that region onto a different number of work items means re-inferring certain inter-work-item communication, which is likely to be harder than the initial work of parallelizing the loop. In addition, projections like this are not clean for the programmer, and the scoping of local memory is difficult to understand because it is not visible in any sort of lexical scope.
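The loss of information can be illustrated with a minimal host-side sketch that emulates the index arithmetic of a grouped launch. All names here (`kDim`, `group_count`, `to_coord`, `owning_group`) are hypothetical and chosen for illustration, and the x-fastest flattening order is an assumption; an actual GPGPU kernel would obtain the group and local indices from the runtime.

```cpp
#include <cstddef>

// Hypothetical names throughout; this only emulates the index arithmetic
// of a grouped launch on the host, it does not launch anything.
constexpr std::size_t kDim = 15;                     // 15 x 15 x 15 region
constexpr std::size_t kPoints = kDim * kDim * kDim;  // 3375 points in total

// Number of groups needed to cover `points` work items, rounded up.
constexpr std::size_t group_count(std::size_t points, std::size_t group_size) {
    return (points + group_size - 1) / group_size;
}

struct Coord { std::size_t x, y, z; };

// Recover the 3-D coordinate from a (group, local) pair, assuming an
// x-fastest flattening. Returns false for padding work items that fall
// outside the region.
inline bool to_coord(std::size_t group, std::size_t local,
                     std::size_t group_size, Coord& out) {
    const std::size_t flat = group * group_size + local;
    if (flat >= kPoints) return false;
    out.x = flat % kDim;
    out.y = (flat / kDim) % kDim;
    out.z = flat / (kDim * kDim);
    return true;
}

// Group that owns a given point under the same flattening.
inline std::size_t owning_group(const Coord& c, std::size_t group_size) {
    return (c.x + kDim * (c.y + kDim * c.z)) / group_size;
}
```

Under these assumptions, a group size of 64 needs 53 groups, 17 of whose work items are padding. The points (3, 4, 0) and (3, 4, 1) are neighbors in the region but land in groups 0 and 4; with a group size of 32 they land in groups 1 and 9. Which neighbors share a group, and can therefore communicate through group-scoped memory, has to be worked out again for each group size.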
Current programming models for vector and multi-core architectures, which mostly use standard C programming, rely on a largely compiler-driven approach in which the compiler infers all of the parallelism for the loops, both the vectorization and the thread parallelism.
OpenMP® gives a programmer access to thread parallelism, but expects the compiler to infer vector parallelism from it. There is little in an OpenMP® loop that identifies a portion of the loop as vectorizable, and little control over the loop nest to indicate which portions should be vectorized and which should run under a particular thread. For efficiency, a programmer may want this level of control, for example when iterating over a fixed-size block of data.
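As a minimal sketch of the fixed-size block case, consider the loop nest below. The function name `scale_blocks` and the block size `kBlock` are hypothetical, and the input length is assumed to be a multiple of the block size. The directive distributes the outer loop over threads; nothing in the nest marks the inner fixed-size loop as the part intended for vector lanes.

```cpp
#include <cstddef>
#include <vector>

constexpr std::size_t kBlock = 8;  // assumed fixed block size (illustrative)

// Scales every element in place. data.size() is assumed to be a
// multiple of kBlock; any trailing partial block is ignored.
inline void scale_blocks(std::vector<float>& data, float factor) {
    const long blocks = static_cast<long>(data.size() / kBlock);

    // Thread parallelism is stated explicitly for the outer loop.
    // The pragma is ignored when the compiler is not run with OpenMP enabled.
    #pragma omp parallel for
    for (long b = 0; b < blocks; ++b) {
        float* block = data.data() + static_cast<std::size_t>(b) * kBlock;

        // The inner loop carries no annotation: whether it is vectorized
        // is left for the compiler to infer.
        for (std::size_t i = 0; i < kBlock; ++i) {
            block[i] *= factor;
        }
    }
}
```

The result is the same with or without OpenMP enabled; only the distribution of the outer iterations over threads changes.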
Intel®'s SPMD Program Compiler (ISPC), a Single Program, Multiple Data (SPMD) compiler, tries to map vector programming onto the CPU and solves some of the cleanliness problems of OpenCL™ and similar models. ISPC has the notion of an SPMD entry point, within which the programmer may write code that is explicitly SPMD. Alternatively, the programmer may write “for each” blocks, which describe code that can be run in parallel but need not be run in threads, indicating only that the code should be run in parallel in some way.
OpenACC® exposes these layers by providing hints to the compiler, much like an extension to OpenMP®. The hints may ask the compiler to parallelize a given loop or to map a loop to vectors. They act as a sequence of tuning parameters rather than as strict control over the layers. OpenACC® still relies on the compiler to do the work, and the compiler may fail to vectorize the code in the desired way.
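A minimal sketch of such hints, assuming a row-major matrix and a hypothetical function `row_sums`, might look as follows. The `gang` and `vector` clauses suggest how each loop level should be mapped; how closely the generated code follows them is up to the compiler.

```cpp
// Computes y[r] = sum of row r of a rows x cols row-major matrix x.
// The function name and data layout are assumptions for illustration.
inline void row_sums(const float* x, float* y, int rows, int cols) {
    // Hint: spread the rows across gangs. The pragma is ignored by
    // compilers that are not run with OpenACC enabled.
    #pragma acc parallel loop gang copyin(x[0:rows*cols]) copyout(y[0:rows])
    for (int r = 0; r < rows; ++r) {
        float sum = 0.0f;

        // Hint: map the columns of one row onto vector lanes.
        #pragma acc loop vector reduction(+:sum)
        for (int c = 0; c < cols; ++c) {
            sum += x[r * cols + c];
        }
        y[r] = sum;
    }
}
```

Compiled without OpenACC support, the function runs as an ordinary sequential loop nest and computes the same sums.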