Contemporary commodity graphics hardware provides considerable raw processing power at a moderate cost. However, programming the graphics processing unit (GPU) for general-purpose computation is relatively difficult. In part this is because GPUs are stream processors, and existing general-purpose GPU programming languages are therefore based on the stream processing model; such languages include Brook, Sh, and NVIDIA Corporation's CUDA.
Stream processing is a data-centric model in which data is organized into homogeneous streams of elements. Individual functions called kernels are applied to all elements of the input streams in parallel to yield output streams. Complex computation is achieved by launching multiple kernels on multiple streams. This stream/kernel abstraction explicitly exposes the underlying data dependencies.
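The stream/kernel abstraction described above can be illustrated with a minimal sketch. It is a sequential Python model, not GPU code, and the names `launch`, `scale`, and `add` are hypothetical helpers introduced only for illustration. Each call to `launch` stands in for one kernel launch, in which the kernel is applied independently at every element position of its input streams.

```python
def launch(kernel, *streams):
    """Model one kernel launch: apply `kernel` at every element position.

    On a real GPU the positions would be processed in parallel; this
    sequential loop only models the data dependencies.
    """
    n = len(streams[0])
    if any(len(s) != n for s in streams):
        raise ValueError("all input streams must have the same length")
    return [kernel(*(s[i] for s in streams)) for i in range(n)]


# Two hypothetical kernels, each a per-element function.
def scale(x):
    return 2.0 * x


def add(a, b):
    return a + b


xs = [1.0, 2.0, 3.0]
ys = [10.0, 20.0, 30.0]

# A two-step computation needs two launches and an intermediate stream.
scaled = launch(scale, xs)         # intermediate stream
result = launch(add, scaled, ys)   # output stream
print(result)
```

Even in this small case the intermediate stream `scaled` must be created and passed along explicitly, which is the kind of dataflow bookkeeping that grows with program complexity.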
However, while supplying high performance, the stream processing model makes general-purpose GPU programming difficult for several reasons. For one, program readability and maintainability suffer because programs are partitioned into kernels according to data dependencies rather than functionality. Adding new functionality to an existing program usually involves rewriting many otherwise unrelated parts of the code. For another, the dataflow in a complex application is rarely related to the underlying program logic, because of the extensive use of intermediate or temporary streams. Explicit dataflow management is therefore tedious and error-prone.
Yet another reason is that parallel primitives are difficult to abstract, which hinders code reuse. More particularly, many parallel primitives, such as scan and sort, require multiple kernel launches. When such a primitive is called by a kernel, part of the primitive needs to be bundled with the caller to reduce intermediate stream size and kernel launch cost. The result is a primitive whose integrity is broken, which makes it difficult to package the primitive as a reusable abstraction.
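To make the multiple-launch requirement concrete, the following sketch models an inclusive scan (prefix sum) as a sequence of kernel launches. It uses the well-known Hillis-Steele doubling scheme purely as an illustrative choice; the text above does not specify any particular scan algorithm. The code is a sequential Python model rather than GPU code, and `scan_step` and `inclusive_scan` are hypothetical names.

```python
def scan_step(src, offset):
    """Model one kernel launch of a doubling scan.

    Each output element reads only from the input stream `src`, so all
    positions could be computed in parallel.
    """
    return [
        src[i] + src[i - offset] if i >= offset else src[i]
        for i in range(len(src))
    ]


def inclusive_scan(stream):
    """Inclusive prefix sum built from repeated launches of `scan_step`.

    Returns the scanned stream and the number of launches performed.
    """
    data = list(stream)
    launches = 0
    offset = 1
    while offset < len(data):
        data = scan_step(data, offset)  # new intermediate stream per launch
        launches += 1
        offset *= 2
    return data, launches


prefix, launches = inclusive_scan([3, 1, 4, 1, 5, 9, 2, 6])
print(prefix, launches)
```

For eight elements this model performs three launches, each producing a full-size intermediate stream. A caller that wants to avoid that cost has to merge its own per-element work into the first or last step, which is how the primitive ends up entangled with its caller.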
Because of the above problems, it is extremely difficult to write even moderately complex general-purpose programs using today's GPU programming languages. Any improvements to GPU programmability are thus highly desirable.