Field of the Invention
Embodiments of the present invention generally relate to parallel processing and, more specifically, to an application programming interface (API) to enable the construction of pipeline parallel programs.
Description of the Related Art
APIs for GPUs fall into two major categories: traditional raster graphics APIs, such as Open Graphics Library (OpenGL) and Direct3D, and compute APIs, such as C with Compute Unified Device Architecture (CUDA) extensions, Open Computing Language (OpenCL), DirectCompute, and C++ Accelerated Massive Parallelism (AMP).
Traditional raster graphics APIs provide a high-level hardware abstraction that exposes a graphics pipeline to developers. The graphics pipeline includes a predetermined sequence of stages that can be executed in parallel to process a collection of graphics primitives that flow through the pipeline. Modern raster graphics APIs allow developers to add a variety of state settings and programmable shader stages to the graphics pipeline, thereby increasing the flexibility with which the pipeline may be programmed. However, the overall structure of the pipeline is essentially fixed and subject to a wide variety of other limitations. For example, a given stage may be capable of executing only certain types of algorithms, and not others. Consequently, traditional raster graphics APIs may only be relied upon for generating a limited variety of graphics processing pipelines.
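By way of illustration only, the fixed structure described above can be modeled with a short sketch. The stage names, the FixedPipeline class, and the set_shader method are hypothetical simplifications introduced here; they do not correspond to any actual OpenGL or Direct3D interface, and primitives are processed one at a time rather than in parallel.

```python
# Hypothetical, simplified stage list; real raster pipelines have more stages.
FIXED_STAGE_ORDER = ("vertex", "rasterize", "fragment")
PROGRAMMABLE_STAGES = {"vertex", "fragment"}  # "rasterize" is fixed-function here


def passthrough(primitive):
    """Default behavior for a stage with no developer-supplied shader."""
    return primitive


class FixedPipeline:
    """Toy model: stage order is predetermined; only some slots accept shaders."""

    def __init__(self):
        self.stages = {name: passthrough for name in FIXED_STAGE_ORDER}

    def set_shader(self, stage_name, shader):
        # The developer may fill a programmable slot but cannot add,
        # remove, or reorder stages.
        if stage_name not in PROGRAMMABLE_STAGES:
            raise ValueError(f"stage {stage_name!r} is not programmable")
        self.stages[stage_name] = shader

    def run(self, primitives):
        results = []
        for primitive in primitives:
            for name in FIXED_STAGE_ORDER:  # always the same sequence
                primitive = self.stages[name](primitive)
            results.append(primitive)
        return results


if __name__ == "__main__":
    pipeline = FixedPipeline()
    pipeline.set_shader("vertex", lambda v: v * 2)
    print(pipeline.run([1, 2, 3]))
```

In this sketch, an attempt to install a stage outside the predetermined set is rejected, which mirrors the limitation that the overall pipeline structure cannot be altered by the developer.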
Traditional compute APIs provide a different abstraction, exposing only a subset of the GPU hardware exposed by raster graphics APIs. With compute APIs, only the programmable data-parallel cores that execute the shader stages of the graphics pipeline mentioned above are available for programming. However, these parallel cores are exposed via a highly constrained execution model referred to as a “bulk synchronous” execution model. In this type of model, a sequence of kernels may be executed, but between kernels, any shared (on-chip) memory or caches used by the previous kernel must be flushed. This constraint often results in redundant memory load operations when successive kernels require similar memory resources. Additionally, with the bulk synchronous execution model, all of the threads associated with a given kernel must complete before a subsequent kernel may be executed, which presents an additional constraint to developers. Lastly, compute APIs typically provide few synchronization mechanisms, which makes the development of certain types of programs infeasible.
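The bulk synchronous constraints described above can be illustrated with a minimal CPU-side sketch. It is a toy model under stated assumptions, not an implementation of any compute API: the GlobalMemory class, the launch function, and both kernels are hypothetical names introduced here, and the threads of a kernel run sequentially for simplicity.

```python
class GlobalMemory:
    """Hypothetical off-chip memory that counts every load operation."""

    def __init__(self, data):
        self.data = dict(data)
        self.loads = 0

    def load(self, key):
        self.loads += 1
        return self.data[key]

    def store(self, key, value):
        self.data[key] = value


def launch(kernel, num_threads, gmem):
    """Run one kernel to completion, then flush its on-chip state."""
    shared = {}  # stands in for shared (on-chip) memory
    for tid in range(num_threads):  # sequential stand-in for parallel threads
        kernel(tid, shared, gmem)
    # Reaching this point models the barrier: every thread has completed.
    shared.clear()  # on-chip contents do not survive into the next kernel


def scale_kernel(tid, shared, gmem):
    if "factor" not in shared:
        shared["factor"] = gmem.load("factor")  # cached on chip by first thread
    gmem.store(("x", tid), gmem.load(("x", tid)) * shared["factor"])


def offset_kernel(tid, shared, gmem):
    if "factor" not in shared:
        shared["factor"] = gmem.load("factor")  # redundant reload after flush
    gmem.store(("x", tid), gmem.load(("x", tid)) + shared["factor"])


def run_demo():
    gmem = GlobalMemory(
        {"factor": 3, ("x", 0): 1, ("x", 1): 2, ("x", 2): 3, ("x", 3): 4}
    )
    launch(scale_kernel, 4, gmem)   # first kernel must fully complete ...
    launch(offset_kernel, 4, gmem)  # ... before the second kernel may begin
    return [gmem.data[("x", i)] for i in range(4)], gmem.loads


if __name__ == "__main__":
    print(run_demo())
```

In this sketch, both kernels need the same value, yet the second kernel must fetch it again from off-chip memory because the on-chip copy was discarded at the kernel boundary. Intermediate results likewise pass between kernels only through off-chip memory.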
In recent years, various attempts have been made to program GPUs to implement processing pipelines of arbitrary structure and complexity. Such a pipeline would, ideally, be capable of exhibiting braided parallelism with ordering requirements. However, traditional raster graphics APIs are poorly suited for such functionality due to the inflexibilities of those APIs discussed above. Likewise, traditional compute APIs provide only limited access to the GPU hardware, and that access is far too constrained to provide a reasonable framework from which to implement a processing pipeline of arbitrary structure and complexity.
As the foregoing illustrates, what is needed in the art is an API for programming GPUs that allows developers to generate processing pipelines having greater complexity and architectural variation relative to graphics processing pipelines generated using conventional APIs.