1. Field of the Invention
The present invention relates generally to workitem synchronization.
2. Background Art
Graphics processing units (GPU) generally comprise multiple processing elements that are ideally suited for executing the same instruction on parallel data streams, as in the case of a single instruction multiple data (SIMD) device or in data-parallel processing. In many computing models a central processing unit (CPU) functions as the host or controlling processor and hands-off specialized functions, such as graphics processing, to other processors such as GPUs.
Multi-core CPUs, where each CPU has multiple processing cores, offer processing capabilities for specialized functions (e.g., graphics processing) similar to those available on a GPU. One or more of the computation cores of multi-core CPUs or GPUs can be part of the same die (e.g., AMD Fusion™) or, alternatively, in different dies (e.g., Intel Xeon™ with NVIDIA GPU). Recently, hybrid cores having characteristics of both CPU and GPU (e.g., AMD Accelerated Processing Units (APUs), CellSPE™, Intel Larrabee™) have been proposed for general purpose GPU (GPGPU) style computing. The GPGPU style of computing advocates using the CPU to primarily execute control code and to offload performance critical data-parallel code to the GPU. The GPU is primarily used as an accelerator. The combination of multi-core CPUs and GPGPU computing model encompasses both CPU cores and GPU cores as accelerator targets.
Several frameworks have been developed for heterogeneous computing platforms that have CPUs and GPUs. These frameworks include BrookGPU by Stanford University, the compute unified device architecture (CUDA) by NVIDIA, and OpenCL by an industry consortium named Khronos Group. The OpenCL framework offers a C-like development environment which users can create applications for the GPU. OpenCL enables the user, for example, to specify instructions for offloading some computations, such as data-parallel computations, to a GPU. OpenCL also provides a compiler and a runtime environment in which code can be compiled and executed within a heterogeneous computing system.
The computing model embodied by OpenCL, CUDA and many low level GPU intermediate languages is sometimes referred to as a single instruction multiple thread (“SIMT”) processing. In a frequently used implementation of the SIMT model, SIMD execution using hardware mask sets on vectors is used to simulate threading to a finer grain than what is available in the hardware.
In order to efficiently utilize computing models where the CPU and the GPU can both be used for the execution of many types of code, more flexible thread synchronization models are needed. The conventional workitem synchronization available in OpenCL, for example, may be less efficient when used with workitems that are not traditional graphics processing tasks. For example, whereas respective concurrent workitems in a graphics processing task may not often synchronize after diverging, in CPU associated applications, synchronization may be relatively higher. The synchronization model provided in OpenCL, by itself, is inadequate to handle such dynamic behavior of workitems.
In OpenCL, workitems within a workgroup can be synchronized. Within a workgroup, a barrier instruction can be issued with the semantics that all workitems in the workgroup must reach the barrier before any can proceed past the barrier. A “barrier” blocks all processes that reach it until the semantics noted above is satisfied. It then releases the blocked processes to resume their respective processing. However, in OpenCL and other similar conventional frameworks, barrier use in control flow is heavily restricted. In particular, when the control flow diverges between workitems of the same workgroup, synchronization between the workitems may fail.
For example, calling a library function that includes a barrier instruction from a kernel having a conditional, in which one of the conditionals does not have a call to the function, may lead to deadlock. This is because, while the barrier would release only when all workitems of a group have reached it, a workitem for which a condition is not fulfilled, would not reach the barrier at all. These restrictions limit the ability of the system and the programmer to optimally utilize processing resources.