Graphics processing units (GPUs) generally comprise multiple processing elements that are ideally suited for executing the same instruction on parallel data streams, as in the case of a single instruction multiple data (SIMD) device or in data-parallel processing. In many computing models, a central processing unit (CPU) functions as the host or controlling processor and hands off specialized functions, such as graphics processing, to other processors such as GPUs.
Multi-core CPUs, where each CPU has multiple processing cores, offer processing capabilities for specialized functions (e.g., graphics processing) similar to those available on a GPU. One or more of the computation cores of multi-core CPUs or GPUs can be part of the same die (e.g., AMD Fusion™) or, alternatively, can be located on different dies (e.g., Intel Xeon™ with NVIDIA GPU). Recently, hybrid cores having characteristics of both CPU and GPU (e.g., AMD Accelerated Processing Units (APUs), CellSPE™, Intel Larrabee™) have been proposed for general-purpose GPU (GPGPU) style computing. The GPGPU style of computing advocates using the CPU primarily to execute control code and offloading performance-critical data-parallel code to the GPU. The GPU is primarily used as an accelerator. The combination of multi-core CPUs and the GPGPU computing model encompasses both CPU cores and GPU cores as accelerator targets.
Several frameworks have been developed for heterogeneous computing platforms that have CPUs and GPUs. These frameworks include BrookGPU by Stanford University, the compute unified device architecture (CUDA) by NVIDIA, and OpenCL by the Khronos Group, an industry consortium. The OpenCL framework offers a C-like development environment in which users can create applications for the GPU. OpenCL enables the user, for example, to specify instructions for offloading some computations, such as data-parallel computations, to a GPU. OpenCL also provides a compiler and a runtime environment in which code can be compiled and executed within a heterogeneous computing system.
The computing model embodied by OpenCL, CUDA, and many low-level GPU intermediate languages is sometimes referred to as single instruction multiple thread (“SIMT”) processing or single program multiple data on SIMD (“SPMD-on-SIMD”). In a frequently used implementation of the SIMT model, SIMD execution using hardware mask sets on vectors is used to simulate threading at a finer grain than what is available in the hardware. In the SIMT model, what are presented to the user as threads are implemented using SIMD vectors. A “SIMD vector,” as used herein, refers to a set of workitems executed on a processor or processing element that is capable of SIMD execution. More specifically, each lane of a SIMD vector is presented to the user as a separate thread. However, the entire SIMD vector can execute only one instruction at a time. Therefore, if two of the simulated threads corresponding to two lanes in a SIMD vector have diverging control flow instructions, only one of these control flow paths can be executed by the hardware at a time.
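The masked execution described above can be illustrated with a small software model. The following is a minimal sketch only, not any vendor's hardware behavior: the function name `simt_if_else` and its parameters are hypothetical, and a Python list stands in for a SIMD vector with one element per lane. It shows that when lanes diverge at a branch, the two control flow paths are issued one after the other, each under its own mask.

```python
def simt_if_else(values, predicate, then_op, else_op):
    """Model one SIMD vector executing an if/else under an execution mask.

    Hypothetical helper for illustration: each list element is one lane,
    i.e. one "thread" as presented to the user.
    """
    lanes = list(values)
    # One mask bit per lane: True means the lane takes the "then" path.
    mask = [bool(predicate(v)) for v in lanes]
    passes = 0

    # Pass 1: the "then" path is issued for the whole vector, but only
    # lanes whose mask bit is set commit a result.
    if any(mask):
        lanes = [then_op(v) if m else v for v, m in zip(lanes, mask)]
        passes += 1

    # Pass 2: the mask is inverted and the "else" path is issued.
    # This pass is skipped only when no lane needs it.
    if not all(mask):
        lanes = [v if m else else_op(v) for v, m in zip(lanes, mask)]
        passes += 1

    return lanes, passes


def is_even(v):
    return v % 2 == 0


# Divergent vector: even lanes take one path, odd lanes the other.
diverged, diverged_passes = simt_if_else(
    [1, 2, 3, 4], is_even, lambda v: v * 10, lambda v: v + 1)

# Uniform vector: every lane takes the same path.
uniform, uniform_passes = simt_if_else(
    [2, 4, 6, 8], is_even, lambda v: v * 10, lambda v: v + 1)
```

In this model the divergent vector needs two passes while the uniform vector needs one, which is the serialization cost the paragraph above attributes to diverging control flow.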
In order to efficiently utilize computing models where the CPU and the GPU can both be used for the execution of many types of code, more flexible thread synchronization models are needed. The conventional workitem synchronization available in OpenCL, for example, may be less efficient when used with workitems that are not part of traditional graphics processing tasks. For example, whereas respective concurrent workitems in a graphics processing task may not often synchronize after diverging, in CPU-associated applications the frequency of synchronization may be relatively higher. The synchronization model provided in OpenCL, by itself, is inadequate to handle such dynamic behavior of workitems.
In OpenCL, workitems within a workgroup can be synchronized. Within a workgroup, a barrier instruction can be issued with the semantics that all workitems in the workgroup must reach the barrier before any can proceed past the barrier. A “barrier” blocks all processes that reach it until the condition noted above is satisfied. It then releases the blocked processes to resume their respective processing. However, in OpenCL and other similar conventional frameworks, barrier use in control flow is heavily restricted. In particular, when the control flow diverges between workitems of the same workgroup, synchronization between the workitems may fail. Moreover, when barriers are used, each workitem that is blocked at the barrier imposes processing overhead, because barriers are generally implemented as active spin locks.
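The barrier semantics and the divergence problem described above can be sketched with ordinary host threads. This is an illustrative model under stated assumptions, not OpenCL code: the helper `run_workgroup` and its parameters are hypothetical, each Python thread stands in for one workitem, and `threading.Barrier` blocks waiting threads rather than spinning, so the spin-lock overhead mentioned above is not modeled. A timeout is used so that the divergent case terminates instead of hanging.

```python
import threading


def run_workgroup(size, reaches_barrier, timeout=0.3):
    """Run `size` workitems; workitem i waits at the barrier only if
    reaches_barrier(i) is true. Returns an ordered event log.

    Hypothetical helper for illustration only.
    """
    barrier = threading.Barrier(size)  # releases once `size` waiters arrive
    log = []
    log_lock = threading.Lock()

    def record(event, i):
        with log_lock:
            log.append((event, i))

    def workitem(i):
        record("before", i)
        if reaches_barrier(i):
            try:
                barrier.wait(timeout=timeout)
            except threading.BrokenBarrierError:
                # Not every workitem arrived, so the barrier never released.
                record("stalled", i)
                return
        record("after", i)

    threads = [threading.Thread(target=workitem, args=(i,))
               for i in range(size)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return log


# Uniform control flow: every workitem reaches the barrier.
uniform_log = run_workgroup(4, lambda i: True)

# Divergent control flow: odd-numbered workitems skip the barrier, so the
# even-numbered ones wait for arrivals that never happen.
divergent_log = run_workgroup(4, lambda i: i % 2 == 0)
```

In the uniform run, every "before" event is logged ahead of any "after" event, which is the guarantee a barrier provides. In the divergent run, the workitems that reached the barrier are recorded as stalled, mirroring the synchronization failure described above.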
Thus, restrictions associated with many of the current synchronization techniques limit the ability of the system and the programmer to optimally utilize processing resources.