1. Field of the Invention
The present invention relates generally to workitem synchronization.
2. Background Art
Graphics processing units (GPUs) generally comprise multiple processing elements that are ideally suited for executing the same instruction on parallel data streams, as in the case of a single instruction multiple data (SIMD) device, or in data-parallel processing. In many computing models, a central processing unit (CPU) functions as the host or controlling processor and hands off specialized functions, such as graphics processing, to other processors such as GPUs.
Multi-core CPUs, where each CPU has multiple processing cores, offer processing capabilities for specialized functions (e.g., graphics processing) similar to those available on the GPU. One or more of the computation cores of multi-core CPUs or GPUs can be part of the same die (e.g., AMD Fusion™) or, alternatively, in different dies (e.g., Intel Xeon™ with NVIDIA GPU). Recently, hybrid cores having characteristics of both CPU and GPU (e.g., CellSPE™, Intel Larrabee™) have been proposed for general purpose GPU (GPGPU) style computing. The GPGPU style of computing advocates using the CPU primarily to execute control code and to offload performance-critical data-parallel code to the GPU, so that the GPU is used primarily as an accelerator. The combination of multi-core CPUs and the GPGPU computing model encompasses both CPU cores and GPU cores as accelerator targets. Many multi-core CPU cores have performance that is comparable to that of GPUs in many areas.
Several frameworks have been developed for heterogeneous computing platforms that have CPUs and GPUs. These frameworks include BrookGPU by Stanford University, the compute unified device architecture (CUDA) by NVIDIA, and OpenCL by an industry consortium named Khronos Group. The OpenCL framework offers a C-like development environment in which users can create applications for the GPU. OpenCL enables the user, for example, to specify instructions for offloading some computations, such as data-parallel computations, to a GPU. OpenCL also provides a compiler and a runtime environment in which code can be compiled and executed within a heterogeneous, or other, computing system.
The computing model embodied by OpenCL, CUDA, and many low-level GPU intermediate languages is sometimes known as single instruction multiple thread (“SIMT”) processing. In a frequently used implementation of the SIMT model, SIMD execution using hardware mask sets on vectors is used to simulate threading at a finer grain than what is available in the hardware.
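By way of illustration only, the masked SIMD execution described above can be sketched in software. The sketch below is not taken from any of the named frameworks; the function name `simt_select` and its parameters are hypothetical. It assumes a vector of lanes on which both sides of a branch are issued in turn, with a per-lane mask deciding which lanes commit a result on each pass.

```python
def simt_select(lanes, predicate, then_op, else_op):
    """Emulate a divergent branch on a SIMD vector using an execution mask.

    Hypothetical helper for illustration; real hardware applies the mask
    in the vector unit rather than in a scalar loop.
    """
    # One mask bit per lane: True where the branch condition holds.
    mask = [predicate(x) for x in lanes]
    result = list(lanes)

    # Pass 1: the "then" side is issued to the whole vector,
    # but only lanes whose mask bit is set commit a result.
    for i, x in enumerate(lanes):
        if mask[i]:
            result[i] = then_op(x)

    # Pass 2: the mask is inverted and the "else" side is issued.
    for i, x in enumerate(lanes):
        if not mask[i]:
            result[i] = else_op(x)

    return mask, result


# Even lanes take the "then" side, odd lanes take the "else" side.
mask, out = simt_select([1, 2, 3, 4],
                        lambda x: x % 2 == 0,
                        lambda x: x * 10,
                        lambda x: -x)
```

In this sketch each lane appears to the programmer as an independent thread, although both sides of the branch are executed over the full vector.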
In order to efficiently utilize computing models where the CPU and the GPU can both be used for the execution of code of many types, more flexible thread synchronization models are needed. The conventional workitem synchronization available in OpenCL, for example, may be less efficient when used with workitems that are not traditional graphics processing tasks. For example, whereas concurrent workitems in a graphics processing task may seldom diverge, divergence of execution may be relatively higher in CPU-associated applications. The synchronization model provided in OpenCL, by itself, is inadequate to handle such dynamic behavior of workitems.
In OpenCL, workitems within a workgroup can be synchronized. Within a workgroup, a barrier instruction can be issued with the semantics that all workitems in the workgroup must reach the barrier before any can proceed past the barrier. A “barrier” blocks all processes that reach it until the condition noted above is satisfied. It then releases the blocked processes to resume their respective processing. However, in OpenCL and other similar conventional frameworks, barrier use in control flow is heavily restricted.
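The barrier semantics described above can be sketched with host threads standing in for workitems. This is a minimal illustration under stated assumptions, not OpenCL code: the names `GROUP_SIZE`, `workitem`, and `run_group` are hypothetical, and Python's `threading.Barrier` is used in place of a device barrier instruction.

```python
import threading

GROUP_SIZE = 4                      # assumed workgroup size
barrier = threading.Barrier(GROUP_SIZE)
lock = threading.Lock()
events = []                         # ordered log of what each workitem did


def workitem(local_id):
    with lock:
        events.append(("before", local_id))
    # Blocks until all GROUP_SIZE workitems have arrived.
    barrier.wait()
    with lock:
        events.append(("after", local_id))


def run_group():
    """Launch one thread per workitem and return the event log."""
    events.clear()
    threads = [threading.Thread(target=workitem, args=(i,))
               for i in range(GROUP_SIZE)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return list(events)
```

Regardless of how the threads are scheduled, every "before" entry in the log precedes every "after" entry, because no workitem is released until the last one has arrived.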
FIG. 1A illustrates the use of a single workitem (referred to in FIG. 1A as a kernel) to load a value into a group-shared memory space from which other workitems can obtain the loaded value. The workitem that loads the value, as well as other workitems in the workgroup, are blocked from proceeding beyond the barrier until all workitems in the group reach the barrier.
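The pattern described for FIG. 1A can be approximated as follows. The figure itself is not reproduced here, and this sketch is not its code: host threads stand in for workitems, a dictionary stands in for group-shared memory, and the names `run_broadcast`, `shared`, and `seen` are hypothetical.

```python
import threading


def run_broadcast(group_size=4, value=42):
    """One designated workitem loads a value; all others read it after a barrier."""
    shared = {"value": None}            # stands in for group-shared memory
    seen = [None] * group_size          # what each workitem observed
    barrier = threading.Barrier(group_size)

    def workitem(local_id):
        if local_id == 0:
            # Only the designated workitem performs the load.
            shared["value"] = value
        # Every workitem, including the loader, waits here.
        barrier.wait()
        # Safe to read: the load finished before the barrier released.
        seen[local_id] = shared["value"]

    threads = [threading.Thread(target=workitem, args=(i,))
               for i in range(group_size)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return seen
```

Because the barrier sits outside the conditional, all workitems reach it, and each one observes the loaded value afterwards.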
The above operation of loading the value into group-shared memory can also be done from within a library function which is called from the kernel code. FIG. 1B illustrates a library function including a barrier instruction. FIG. 1C is a kernel that calls the library function. The code in FIG. 1C illustrates an operation to block all workitems that call the corresponding library function until the designated workitem has loaded the data to the shared area.
FIG. 1D illustrates an example where placing a call to the barrier inside a library function may lead to incorrect operation. For example, calling a library function that includes the barrier instruction from a kernel having a conditional, in which one of the branches does not call the function, may lead to deadlock. This is because the barrier releases only when all workitems of a group have reached it, whereas one or more workitems, for which the condition is not fulfilled, would not reach the barrier at all.
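The hazard described for FIG. 1D can be reproduced in a small sketch. Again, this is not the code of the figure: `library_load` is a hypothetical library routine that hides a barrier, host threads stand in for workitems, and a timeout is added (an assumption made purely so that the demonstration terminates instead of hanging).

```python
import threading


def library_load(barrier, timeout):
    """Hypothetical library routine with a barrier hidden inside it."""
    try:
        barrier.wait(timeout=timeout)
        return "released"
    except threading.BrokenBarrierError:
        # Raised when the wait times out or another waiter timed out.
        return "stuck"


def run_divergent(group_size=4, timeout=0.5):
    """Only even-numbered workitems call the library, so the barrier never fills."""
    barrier = threading.Barrier(group_size)
    outcome = [None] * group_size

    def workitem(local_id):
        if local_id % 2 == 0:
            # This branch reaches the hidden barrier.
            outcome[local_id] = library_load(barrier, timeout)
        else:
            # This branch never reaches the barrier at all.
            outcome[local_id] = "skipped"

    threads = [threading.Thread(target=workitem, args=(i,))
               for i in range(group_size)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return outcome
```

Without the timeout, the workitems that entered the library routine would wait indefinitely, since the barrier expects the full group and half of the group never arrives.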
Moreover, in the workitem synchronization framework of OpenCL, workitems in a workgroup executing the kernel must execute the barrier instruction before any are allowed to continue execution. The barrier instruction must be encountered (i.e., reached in the instruction stream) by all workitems in a workgroup executing the kernel. If the barrier instruction is inside a conditional statement, then all workitems must enter the conditional if any workitem enters the conditional statement and executes the barrier. If the barrier instruction is inside a loop, all workitems must execute the barrier instruction for each iteration of the loop before any are allowed to continue execution beyond the barrier. These restrictions can limit the system's and the programmer's ability to optimally utilize processing resources.
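The loop restriction can be illustrated with a tree reduction, a common data-parallel pattern chosen here only as an example. In the sketch below, `group_sum` is a hypothetical name, host threads stand in for workitems, and the input length is assumed to be a power of two. Every workitem runs the same number of loop iterations and reaches the barrier on each one, even when it has no addition to perform.

```python
import threading


def group_sum(values):
    """Sum a list with a per-iteration barrier; len(values) must be a power of two."""
    n = len(values)
    shared = list(values)               # stands in for group-shared memory
    barrier = threading.Barrier(n)

    def workitem(local_id):
        stride = n // 2
        while stride > 0:
            if local_id < stride:
                # Only the lower half does work in this iteration.
                shared[local_id] += shared[local_id + stride]
            # Outside the conditional: all workitems arrive every iteration.
            barrier.wait()
            stride //= 2

    threads = [threading.Thread(target=workitem, args=(i,)) for i in range(n)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return shared[0]
```

Idle workitems cannot leave the loop early under the conventional rule; they continue to iterate solely to satisfy the barrier, which is one way the restriction can tie up processing resources.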
What is needed, therefore, are a method and system that enable more flexible and efficient use of workitem synchronization.