Parallel processors, such as graphics processing units (GPUs), are powerful devices that may be used for performing complex general-purpose computations. Programming languages and application programming interfaces (APIs) such as Open Computing Language (OpenCL) and Compute Unified Device Architecture (CUDA) have been developed for efficient programming of these devices.
A kernel is a program, made up of multiple threads, that executes on a computing device. The threads of a kernel are grouped into blocks that operate on many inputs in parallel. Examples of such blocks are workgroups in OpenCL and thread blocks in CUDA. When programmers write a program using an API such as OpenCL or CUDA, they must assume that each block in a kernel is independent of the others. A programmer can make no assumptions about the order in which blocks are executed in hardware. In addition, because hardware scheduling policies may vary across vendors, code written for one platform may not perform well on another.
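The block-independence requirement can be illustrated with a small CPU-side simulation. The sketch below is not OpenCL or CUDA code; it is a plain Python model, and the names `BLOCK_SIZE`, `scale_block`, `launch`, and `num_blocks` are hypothetical, chosen only for illustration. Each simulated block writes to its own disjoint slice of the output, so the result is the same whatever order the blocks run in.

```python
import random

BLOCK_SIZE = 4  # hypothetical number of threads per block


def num_blocks(n):
    """Number of blocks needed to cover n elements (rounded up)."""
    return (n + BLOCK_SIZE - 1) // BLOCK_SIZE


def scale_block(block_id, data, out):
    """Body of one simulated block: each 'thread' doubles one element.

    The block touches only indices in its own slice, so it does not
    depend on any other block having run before it.
    """
    start = block_id * BLOCK_SIZE
    for thread_id in range(BLOCK_SIZE):
        i = start + thread_id
        if i < len(data):  # guard the partially filled last block
            out[i] = 2 * data[i]


def launch(data, order):
    """Run every block once, in the given order, and return the output."""
    out = [None] * len(data)
    for block_id in order:
        scale_block(block_id, data, out)
    return out


data = list(range(10))
blocks = list(range(num_blocks(len(data))))

# Run the blocks in ascending order.
in_order = launch(data, blocks)

# Run the same blocks in a shuffled order, standing in for an
# arbitrary hardware scheduling policy.
shuffled = blocks[:]
random.Random(0).shuffle(shuffled)
out_of_order = launch(data, shuffled)
```

Because no block reads a value written by another block, `in_order` and `out_of_order` hold the same values. A kernel whose blocks shared intermediate results without synchronization would not have this property, which is why the programming model asks the programmer to assume independence.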