As computer systems have advanced, graphics processing units (GPUs) have grown in both complexity and computing power, and they are used to process increasingly large and complex graphics. As a result of this increase in processing power, GPUs are now capable of executing both graphics processing and more general computing tasks. The ability to execute general computing tasks on a GPU has led to increased development of programs that do so, and to a corresponding need to perform an increasing number of complex programming tasks on the GPU.
A general-purpose computing on graphics processing units (GPGPU) program, which executes general computing tasks on a GPU, has a host portion executing on a central processing unit (CPU) and a device portion executing on the GPU. With conventional solutions, it is not possible to launch a piece of work (a “kernel”) on the GPU from code executing on the GPU. As a result, launching a kernel involves transferring significant amounts of data from host memory to GPU memory each time a new kernel is launched from the host (CPU) side. For example, for kernels that are to be launched consecutively, the results of one kernel launch are transferred from the GPU to host memory and then transferred from host memory back to the GPU when the next kernel is launched. Transferring data between the GPU and the CPU can be a very expensive operation. Further, irregular computation is not possible. For example, recursive algorithms, such as quicksort, cannot be performed from the GPU because the number of threads and other execution properties are not available until the algorithm is executed.
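The host-driven pattern described above can be illustrated with a small toy model. The sketch below is plain Python, not GPU code: an ordinary list stands in for GPU memory, a function call stands in for a kernel launch, and a counter stands in for host-to-device and device-to-host copies. The names `partition_kernel` and `host_driven_quicksort` are hypothetical and chosen only for this illustration. It is intended to show why each quicksort step forces a round trip through the host, since the size of the next piece of work is known only after the previous one has finished.

```python
def partition_kernel(device_buf):
    """Stand-in for one kernel: partition the buffer in place around its
    last element (Lomuto scheme) and return the pivot's final index."""
    pivot = device_buf[-1]
    i = 0
    for j in range(len(device_buf) - 1):
        if device_buf[j] < pivot:
            device_buf[i], device_buf[j] = device_buf[j], device_buf[i]
            i += 1
    device_buf[i], device_buf[-1] = device_buf[-1], device_buf[i]
    return i


def host_driven_quicksort(data):
    """Toy model of a quicksort in which only the host may launch work.

    Every partition step copies its sub-range to the (simulated) device,
    runs one kernel, and copies the result back, because the host cannot
    size the next launches until it has seen where the pivot landed.
    """
    stats = {"launches": 0, "elements_copied": 0}
    host = list(data)
    pending = [(0, len(host) - 1)]  # sub-ranges still to be partitioned

    while pending:
        lo, hi = pending.pop()
        if lo >= hi:
            continue  # zero or one element: nothing to launch

        # Host -> device copy of the sub-range (simulated).
        device_buf = host[lo:hi + 1]
        stats["elements_copied"] += len(device_buf)

        # Kernel launch, issued from the host side.
        split = partition_kernel(device_buf)
        stats["launches"] += 1

        # Device -> host copy of the result (simulated).
        host[lo:hi + 1] = device_buf
        stats["elements_copied"] += len(device_buf)

        # Only now are the sizes of the next two launches known.
        pending.append((lo, lo + split - 1))
        pending.append((lo + split + 1, hi))

    return host, stats


if __name__ == "__main__":
    result, stats = host_driven_quicksort([5, 2, 8, 1, 9, 3])
    print(result, stats)
```

In this model the copy counter grows with every launch, which mirrors the transfer cost described above. It does not measure real GPU behavior, and the one-element-per-copy accounting is an assumption made for simplicity.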