Modern graphics processing units (GPUs) include an array of cores, referred to as execution units (EUs), to process instructions. A set of instructions comprises a kernel. Kernels are dispatched to the GPU in the form of multiple threads. The GPU may process the threads of the kernel (e.g., execute the instructions corresponding to the kernel) using the EUs.
Many kernels, particularly kernels corresponding to encoded display data, contain dependencies between threads in the kernel. Said differently, some of the threads in the kernel cannot begin execution until the threads on which they depend have been executed. As such, only a subset of the total number of threads in a kernel can be executed by a GPU in parallel. The number of threads in each subset that can be executed in parallel may be less than the total number of EUs. As a result, some of the EUs may be idle (e.g., not processing a thread) due to the dependencies between threads and the resulting limit on the number of threads that can be processed in parallel. As can be appreciated, this may result in underutilization of the EUs in a GPU and may create a bottleneck in the overall processing pipeline.
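By way of illustration only, the effect of thread dependencies on EU utilization may be sketched with a small simulation. The sketch below is not taken from any particular GPU or driver. It assumes a hypothetical two-dimensional thread space in which each thread depends on its left, top, and top-right neighbors (a pattern commonly associated with block-based decoding), assumes every thread takes one time step, and uses the illustrative names `deps`, `simulate`, and `utilization`.

```python
def deps(thread, width):
    """Return the threads that `thread` must wait for.

    Assumed pattern (hypothetical): left, top, and top-right neighbors.
    Neighbors that fall outside the thread space are ignored.
    """
    x, y = thread
    candidates = [(x - 1, y), (x, y - 1), (x + 1, y - 1)]
    return [(cx, cy) for cx, cy in candidates if 0 <= cx < width and cy >= 0]


def simulate(width, height, num_eus):
    """Return the number of busy EUs at each time step.

    At every step, a thread is ready once all of its dependencies have
    completed; at most `num_eus` ready threads are executed per step.
    """
    done = set()
    pending = {(x, y) for y in range(height) for x in range(width)}
    busy_per_step = []
    while pending:
        ready = sorted(
            t for t in pending if all(d in done for d in deps(t, width))
        )
        batch = ready[:num_eus]  # one thread per EU
        done.update(batch)
        pending.difference_update(batch)
        busy_per_step.append(len(batch))
    return busy_per_step


def utilization(busy_per_step, num_eus):
    """Fraction of EU time slots that actually processed a thread."""
    return sum(busy_per_step) / (len(busy_per_step) * num_eus)


if __name__ == "__main__":
    busy = simulate(width=4, height=4, num_eus=8)
    print("busy EUs per step:", busy)
    print("utilization:", utilization(busy, num_eus=8))
```

Under these assumptions, a 4x4 thread space never has more than two threads ready in the same step, so with eight EUs at least six EUs are idle at every step even though sixteen threads are waiting to be processed.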
Conventionally, a kernel developer may attempt to improve EU utilization by rewriting and merging multiple kernels into a combined kernel. Said differently, the threads in each of the kernels may be combined into a single thread space. As can be appreciated, however, this requires a substantial amount of manual work by the kernel developer. Furthermore, the combined kernel has a much larger footprint than the separate kernels, resulting in more pressure on the hardware resources (e.g., instruction cache, sampler cache, or the like). Additionally, manually combining kernels is not viable for complicated and/or large kernels due to the amount of manual effort necessary to merge them.
Thus, there is a need for techniques to increase the utilization of EUs in a GPU.