This disclosure relates generally to the field of general-purpose computing on graphics processing units (GPGPU) and to optimizing developer-defined workgroup characteristics. More particularly, but not by way of limitation, this disclosure relates to techniques for coalescing (e.g., combining) work items in a workgroup when the workgroup size appears too large, and for aliasing workgroups to coalesce work items from different workgroups when the workgroup size appears too small. In some instances, these two techniques can be used together to reduce the overall overhead associated with a work task.
In the field of parallel computing utilizing Graphics Processing Units (GPUs), several computing languages are available. For example, OpenCL and OpenGL are standards utilized by developers to interface with GPUs. GPUs can have many cores that run in parallel to process programs called “kernels,” where each kernel operates on a number of work items in parallel. A developer will define the “best” N-dimensional index space for their algorithm. Kernels are then executed by the GPU across a global domain of work items, and the work items are grouped into local work groups. For example, the global dimensions may be 1024 by 1024 pixels and represent the whole problem space, with a plurality of work groups, each having local dimensions of 32 by 32, executing together. That is, the work is defined as global work divided into a set of work groups, which are in turn divided into a set of work items.
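The arithmetic behind the 1024 by 1024 example can be sketched as follows. This is a minimal illustration only, written in Python rather than in any GPU language; the function names `decompose` and `locate` are hypothetical, and the sketch assumes that each global dimension is an exact multiple of the corresponding local dimension.

```python
import math


def decompose(global_size, local_size):
    """Split an N-dimensional global index space into work groups.

    Returns (groups per dimension, total work groups, work items per group).
    Assumption: every global dimension is an exact multiple of the matching
    local dimension; anything else is rejected.
    """
    if len(global_size) != len(local_size):
        raise ValueError("global and local sizes must have the same rank")
    groups_per_dim = []
    for g, l in zip(global_size, local_size):
        if l <= 0 or g % l != 0:
            raise ValueError("global size must be a multiple of local size")
        groups_per_dim.append(g // l)
    return (
        tuple(groups_per_dim),
        math.prod(groups_per_dim),  # total number of work groups
        math.prod(local_size),      # work items inside each work group
    )


def locate(global_id, local_size):
    """Map a global work-item index to (work-group index, local index)."""
    group_id = tuple(g // l for g, l in zip(global_id, local_size))
    local_id = tuple(g % l for g, l in zip(global_id, local_size))
    return group_id, local_id


if __name__ == "__main__":
    print(decompose((1024, 1024), (32, 32)))
    print(locate((33, 70), (32, 32)))
```

For the dimensions given above, the global space divides into 32 by 32 work groups (1024 in total), each holding 1024 work items. Under the same assumptions, the work item at global index (33, 70) falls in work group (1, 2) at local index (1, 6).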
To load a work group onto a GPU, data associated with the work group must be initialized, and the functional code associated with the work group and the metadata associated with each work item must be loaded. This load (initialization) time represents overhead in addition to the time spent actually doing the work of all the work items within the work group. Because the developer cannot predict the technical capabilities that might be available at a future run-time for their application, it is not always practical for a developer to tune for “optimal” parameters for a given hardware platform.
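One way to picture how the per-work-group load time scales is the toy cost model below. It is a sketch under strong assumptions, not a description of any actual GPU: the function name `load_overhead` and both cost parameters are hypothetical, costs are in arbitrary units, and the model is purely additive (it ignores parallel execution and any hardware-specific effects).

```python
def load_overhead(total_items, items_per_group, group_load_cost, item_cost):
    """Toy additive cost model (all parameters are illustrative assumptions).

    group_load_cost: assumed fixed cost to initialize and load one work group.
    item_cost:       assumed cost of doing the work of one work item.
    Returns (number of work groups, total load overhead, total work cost).
    """
    if total_items <= 0 or items_per_group <= 0:
        raise ValueError("sizes must be positive")
    # Ceiling division: a partially filled work group still has to be loaded.
    num_groups = -(-total_items // items_per_group)
    overhead = num_groups * group_load_cost
    work = total_items * item_cost
    return num_groups, overhead, work


if __name__ == "__main__":
    total = 1024 * 1024
    for size in (64, 1024):
        groups, overhead, work = load_overhead(total, size, 100, 1)
        print(size, groups, overhead, work)
```

In this model the useful work is the same for both work-group sizes, but the smaller size produces 16 times as many work groups and therefore 16 times the load overhead. The numbers are placeholders; real load costs depend on the hardware available at run-time, which is the quantity the developer cannot know in advance.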