Certain computing applications, such as games, media players and the like, may utilize processing units, e.g., central processing units (CPUs), graphics processing units (GPUs), general purpose graphics processing units (GPGPUs), accelerated processing units (APUs), and the like, to handle compute-intensive workloads. In such a case, a processing unit such as a CPU may dispatch a workload to, for example, a GPU in the form of one or more commands, and the GPU may internally execute multiple threads in response to the commands. The threads executed by the GPU are typically organized into groups called thread groups, and multiple thread groups may need to be executed in order to complete the workload. General purpose programming on GPUs may include the concept of global and local work items. A barrier may be supported across local work items, that is, within a thread group. A barrier across global work items, that is, across thread groups, may however not be supported by hardware. Such a barrier may require that the execution of all thread groups be completed and synchronized before the next set of workloads can be launched.
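
The distinction between the two kinds of barrier can be illustrated with a minimal CPU-side sketch. It is only an analogy and not GPU code: Python threads stand in for work items, a `threading.Barrier` stands in for the barrier supported within a thread group, and joining every thread before the next launch stands in for the synchronization across thread groups that hardware may not provide. The names `run_workload` and `two_stage_sum`, and the two-stage summation itself, are assumptions chosen for illustration.

```python
import threading


def run_workload(num_groups, group_size, data):
    """Simulate one workload: each thread group sums its own slice of data."""
    results = [0] * num_groups

    def work_item(group_id, local_id, scratch, local_barrier):
        # Phase 1: every local work item writes one value into the
        # scratch storage shared by its thread group.
        scratch[local_id] = data[group_id * group_size + local_id]
        # Local barrier: waits only for work items of the same group.
        local_barrier.wait()
        # Phase 2: one work item reads what the others wrote.
        if local_id == 0:
            results[group_id] = sum(scratch)

    threads = []
    for group_id in range(num_groups):
        scratch = [0] * group_size                      # per-group shared storage
        local_barrier = threading.Barrier(group_size)   # per-group barrier
        for local_id in range(group_size):
            threads.append(threading.Thread(
                target=work_item,
                args=(group_id, local_id, scratch, local_barrier)))

    for t in threads:
        t.start()
    # Stand-in for a barrier across thread groups: the dispatcher waits
    # until every thread group has completed before returning.
    for t in threads:
        t.join()
    return results


def two_stage_sum(data, num_groups, group_size):
    """Launch a second workload only after the first has fully completed."""
    partial = run_workload(num_groups, group_size, data)
    # The second workload consumes results from all groups of the first,
    # so it cannot start until the global synchronization point is reached.
    total = run_workload(1, num_groups, partial)
    return partial, total[0]


if __name__ == "__main__":
    print(two_stage_sum(list(range(8)), num_groups=2, group_size=4))
```

In this sketch, the second call to `run_workload` depends on the partial sums of every group, which is why the dispatcher must wait for all thread groups before launching it. That round trip back to the dispatcher is the cost associated with a barrier across thread groups when no hardware support exists.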