Certain applications, such as games, virtual reality (VR) environments and media players, may use embedded designs, graphics processing units (GPUs), etc., to handle compute-intensive workloads. In such a case, a central processing unit (CPU) may dispatch a workload to, for example, a GPU in the form of one or more commands, wherein the GPU may internally execute a work group containing multiple work items in response to the one or more commands. In order to maintain sequential consistency between work items and work groups on the GPU, solutions such as memory fences may be used. Memory fences, however, may cause threads to stall while waiting for other threads handling work items from the same work group to reach a synchronization point. Stalled threads may lead to processing “bubbles”, i.e., periods in which execution resources sit idle, that have a negative impact on performance.
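By way of illustration only, the synchronization point described above may be sketched on CPU threads as follows. This is a simplified analogy under stated assumptions and not GPU code: the function name `run_work_group`, the list `shared` standing in for work-group local memory, and the use of Python's `threading.Barrier` in place of a hardware barrier and memory fence are all hypothetical choices made for the sketch.

```python
import threading


def run_work_group(inputs):
    """Simulate one work group; each CPU thread plays the role of a work item."""
    n = len(inputs)
    shared = [0] * n    # hypothetical stand-in for work-group local memory
    results = [0] * n
    barrier = threading.Barrier(n)  # synchronization point for the whole group

    def work_item(i):
        # Phase 1: each work item writes only its own slot.
        shared[i] = inputs[i] * 2
        # Every work item stalls here until all n items have arrived, so
        # the phase 1 writes are complete before any phase 2 read begins.
        barrier.wait()
        # Phase 2: read the slot written by the neighbouring work item.
        results[i] = shared[i] + shared[(i + 1) % n]

    threads = [threading.Thread(target=work_item, args=(i,)) for i in range(n)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

In this sketch, a work item that finishes its first phase early may do nothing useful until the slowest work item reaches `barrier.wait()`; that idle interval corresponds to the processing bubble noted above.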