Certain applications, such as games and media players, may use embedded designs, graphics processing units (GPUs), etc., to handle compute-intensive workloads. In such a case, a central processing unit (CPU) may dispatch a workload to, for example, a GPU in the form of one or more commands, wherein the GPU may internally execute a work group containing multiple work items in response to the one or more commands. In order to maintain sequential consistency between work items and work groups on the GPU, solutions such as barrier commands (“barriers”), memory fences and unordered atomic operations (“atomics”) may be used. Such solutions, however, may be coarse-grained and unable to control specific memory access orderings. Moreover, conventional solutions may be unable to ensure memory consistency (e.g., coherency) across the GPU and the CPU unless hardware flushes and software stack calls are used, wherein flushes and stack calls may have a negative impact on performance.
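By way of illustration only, the coarse-grained character of a fence combined with unordered atomics may be sketched on a CPU using standard C++ atomics. The sketch is an analogy under stated assumptions rather than GPU code or a description of any particular hardware: the names payload, ready, producer, consumer and run_handoff are hypothetical, and two host threads stand in for two work items.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Hypothetical names for illustration; none of them come from the text above.
static int payload = 0;                 // plain, non-atomic data
static std::atomic<bool> ready{false};  // flag touched only by relaxed (unordered) atomics

// Writes the data, then publishes the flag.
void producer() {
    payload = 42;
    // Full fence: orders every earlier access in this thread before every
    // later one. It cannot be limited to just the payload/flag pair.
    std::atomic_thread_fence(std::memory_order_seq_cst);
    ready.store(true, std::memory_order_relaxed);
}

// Waits for the flag, then reads the data.
int consumer() {
    while (!ready.load(std::memory_order_relaxed)) {
        std::this_thread::yield();
    }
    // Matching full fence: without it, the relaxed load above would not
    // order the read of payload below.
    std::atomic_thread_fence(std::memory_order_seq_cst);
    return payload;
}

// Runs one producer/consumer handoff and returns the value the consumer saw.
int run_handoff() {
    payload = 0;
    ready.store(false, std::memory_order_relaxed);
    int seen = -1;
    std::thread c([&seen] { seen = consumer(); });
    std::thread p(producer);
    p.join();
    c.join();
    assert(ready.load(std::memory_order_relaxed));
    return seen;
}
```

Calling run_handoff() returns the value written by the producer. The relaxed atomics by themselves impose no ordering on payload, so the ordering in this sketch comes entirely from the two full fences, each of which constrains all surrounding memory accesses rather than one specific access.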