Certain applications, such as games and media players, may use embedded designs, graphics processing units (GPUs), etc., to handle compute-intensive workloads. In such a case, a central processing unit (CPU) may dispatch the workloads to, for example, a GPU in the form of one or more commands, wherein the GPU may internally execute multiple threads in response to the commands. The threads in a thread group may run in parallel, with each thread executing a kernel as part of the workload running on the GPU. While such an approach may be suitable under certain circumstances, there remains considerable room for improvement. For example, a given GPU may have several caches containing data that becomes stale during execution of the workloads, wherein conventional computing solutions may rely on the CPU to flush and/or invalidate the caches of the GPU. Such an approach may result in suboptimal performance and an increased memory footprint (e.g., a relatively high number of memory allocations).