To enhance processing efficiency, a processor typically employs multiple modules, referred to as compute units (CUs), to execute operations in parallel. For example, a processor can employ a graphics processing unit (GPU) that includes multiple CUs to execute graphics and vector processing operations in parallel. However, the communication and bus bandwidth available to the CUs can limit the overall efficiency of the processor. For example, in the course of executing the graphics and vector processing operations, the CUs frequently store data to, and retrieve data from, a memory hierarchy connected to the CUs via a communication fabric, such as a bus. The communication traffic supporting these data transfers can consume an undesirably large portion of the communication fabric's available bandwidth, thereby reducing overall processing efficiency at the GPU.