In an ideal multi-core system (whether multi-processor or heterogeneous), the processing units are connected to the same cache hierarchy and share the same memory space. However, some multi-processor computing systems provide coherency in only part of the system, as implementing full coherency schemes is often very costly. For example, in a system having a central processing unit (CPU), a graphics processing unit (GPU), and a digital signal processor (DSP), only the CPU and GPU may utilize coherency. In many typical systems, each processing unit utilizes its own cache hierarchy and memory space. Data for certain tasks executed by the processing units may be transferred between various memory units in order to enable particular processing units to perform the associated tasks. For example, a CPU may utilize data within a first data store for a first task, a GPU may use data within the first data store and a third data store for a second task, a DSP may use data within the first data store and a second data store for a third task, and the CPU may use data within the first data store and a fourth data store for a fourth task. Because each data transfer within a multi-core computing system can require a costly cache flush that includes writes to memory, transferring data for use by different tasks and/or to offload work to various cores often incurs significant overhead costs.
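The four-task example above can be illustrated with a minimal sketch that counts how many handoffs of a data store between processing units might require a cache flush. This is a simplified model and not a description of any particular system: the names `TASKS`, `count_flushes`, and `store1` through `store4` are hypothetical, every task is assumed to modify each data store it uses, and a flush is assumed to be needed whenever a data store moves between two units that are not both in the same coherent group.

```python
from typing import FrozenSet, List, Set, Tuple

# (processing unit, data stores used), in the order given in the example.
# Store names are hypothetical labels for the first through fourth data stores.
TASKS: List[Tuple[str, Set[str]]] = [
    ("CPU", {"store1"}),
    ("GPU", {"store1", "store3"}),
    ("DSP", {"store1", "store2"}),
    ("CPU", {"store1", "store4"}),
]


def count_flushes(
    tasks: List[Tuple[str, Set[str]]],
    coherent: FrozenSet[str] = frozenset(),
) -> List[Tuple[str, str, str]]:
    """Return (previous unit, next unit, store) for each assumed flush.

    Assumption: a flush is needed when a store last used by one unit is
    next used by a different unit, unless both units are in `coherent`.
    """
    last_owner = {}  # store name -> unit that most recently used it
    flushes = []
    for unit, stores in tasks:
        for store in sorted(stores):
            owner = last_owner.get(store)
            changed_hands = owner is not None and owner != unit
            both_coherent = owner in coherent and unit in coherent
            if changed_hands and not both_coherent:
                flushes.append((owner, unit, store))
            last_owner[store] = unit
    return flushes


no_coherency = count_flushes(TASKS)
partial_coherency = count_flushes(TASKS, coherent=frozenset({"CPU", "GPU"}))
print(len(no_coherency), len(partial_coherency))
```

Under these assumptions, the first data store changes hands at every task boundary, so it dominates the flush count, while coherency limited to the CPU and GPU removes only the CPU-to-GPU handoff.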