Data processing systems, such as a System-on-a-Chip (SoC), may contain multiple processor hosts, multiple data caches and shared data resources. The multiple hosts typically have identical or at least similar processing capabilities, so such a system may be termed a Uniform Compute Device. Data to be processed is retrieved from a shared data resource and moved up to the highest level cache (level one, or L1) for processing. Processing results are moved down to the lowest level cache and then stored in a shared data resource. A consequence of this approach is that processing is delayed whenever the required data is not available in the L1 cache and must be retrieved from a lower level cache or a shared data resource.
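The delay described above can be illustrated with a minimal sketch. It assumes a lookup that probes each cache level in turn, so that the cost accumulates until the data is found. The level names beyond L1 and all latency values are hypothetical, chosen only for illustration, and do not describe any particular device.

```python
# Hypothetical hierarchy: (level name, latency in cycles). Values are illustrative only.
LEVELS = [("L1", 4), ("L2", 12), ("L3", 40), ("shared resource", 200)]


def access_cost(hit_level: str) -> int:
    """Return the cycles spent before data found at hit_level can be processed.

    Each level is probed in order from L1 downward, and the latency of every
    probed level is added to the total until the data is found.
    """
    cost = 0
    for name, latency in LEVELS:
        cost += latency
        if name == hit_level:
            return cost
    raise ValueError(f"unknown level: {hit_level}")


if __name__ == "__main__":
    for name, _ in LEVELS:
        print(f"data found in {name}: {access_cost(name)} cycles")
```

Under these assumed values, a request that must go all the way to the shared data resource costs many times more than one that hits in the L1 cache, which is the delay the host experiences.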
An alternative approach is to add “processing-in-memory” (PIM) elements, also called “compute-near-memory” (CNM) elements or the like. In this approach, logic elements and memory elements (such as dynamic random access memory (DRAM)) are integrated in a common integrated circuit. The logic elements execute separate PIM instructions that are created prior to execution. A special processing unit for managing these instructions is added next to each host, and a PIM monitor is added next to the last level cache. The data paths of the PIM instructions are separated from those of the normal instructions, which in turn requires significant communication between the hosts, the monitor and the special processing units. A significant disadvantage of this approach is that it does not fully utilize the resources provided by the host processor. For example, if the accessed data has poor data locality, the scheduler will still send the PIM instructions to execute in (or near) memory, even when the host is idle and the processing units in memory are fully occupied. In addition, the PIM instructions are executed atomically, without speculation.
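The utilization problem can be sketched as follows. This is a toy model, not an implementation of any actual PIM scheduler: the `Instruction` class, the `static_dispatch` function and the unit counts are hypothetical names and values introduced only for illustration. The sketch assumes that each instruction is marked as PIM or normal when it is created, and that the dispatcher routes it on that mark alone.

```python
from dataclasses import dataclass


@dataclass
class Instruction:
    name: str
    is_pim: bool  # fixed when the instruction is created, before execution


def static_dispatch(instructions, pim_units):
    """Route instructions purely by their pre-assigned type.

    PIM instructions always go to the in-memory units, regardless of how busy
    those units are or whether the host has nothing to do.
    """
    host_queue, pim_queue = [], []
    for ins in instructions:
        (pim_queue if ins.is_pim else host_queue).append(ins)
    # Rounds needed for the PIM units to drain their queue (ceiling division).
    pim_rounds = -(-len(pim_queue) // pim_units)
    return host_queue, pim_queue, pim_rounds


# Eight PIM instructions and no normal instructions, with two in-memory units.
work = [Instruction(f"pim{i}", True) for i in range(8)]
host_q, pim_q, rounds = static_dispatch(work, pim_units=2)

if __name__ == "__main__":
    print(f"host queue: {len(host_q)}, PIM queue: {len(pim_q)}, PIM rounds: {rounds}")
```

In this model the host queue stays empty while the two in-memory units need four rounds to finish the work, which mirrors the case of an idle host alongside fully occupied processing units in memory.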