Many modern computing systems incorporate multiple processors and accelerators operating within a single address space. An important subclass of such systems comprises those with many loosely coupled (i.e., not on the same die or chip) processors or accelerators, each with some amount of directly attached memory, that can also access memory elsewhere in the system, albeit at increased cost. Examples include systems with multiple discrete graphics processing units (GPUs) and the emerging class of in-memory or near-memory processing devices. Because access to the directly attached “local” memories is highly efficient, application programs written for these systems will mostly operate out of that local memory, with only infrequent accesses to other memories in the system.
Most traditional cache coherence mechanisms rely on either a broadcast mechanism that makes any memory access by any processor visible to all other coherent processors, or a directory structure that uses a system of probes and responses with local probe filters that store information about which processors are caching which cache lines. However, the broadcast approach does not scale to large numbers of coherent processors or to bandwidth-intensive devices such as GPUs and processors-in-memory (PIMs) because of the high bandwidth it requires. The directory approach is more scalable but incurs high storage overhead and design complexity to track which processor is caching which data and to keep the cached copies coherent. Furthermore, directory-based coherence protocols require knowledge of the number of coherent processors in the system at design time, or else incur further overhead.
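The scaling concerns described above can be illustrated with a rough back-of-the-envelope sketch. It assumes a broadcast scheme that probes every other coherent processor on each miss, and a full-map (bit-vector) directory that keeps one presence bit per processor plus a few state bits for every tracked cache line. The function names, the 64-byte line size, and the 2 state bits are illustrative assumptions, not details taken from any particular system.

```python
def broadcast_probes_per_miss(num_processors):
    """Probes sent on one miss if every other coherent processor is probed."""
    return num_processors - 1


def full_map_directory_overhead(num_processors, line_bytes=64, state_bits=2):
    """Directory bits per data bit for an assumed full-map directory.

    Each tracked cache line gets one presence bit per processor plus
    `state_bits` of protocol state (an illustrative figure).
    """
    entry_bits = num_processors + state_bits
    data_bits = line_bytes * 8
    return entry_bits / data_bits
```

Under these assumptions, the directory entry for a 64-byte line grows linearly with the processor count and exceeds the size of the line itself once more than 510 processors are tracked, which is one way to see why the processor count must be fixed at design time or paid for with additional overhead.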
A simple approach to cache coherence is to make shared data uncacheable. However, this often leads to significant performance degradation due to inefficient use of memory bandwidth and long load latencies, because temporal and spatial locality cannot be exploited. Some early implementations of cache-coherent GPUs and other accelerators exploited relaxed memory models to provide low-cost coherence via cache flushes. In these systems, caches are flushed at synchronization points to write back cached writes so that they are visible to other entities in the system, and to purge local copies so that subsequent reads will pull in updates from other devices in the system. However, cache flushing is expensive for fine-grained data sharing or synchronization because it evicts the entire contents of the cache. Thus, existing scalable solutions to cache coherence either incur high storage and communication costs or result in degraded performance.
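The flush-based scheme described above can be modeled with a small toy simulation. This is a minimal sketch only: the class name `FlushCoherentCache`, the dictionary standing in for shared memory, and the default value of 0 for unwritten addresses are all hypothetical choices made for illustration, and real hardware operates on cache lines rather than individual addresses.

```python
class FlushCoherentCache:
    """Toy write-back cache that stays coherent only via full flushes."""

    def __init__(self, memory):
        self.memory = memory  # shared backing store (a dict, by assumption)
        self.lines = {}       # locally cached address -> value
        self.dirty = set()    # addresses written locally, not yet written back

    def read(self, addr):
        # A hit returns the local copy, which may be stale.
        if addr not in self.lines:
            self.lines[addr] = self.memory.get(addr, 0)
        return self.lines[addr]

    def write(self, addr, value):
        # Writes stay local until the next synchronization point.
        self.lines[addr] = value
        self.dirty.add(addr)

    def sync(self):
        """Synchronization point: write back dirty data, then purge everything.

        Returns the number of cached entries evicted, i.e. the cost of the flush.
        """
        for addr in self.dirty:
            self.memory[addr] = self.lines[addr]
        evicted = len(self.lines)
        self.lines.clear()
        self.dirty.clear()
        return evicted
```

In this model, a value written through one cache becomes visible to a second cache only after the writer calls `sync()` and the reader also calls `sync()` to discard its stale copy. Every `sync()` evicts all cached entries regardless of how few were actually shared, which is the cost the paragraph above attributes to fine-grained sharing.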
In the following description, the use of the same reference numerals in different drawings indicates similar or identical items. Unless otherwise noted, the word “coupled” and its associated verb forms include both direct connection and indirect electrical connection by means known in the art, and unless otherwise noted any description of direct connection implies alternate embodiments using suitable forms of indirect electrical connection as well.