Various embodiments of this disclosure relate to cache coherence and, more particularly, to improved store operations to maintain cache coherence.
In a computing architecture having a host processor and an accelerator, both the processor and the accelerator have processing elements and may share access to main memory. In that case, the host processor may have one or more private caches, such as a Level 1 (L1) cache, for each processing element. In contrast, the accelerator may have no private caches. Cache coherence is the state in which a cache is consistent with other caches or with main memory. In this architecture, cache coherence therefore requires the processor's private caches to be consistent with main memory.
With a processor-in-memory implementation, such as the active memory cube (AMC), operations come from processing lanes within the AMC and from the host processor, and cache coherence is maintained through the use of a coherence bit replicated in each 32-byte sector of a 128-byte cache line. When set to a value of 1, the coherence bit indicates that the host processor has a copy of the memory line in one of its caches. When set to a value of 0, the coherence bit indicates that the memory line is not stored in any of the host processor's caches.
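As a rough illustration of the layout described above, the following Python sketch models a 128-byte memory line as four 32-byte sectors, each carrying its own copy of the coherence bit. The names `MemoryLine`, `set_coherence`, and `host_may_hold_copy` are hypothetical and are not taken from the AMC design; the sketch only shows the replication of the bit and the meaning of its two values.

```python
from dataclasses import dataclass, field

LINE_BYTES = 128
SECTOR_BYTES = 32
SECTORS_PER_LINE = LINE_BYTES // SECTOR_BYTES  # four sectors per line


@dataclass
class MemoryLine:
    """Hypothetical model of one 128-byte memory line."""
    data: bytearray = field(default_factory=lambda: bytearray(LINE_BYTES))
    # One copy of the coherence bit per 32-byte sector (same value in each).
    coherence_bits: list = field(
        default_factory=lambda: [0] * SECTORS_PER_LINE
    )


def set_coherence(line, value):
    """Write the same coherence value into every sector's copy of the bit."""
    line.coherence_bits = [value] * SECTORS_PER_LINE


def host_may_hold_copy(line, sector=0):
    """Return True if the bit is 1 (a host cache holds the line).

    Because the bit is replicated, any sector's copy can be consulted.
    """
    return line.coherence_bits[sector] == 1
```

Because every sector carries the bit, an access that touches only one sector can still learn whether the host processor holds the line.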
When a processing lane performs a store operation, a memory controller examines the coherence bit of the memory line being stored to determine whether the host processor must flush any copies of the memory line held in its private caches. To this end, the memory controller performs a read-modify-write operation for each store operation. More specifically, the coherence bit is read to determine whether the memory line exists in a private cache; the coherence bit and line data of the memory line are modified; and the modified memory line is written back to memory. Performing this read on every store increases latency and reduces bandwidth utilization for store operations.
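The read-modify-write sequence described above can be sketched in Python as follows. This is a simplified model under stated assumptions, not the memory controller's actual logic: the `Memory` class, the `lane_store` function, and the `flush_host_copy` callback are hypothetical, the model works on a single 32-byte sector, and it assumes that a flush leaves no host copy and does not itself change the sector contents.

```python
SECTOR_BYTES = 32


class Memory:
    """Hypothetical memory that counts sector reads and writes."""

    def __init__(self, num_sectors):
        self.sectors = [
            {"data": bytearray(SECTOR_BYTES), "coherence": 0}
            for _ in range(num_sectors)
        ]
        self.reads = 0
        self.writes = 0

    def read_sector(self, index):
        self.reads += 1
        sector = self.sectors[index]
        return bytes(sector["data"]), sector["coherence"]

    def write_sector(self, index, data, coherence):
        self.writes += 1
        self.sectors[index] = {"data": bytearray(data), "coherence": coherence}


def lane_store(memory, index, offset, new_bytes, flush_host_copy):
    """Read-modify-write store issued by a processing lane."""
    if offset < 0 or offset + len(new_bytes) > SECTOR_BYTES:
        raise ValueError("store must fit inside one sector")

    # Read: fetch the sector to learn the coherence bit (and the old data).
    data, coherence = memory.read_sector(index)
    if coherence == 1:
        # The host holds a copy, so it is asked to flush it.
        flush_host_copy(index)
        coherence = 0  # assumption: after the flush no host copy remains

    # Modify: merge the new bytes into the old sector contents.
    merged = bytearray(data)
    merged[offset:offset + len(new_bytes)] = new_bytes

    # Write: store the updated data and coherence bit back to memory.
    memory.write_sector(index, merged, coherence)
```

In this model every call to `lane_store` costs one read and one write, which is the source of the added latency and reduced bandwidth utilization.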
In an AMC, error-correcting code (ECC) bits are used at a granularity of 32 bytes, applicable to a 32-byte sector. The read-modify-write operation cannot be avoided for stores that target a subset of the 32-byte sector, because the ECC bits apply to the entire 32 bytes and must be modified if any data within the 32 bytes is modified. Thus, the existing sector must be read so that the ECC bits can be generated over the entire sector, including both old and newly stored data. When the store operation applies to a multiple of 32 bytes, the new ECC bits for the applicable sectors can be generated without a read-modify-write. No reading action needs to be performed on the sector in that case because the entire 32 bytes are new to the sector. However, a read operation is still needed to read the coherence bit.
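The distinction between partial-sector and whole-sector stores can be illustrated with a short Python sketch. The function `sectors_needing_ecc_read` is hypothetical and covers only the ECC aspect; it assumes byte addressing with sectors aligned on 32-byte boundaries, and it treats a store as avoiding the ECC read only when it covers whole, aligned sectors.

```python
SECTOR_BYTES = 32


def sectors_needing_ecc_read(address, length):
    """Return the indices of sectors that the store covers only partly.

    Those sectors hold old data that must be read before new ECC bits
    can be generated; fully covered sectors need no such read.
    """
    if length <= 0:
        raise ValueError("length must be positive")

    end = address + length                  # one past the last stored byte
    first = address // SECTOR_BYTES
    last = (end - 1) // SECTOR_BYTES

    partial = []
    for sector in range(first, last + 1):
        sector_start = sector * SECTOR_BYTES
        sector_end = sector_start + SECTOR_BYTES
        # Partly covered if the store begins after the sector's start
        # or finishes before the sector's end.
        if address > sector_start or end < sector_end:
            partial.append(sector)
    return partial
```

For example, a 64-byte store at address 0 yields an empty list, whereas a 16-byte store at address 0 yields sector 0. Even when the list is empty, the coherence bit must still be read under the scheme described above, so the read is not eliminated.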