1. Field
The embodiments are generally directed to managing memory, and more specifically to managing memory among heterogeneous computer components.
2. Background Art
A computing device generally includes one or more processing units (e.g., a central processing unit (CPU), a graphics processing unit (GPU), a general purpose GPU (GPGPU), an accelerated processing unit (APU), or the like) that access a shared main memory. The processing units may execute programs (e.g., instructions or threads) that result in accesses to main memory. Because memory accesses may traverse a memory hierarchy including levels of cache and main memory, memory accesses may have different latencies and may be performed in a different order than the programs intended. In addition, there may be conflicts, e.g., when two memory accesses attempt to store data in the same memory location.
Memory accesses are also called memory events, and examples include a store event (i.e., a memory access request to write data to main memory), a load event (i.e., a memory access request to read data from main memory), and synchronization events that are used to order conflicting memory events.
Memory consistency models provide rules for ordering memory events. One type of memory consistency model, release consistency with special accesses sequentially consistent (RCsc), provides a framework for event ordering for parallel programs with synchronization. Current systems that implement an RCsc memory model, namely a write-through (WT) memory system and a write-combining (WC) memory system, have difficulty with synchronization events such as a store release (StRel) synchronization event.
A StRel synchronization event is a release synchronizing store instruction that acts like an upward memory fence such that prior memory operations are visible to threads that share access to the ordering point before the store event portion of the StRel completes. A load acquire (LdAcq) synchronization event is a synchronizing load instruction that acts as a downward memory fence such that later operations cannot occur before the LdAcq operation.
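The ordering described above can be sketched with standard C++ atomics, under the assumption that `std::memory_order_release` and `std::memory_order_acquire` are acceptable analogues of StRel and LdAcq. The names `Mailbox`, `produce`, `consume`, and `run_handoff` are hypothetical and do not come from the embodiments. C++ release/acquire ordering by itself does not make the synchronization accesses sequentially consistent with one another; `std::memory_order_seq_cst` on the same two operations would be closer to RCsc.

```cpp
#include <atomic>
#include <thread>

// Hypothetical illustration; names are not taken from the described embodiments.
struct Mailbox {
    int payload = 0;                 // ordinary (non-synchronizing) data
    std::atomic<bool> ready{false};  // synchronization variable
};

// Producer: an ordinary store followed by a store release (StRel analogue).
// The payload store must be visible before the flag store is observed.
void produce(Mailbox& box, int value) {
    box.payload = value;
    box.ready.store(true, std::memory_order_release);
}

// Consumer: spins on a load acquire (LdAcq analogue). The payload read
// cannot be performed before the acquiring load that observes the flag.
int consume(Mailbox& box) {
    while (!box.ready.load(std::memory_order_acquire)) {
        std::this_thread::yield();
    }
    return box.payload;
}

// Runs one producer thread and one consumer thread and returns the value
// the consumer observed.
int run_handoff(int value) {
    Mailbox box;
    int seen = -1;
    std::thread consumer([&] { seen = consume(box); });
    std::thread producer([&] { produce(box, value); });
    producer.join();
    consumer.join();
    return seen;
}
```

Because the consumer reads `payload` only after its acquiring load observes the flag written by the releasing store, `run_handoff(v)` returns `v` for any `v`.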
Upon executing a StRel synchronization event in a WT memory system, data is immediately written through to main memory, which is an inefficient use of the precious bandwidth resources to main memory. In addition, the system tracks acknowledgements for individual store completions, which is highly inefficient. Further, upon receiving a LdAcq synchronization event, the system performs a full cache flush to invalidate clean and potentially stale data, which makes data reuse in the presence of synchronization impossible.
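The WT costs named above can be made concrete with a toy model. This is a minimal sketch under simplifying assumptions (byte-granular stores, one acknowledgement per store, no timing); the class `WriteThroughModel` and its member names are hypothetical and do not describe any particular hardware.

```cpp
#include <cstddef>
#include <cstdint>
#include <unordered_map>

// Toy write-through model (hypothetical; for illustration only).
class WriteThroughModel {
public:
    // Every store updates the cache and is also sent to main memory,
    // leaving one acknowledgement outstanding per store.
    void store(std::size_t addr, std::uint8_t value) {
        cache_[addr] = value;
        ++memory_writes_;
        ++pending_acks_;
    }

    // Main memory acknowledges one completed store.
    void acknowledge() {
        if (pending_acks_ > 0) --pending_acks_;
    }

    // A StRel may complete only after every prior store is acknowledged.
    bool release_may_complete() const { return pending_acks_ == 0; }

    // A LdAcq invalidates the whole cache, including clean data.
    // Returns the number of cached entries that were discarded.
    std::size_t acquire() {
        std::size_t discarded = cache_.size();
        cache_.clear();
        return discarded;
    }

    std::size_t memory_writes() const { return memory_writes_; }

private:
    std::unordered_map<std::size_t, std::uint8_t> cache_;
    std::size_t memory_writes_ = 0;
    std::size_t pending_acks_ = 0;
};
```

In this model, three stores to the same address produce three main-memory writes and three acknowledgements to track, and a subsequent acquire discards the cached copy even though it is clean.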
The WC memory system uses cache hierarchies to coalesce store events. Executing a StRel synchronization event in the WC memory system triggers a slow and intensive cache flush to determine when the prior stores have completed to a next level of hierarchy. A cache flush entails walking through an entire cache hierarchy to track outstanding store events to completion.
In addition, write-combining caches incur overhead to track dirty bytes in cache lines in the memory hierarchy.
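The coalescing, per-byte dirty tracking, and flush walk described for the WC memory system can be sketched as follows. This is a toy, single-level model under stated assumptions: a 64-byte line (`kLineBytes`), a cache large enough to hold the whole toy address space so that no evictions occur, and hypothetical names (`WcLine`, `WriteCombiningModel`, `FlushStats`, `demo_flush`) that do not come from the source.

```cpp
#include <array>
#include <bitset>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

constexpr std::size_t kLineBytes = 64;  // assumed cache line size

struct WcLine {
    std::array<std::uint8_t, kLineBytes> data{};
    std::bitset<kLineBytes> dirty;  // per-byte dirty mask (tracking overhead)
};

struct FlushStats {
    std::size_t lines_walked = 0;        // lines inspected by the flush
    std::size_t lines_written_back = 0;  // lines that held dirty bytes
};

class WriteCombiningModel {
public:
    explicit WriteCombiningModel(std::size_t num_lines) : lines_(num_lines) {}

    // A store is coalesced in the cache and generates no memory traffic yet.
    void store(std::size_t addr, std::uint8_t value) {
        assert(addr / kLineBytes < lines_.size());
        WcLine& line = lines_[addr / kLineBytes];
        line.data[addr % kLineBytes] = value;
        line.dirty.set(addr % kLineBytes);
    }

    // StRel analogue: walk every line and write back only the dirty bytes.
    FlushStats flush(std::vector<std::uint8_t>& memory) {
        FlushStats stats;
        for (std::size_t i = 0; i < lines_.size(); ++i) {
            ++stats.lines_walked;
            WcLine& line = lines_[i];
            if (line.dirty.none()) continue;
            for (std::size_t b = 0; b < kLineBytes; ++b) {
                if (line.dirty.test(b)) memory.at(i * kLineBytes + b) = line.data[b];
            }
            line.dirty.reset();
            ++stats.lines_written_back;
        }
        return stats;
    }

private:
    std::vector<WcLine> lines_;
};

// Sixteen stores to line 0 and one store to line 1 in a four-line cache.
FlushStats demo_flush() {
    std::vector<std::uint8_t> memory(4 * kLineBytes, 0);
    WriteCombiningModel cache(4);
    for (std::size_t i = 0; i < 16; ++i) cache.store(i, 0xAB);
    cache.store(kLineBytes + 3, 0xCD);
    return cache.flush(memory);
}
```

In `demo_flush`, seventeen stores coalesce into two line write-backs, yet the flush still inspects all four lines, and each line carries a 64-bit dirty mask. Both costs grow with cache size rather than with the number of outstanding stores.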
A hierarchical directory/snooping cache coherence protocol solution is a “read for ownership” solution that could support an RCsc memory consistency model; however, memory access requests to write data encounter long delays. A requesting processor (e.g., a CPU or GPU) has to read or own a memory block before writing to local cache and completing a store event.
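The delay attributed to read for ownership can be illustrated with a toy directory entry for a single memory block. This is a sketch only: the message counts assume one request, one invalidation per other holder, and one data/ownership reply, which is a simplification and not a description of any specific protocol. `DirectoryEntry` and `store_with_rfo` are hypothetical names.

```cpp
#include <cstddef>
#include <set>

// Toy directory state for one memory block (hypothetical model).
struct DirectoryEntry {
    std::set<int> sharers;  // processors holding a copy (includes the owner)
    int owner = -1;         // processor with exclusive ownership, or -1 if none
};

// Returns the number of coherence messages that must complete before the
// requester may write the block in its local cache.
std::size_t store_with_rfo(DirectoryEntry& entry, int requester) {
    if (entry.owner == requester) {
        return 0;  // already owned: the store completes locally
    }
    std::size_t messages = 1;  // read-for-ownership request
    for (int p : entry.sharers) {
        if (p != requester) ++messages;  // one invalidation per other holder
    }
    messages += 1;  // data/ownership reply to the requester
    entry.sharers.clear();
    entry.sharers.insert(requester);
    entry.owner = requester;
    return messages;
}
```

With three other sharers, the first store by a new requester waits on five messages in this model; a second store by the same requester waits on none, and a later store by a different processor again pays the ownership transfer.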