As processors evolve to support more threads, synchronizing among those threads becomes increasingly expensive. This is particularly true for massively-threaded, throughput-oriented architectures, such as graphics processing units (GPUs), which do not support central processing unit (CPU)-style “read-for-ownership” coherence protocols. Instead, these systems maintain coherence by “pushing” data to a common level of the memory hierarchy, accessible by all threads, which acts as the global coherence point. After synchronizing, threads must then ensure that they “pull” data from this common memory level (e.g., by invalidating their caches). For discrete and integrated GPU architectures, the global coherence point occurs at the last-level cache (LLC) and at the memory controller, respectively, incurring very high latency. Many applications cannot amortize these high synchronization delays, which limits their performance on GPUs.
Scoped synchronization reduces synchronization latency by partitioning threads into sub-groups called scopes. Threads in the same scope can synchronize with each other through a common, but non-global (i.e., scoped) coherence point. For example, Heterogeneous System Architecture (HSA), a state-of-the-art specification for heterogeneous architectures (e.g., devices that include both a CPU and a GPU), extends sequentially consistent for data-race-free (SC for DRF) memory models (see, for example, Sarita V. Adve and Mark D. Hill, “Weak Ordering—A New Definition,” International Symposium on Computer Architecture (ISCA), June 1990) to include scopes. This new model is called sequentially consistent for heterogeneous race free (SC for HRF) (see, for example, Derek R. Hower, Blake A. Hechtman, Bradford M. Beckmann, Benedict R. Gaster, Mark D. Hill, Steven K. Reinhardt, David A. Wood, “Heterogeneous-race-free Memory Models,” The 19th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS-19), 2014). HSA (see, for example, HSA Foundation, “HSA Programmer's Reference Manual: HSAIL Virtual ISA and Programming Model, Compiler Writer's Guide, and Object Format (BRIG),” Publication #: 49828, Rev: Version 1.0 (Provisional), Issue Date: 5 Jun. 2014, [Online] Available: http://www.hsafoundation.com/standards/) and OpenCL 2.0 (see, for example, “OpenCL 2.0 Reference Pages,” [Online] Available: http://www.khronos.org/registry/cl/sdk/2.0/docs/man/xhtml/) introduced more general scoped synchronization primitives based on shared memory acquire/release semantics. For example, in HSA, synchronization operations are tagged with one of the following scope modifiers: work-item (wi) (i.e., GPU thread), wavefront (wv), work-group (wg), component (cmp), or system (sys). Scoped memory operations in HSA can be used to synchronize more quickly with a subset of threads (e.g., threads in the same GPU work-group).
Scoped synchronization works well for static communication patterns, where producers and consumers have well-defined, stable relationships. That is, scoped memory operations work well for regular workloads where consumer threads are known. However, current SC for HRF models are unable to optimize some important communication patterns, like work stealing. In particular, scoped synchronization works poorly for emerging dynamic workloads (e.g., workloads that use work stealing), where a faster, smaller scope cannot be used because of the rare possibility that the work is stolen by a thread in a distant, slower scope. For example, when a work-item accesses its local task queue, it would like to use a smaller scope (e.g., wg scope) to reduce synchronization overheads. However, when a work-item steals work from another task queue, it must use a larger, common scope (e.g., cmp scope). Scoped synchronization requires producers to synchronize at a scope that encompasses all of their consumers. Thus, because a stealer can be any work-item in the GPU, a work-stealing runtime requires all task queue synchronization to occur at component scope. This means that smaller scopes cannot be used to optimize dynamic local sharing patterns like work stealing.
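The rule that a producer must synchronize at a scope encompassing all of its consumers can be illustrated with a minimal sketch. The sketch is an illustrative model only and is not part of HSA or OpenCL: the WorkItem coordinates and the functions smallest_common_scope and required_scope are hypothetical names introduced here, and only the scope names (wi, wv, wg, cmp, sys) come from the text above.

```python
from dataclasses import dataclass

# Scope names ordered from smallest to largest, as listed in the text.
SCOPES = ["wi", "wv", "wg", "cmp", "sys"]


@dataclass(frozen=True)
class WorkItem:
    # Hypothetical coordinates describing where a work-item executes.
    component: int
    work_group: int
    wavefront: int
    lane: int


def smallest_common_scope(a, b):
    """Return the smallest scope whose single instance contains both work-items."""
    if a.component != b.component:
        return "sys"
    if a.work_group != b.work_group:
        return "cmp"
    if a.wavefront != b.wavefront:
        return "wg"
    if a.lane != b.lane:
        return "wv"
    return "wi"


def required_scope(producer, consumers):
    """Return the smallest scope that encompasses the producer and every consumer."""
    needed = [smallest_common_scope(producer, c) for c in consumers]
    # The largest pairwise scope wins; with no consumers, wi scope suffices.
    return max(needed, key=SCOPES.index, default="wi")


producer = WorkItem(component=0, work_group=0, wavefront=0, lane=0)
local_consumer = WorkItem(component=0, work_group=0, wavefront=1, lane=5)
possible_stealer = WorkItem(component=0, work_group=3, wavefront=0, lane=2)

print(required_scope(producer, [local_consumer]))                    # wg
print(required_scope(producer, [local_consumer, possible_stealer]))  # cmp
```

In this sketch, a single potential stealer in another work-group is enough to raise the required scope from wg to cmp, even if the steal rarely or never happens.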
When using scoped memory operations, programmers must manage both the visibility of data and the order in which it is operated on. This is different from non-scoped operations, which only require programmers to reason about their order. In practice, shared memory models have accommodated scoped semantics by applying scope modifiers to memory operations. As stated above, in HSA, memory operations are tagged with one of the following scope modifiers: wavefront (wv); work-group (wg); component (cmp); and system (sys). These scope modifiers are generic, meaning that they do not allow different scope instances to be distinguished. A scope instance is a particular instantiation of a scope (e.g., one particular work-group). This means that for an update on a particular memory address to be visible to a thread, that update must have been “pushed” (i.e., released) to a scope that is associated with (i.e., visible to) that thread. This is because a thread cannot “pull” (i.e., acquire) from a scope instance that it is not associated with. In other words, HSA defines push-pull semantics that require producers to push data to a scope that they share with their consumers.
These semantics make it difficult to optimize important communication patterns like work stealing, where consumers asynchronously read from producers. In a work-stealing runtime, producers do not know when a subset of their consumers (i.e., the stealers) will read their data; thus, they are forced to conservatively push their data to a scope that is visible to all of their consumers. Otherwise, a stealer may read a stale copy of the data.
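The stale-read hazard described above can be reproduced with a toy model of push-pull semantics. This is a simplified sketch under stated assumptions, not a model of any real GPU or of the HSA specification: it assumes one private cache per work-group and one shared component-level memory, and the class ToyGPU, its methods, and the function steal_after are hypothetical names introduced for illustration.

```python
class ToyGPU:
    """Toy model: one private cache per work-group plus one shared
    component-level memory that acts as the cmp-scope coherence point."""

    def __init__(self, num_work_groups):
        self.shared = {}                                       # cmp-scope coherence point
        self.cache = [dict() for _ in range(num_work_groups)]  # wg-scope coherence points
        self.dirty = [set() for _ in range(num_work_groups)]   # addresses not yet pushed

    def store(self, wg, addr, value):
        # Writes land in the writer's work-group cache first.
        self.cache[wg][addr] = value
        self.dirty[wg].add(addr)

    def load(self, wg, addr):
        # A miss fills the cache from shared memory (0 if never written).
        if addr not in self.cache[wg]:
            self.cache[wg][addr] = self.shared.get(addr, 0)
        return self.cache[wg][addr]

    def release(self, wg, scope):
        # "Push": a wg-scope release needs no work in this model because the
        # data already sits in the work-group cache; a cmp-scope release
        # writes dirty data back to shared memory.
        if scope == "cmp":
            for addr in self.dirty[wg]:
                self.shared[addr] = self.cache[wg][addr]
            self.dirty[wg].clear()

    def acquire(self, wg, scope):
        # "Pull": a cmp-scope acquire drops clean cached copies so later
        # loads refetch from shared memory; dirty data is kept.
        if scope == "cmp":
            self.cache[wg] = {a: v for a, v in self.cache[wg].items()
                              if a in self.dirty[wg]}


def steal_after(release_scope):
    """Work-group 0 produces a task; work-group 1 steals it at cmp scope."""
    gpu = ToyGPU(num_work_groups=2)
    gpu.store(0, "task", 42)
    gpu.release(0, release_scope)
    gpu.acquire(1, "cmp")
    return gpu.load(1, "task")


print(steal_after("wg"))   # 0  (stale: the task was never pushed to shared memory)
print(steal_after("cmp"))  # 42
```

In this model, the stealer performs its own acquire at cmp scope in both cases, yet it still observes stale data when the producer released only at wg scope, because the stealer cannot pull from the producer's work-group cache.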