Atomics are one of the fundamental synchronization techniques in modern multicore central processing units (CPUs). These operations update a memory location such that the update appears indivisible. The x86 instruction set architecture (ISA) provides two types of atomics: direct-fetch and compare-and-swap (CAS). Fetch-atomics apply an indivisible update directly to a memory address, but they are only defined for integer values and a limited set of update operations. CAS can be applied to various data types and supports a variety of update operations. To achieve this, a CAS-based update loads the value at a memory address, computes the new value, and writes the result back to the memory address only if the value stored there has not been changed in the meantime. If the value has been changed, the update has to be retried. In contrast, a fetch-atomic locks the cache line that will be updated for the complete update, from the first load until the result is written to memory.
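The difference between the two kinds of atomics can be illustrated with a minimal C++ sketch. It assumes the standard `std::atomic` interface; the helper names `fetch_increment` and `cas_multiply` are hypothetical and chosen only for illustration. The first helper uses a fetch-style update on an integer, while the second emulates an operation for which no fetch-atomic exists (multiplication of a `double`) with a CAS retry loop.

```cpp
#include <atomic>

// Fetch-style atomic: one indivisible read-modify-write on an integer.
// Returns the value stored before the increment.
inline int fetch_increment(std::atomic<int>& counter) {
    return counter.fetch_add(1);
}

// CAS-based update (hypothetical helper): multiplies a double atomically.
// There is no fetch-style multiply, so the update is built from a load,
// a local computation, and a conditional write that is retried on failure.
inline double cas_multiply(std::atomic<double>& target, double factor) {
    double expected = target.load();
    // On failure, compare_exchange_weak stores the current value of `target`
    // in `expected`, so the next iteration recomputes from fresh data.
    while (!target.compare_exchange_weak(expected, expected * factor)) {
        // Retry: another thread changed the value in the meantime.
    }
    return expected;  // value observed immediately before the successful write
}
```

Under contention the CAS loop may execute several times per update, whereas the fetch-style update always completes in a single operation.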
In a multi-threaded environment with a single shared address space, not only the atomicity of updates is important, but also the order in which they become visible to other threads. Thus, programming languages like C++ provide options to specify in which order atomics can become visible and how they can be reordered. ISAs provide ordering guarantees or mechanisms (e.g., fences) to implement the desired memory ordering. The guarantees made at the programming language level do not necessarily have to match the guarantees at the ISA level, as long as the ISA guarantees are at least as strong. For example, x86 is restrictive, as an atomic cannot be reordered with any other memory operation (loads and stores). As a consequence, even a relaxed atomic at the C++ level is often executed with stronger guarantees by the architecture.
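As a hedged illustration of language-level ordering options, the following C++ sketch uses the standard `std::memory_order` arguments. The function names `publish_and_consume` and `relaxed_count` are hypothetical. The first function relies on a release/acquire pair to make a plain write visible to another thread; the second requests only atomicity through `memory_order_relaxed`. On x86, compilers typically emit a lock-prefixed instruction for such a relaxed read-modify-write, so the hardware provides more ordering than the source code asks for.

```cpp
#include <atomic>
#include <thread>
#include <vector>

// Message passing: the release store publishes `payload`, and the acquire
// load guarantees that the consumer observes it once `ready` is true.
inline int publish_and_consume(int value) {
    int payload = 0;                  // plain, non-atomic data
    std::atomic<bool> ready{false};
    int observed = -1;

    std::thread producer([&] {
        payload = value;
        ready.store(true, std::memory_order_release);
    });
    std::thread consumer([&] {
        while (!ready.load(std::memory_order_acquire)) {
            std::this_thread::yield();  // wait until the flag is published
        }
        observed = payload;
    });
    producer.join();
    consumer.join();
    return observed;
}

// Relaxed counter: only atomicity is requested, no ordering with respect to
// other memory operations. The final sum is still exact.
inline long relaxed_count(int num_threads, int increments_per_thread) {
    std::atomic<long> counter{0};
    std::vector<std::thread> workers;
    for (int t = 0; t < num_threads; ++t) {
        workers.emplace_back([&] {
            for (int i = 0; i < increments_per_thread; ++i) {
                counter.fetch_add(1, std::memory_order_relaxed);
            }
        });
    }
    for (auto& w : workers) {
        w.join();
    }
    return counter.load();
}
```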
To complement automatic hardware pre-fetching, ISAs like x86 or ARMv8-A provide pre-fetch instructions to partially or completely hide memory access latency. These pre-fetch instructions can carry additional hints, such as the target cache level, whether temporal reuse is expected, or which type of operation (read/write) will follow.
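A minimal sketch of software pre-fetching with such hints is given below. It assumes the GCC/Clang builtin `__builtin_prefetch(addr, rw, locality)`, which is a compiler extension and not part of standard C++; the compiler maps it to the pre-fetch instruction of the target ISA where one is available. The function names and the pre-fetch distance of 16 elements are hypothetical and would have to be tuned for a concrete system.

```cpp
#include <cstddef>
#include <vector>

// Sums an array while pre-fetching elements `distance` iterations ahead.
// rw = 0 announces a read, locality = 3 a high degree of temporal reuse.
inline long long sum_with_prefetch(const std::vector<int>& data,
                                   std::size_t distance = 16) {
    long long sum = 0;
    const std::size_t n = data.size();
    for (std::size_t i = 0; i < n; ++i) {
        if (i + distance < n) {
            __builtin_prefetch(&data[i + distance], 0, 3);
        }
        sum += data[i];
    }
    return sum;
}

// Increments every element while pre-fetching for a subsequent write.
// rw = 1 announces a write, locality = 0 no expected temporal reuse.
inline void increment_with_prefetch(std::vector<int>& data,
                                    std::size_t distance = 16) {
    const std::size_t n = data.size();
    for (std::size_t i = 0; i < n; ++i) {
        if (i + distance < n) {
            __builtin_prefetch(&data[i + distance], 1, 0);
        }
        data[i] += 1;
    }
}
```

The pre-fetches are pure hints: removing them does not change the result of either function, only (potentially) its run time.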
However, in contrast to a load, a pre-fetch does not change the state of the program, as it only interacts with the cache. When a thread writes to a memory address that another thread has successfully pre-fetched but not yet loaded, the cache coherence protocol simply invalidates the pre-fetched entry. While load and store operations on x86 are serialized with respect to atomics, nothing indicates that this also holds true for pre-fetches.