Many instruction set architectures (ISAs) provide instructions to handle so-called atomic sequences. In an atomic sequence, an agent executes an operation on data in a manner that ensures that the agent has exclusive ownership of the data until the execution completes. Typically, this is implemented by a locking sequence in which a lock variable is associated with the data: the agent first obtains exclusive access to the lock variable and only then accesses the data to be operated on, which prevents other agents from accessing that data during the operation.
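As an illustration only, a locking sequence of this kind might be sketched in C11 as follows. The type `guarded_long` and the helper names are hypothetical and chosen for this example; the sketch assumes a simple spin lock built on `atomic_flag` and is not tied to any particular ISA.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical record: `lock` is the lock variable associated with `value`. */
typedef struct {
    atomic_flag lock;  /* lock variable */
    long value;        /* data to be operated on */
} guarded_long;

static void guarded_init(guarded_long *g, long initial) {
    assert(g != NULL);
    atomic_flag_clear(&g->lock);  /* start in the unlocked state */
    g->value = initial;
}

/* Spin until this agent obtains exclusive access to the lock variable. */
static void guarded_acquire(guarded_long *g) {
    while (atomic_flag_test_and_set_explicit(&g->lock, memory_order_acquire)) {
        /* another agent holds the lock; retry */
    }
}

/* Give up exclusive access so that other agents may proceed. */
static void guarded_release(guarded_long *g) {
    atomic_flag_clear_explicit(&g->lock, memory_order_release);
}

/* Atomic sequence: the data is read and written only while the lock is held. */
static long guarded_add(guarded_long *g, long delta) {
    assert(g != NULL);
    guarded_acquire(g);
    long updated = g->value + delta;
    g->value = updated;
    guarded_release(g);
    return updated;
}
```

Any number of threads could call `guarded_add` on the same `guarded_long`; each call would observe and update `value` without interference from the others, at the cost of every caller contending for the same lock variable.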
There are two typical methods for handling contended atomics: local operation and remote execution. The first method executes the atomic operation quickly, but incurs a high overhead cost due to cacheline bouncing and coherence traffic, yielding low bandwidth to the contended data. The second method has poor latency for the atomic operation, but provides high bandwidth to the contended data.
On-die contention over atomic sequences (via critical regions or other constructs) is typically left to the programmer to manage explicitly. By careful instrumentation of the original program, the granularity of atomic operations is reduced and contention may be minimized. However, such performance tuning demands excessive programmer effort to develop code that accounts for atomic operations, and the results are not generally scalable from one class of machine to another. For example, programmer-written code that avoids contention in a processor having two cores may not scale well to a many-core implementation in which many cores, each of which can execute multiple threads, are present.
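One common form of such hand tuning, shown here purely as an assumed example, is to split a contended counter into per-thread shards so that each atomic update touches a different cache line. The names `sharded_counter`, `NUM_SHARDS`, and `CACHE_LINE` are hypothetical, and the values chosen (two shards, 64-byte lines) are assumptions standing in for a two-core part.

```c
#include <assert.h>
#include <stdalign.h>
#include <stdatomic.h>
#include <stddef.h>

#define CACHE_LINE 64  /* assumed cache line size in bytes */
#define NUM_SHARDS 2   /* hand-tuned for a hypothetical two-core processor */

/* Each shard occupies its own cache line to avoid cacheline bouncing. */
typedef struct {
    alignas(CACHE_LINE) atomic_long count;
} shard;

static_assert(sizeof(shard) == CACHE_LINE, "one shard per cache line");

typedef struct {
    shard shards[NUM_SHARDS];
} sharded_counter;

static void sharded_init(sharded_counter *c) {
    for (size_t i = 0; i < NUM_SHARDS; i++)
        atomic_init(&c->shards[i].count, 0);
}

/* Each thread updates the shard selected by its id, not one shared word. */
static void sharded_add(sharded_counter *c, size_t thread_id, long delta) {
    atomic_fetch_add_explicit(&c->shards[thread_id % NUM_SHARDS].count,
                              delta, memory_order_relaxed);
}

/* Reading the total requires visiting every shard. */
static long sharded_read(sharded_counter *c) {
    long total = 0;
    for (size_t i = 0; i < NUM_SHARDS; i++)
        total += atomic_load_explicit(&c->shards[i].count,
                                      memory_order_relaxed);
    return total;
}
```

With `NUM_SHARDS` fixed at two, two threads never collide, but on a many-core, multi-threaded part many threads would map onto the same shard and contend again. This is the kind of machine-specific tuning that does not carry over from one class of machine to another.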