1. Field of the Invention
This invention relates generally to the field of computer processors. More particularly, the invention relates to an apparatus and method for selecting cache locality for atomic operations.
2. Description of the Related Art
Many instruction set architectures (ISAs) include instructions to handle so-called “atomic” sequences. In an atomic sequence, an agent executes an operation on data in a manner that ensures that the agent has exclusive ownership of the data until the execution completes. This is typically implemented by a locking sequence in which a lock variable is associated with the data: the agent first obtains exclusive access to the lock variable and only then accesses the data to be operated on, which prevents other agents from accessing that data during the operation.
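By way of illustration only, the locking sequence described above may be sketched in software as follows. The names SpinLock and locked_increment_demo are hypothetical and do not appear in any particular ISA or library; the sketch assumes a C++11 or later toolchain with thread support.

```cpp
#include <atomic>
#include <thread>
#include <vector>

// Hypothetical lock variable associated with a piece of shared data.
class SpinLock {
public:
    void lock() {
        // test_and_set atomically sets the flag and returns its previous value.
        // Spin until this agent is the one that changed it from clear to set.
        while (flag_.test_and_set(std::memory_order_acquire)) {
        }
    }
    void unlock() { flag_.clear(std::memory_order_release); }

private:
    std::atomic_flag flag_ = ATOMIC_FLAG_INIT;
};

// Each thread obtains the lock before touching shared_data, so no other
// thread can access the data while an update is in progress.
long locked_increment_demo(int num_threads, int increments_per_thread) {
    SpinLock lock;
    long shared_data = 0;  // protected by `lock`
    std::vector<std::thread> workers;
    for (int t = 0; t < num_threads; ++t) {
        workers.emplace_back([&] {
            for (int i = 0; i < increments_per_thread; ++i) {
                lock.lock();
                ++shared_data;
                lock.unlock();
            }
        });
    }
    for (auto& w : workers) {
        w.join();
    }
    return shared_data;
}
```

Calling locked_increment_demo(4, 1000) returns the sum of all increments, since every update is made while the lock variable is held.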
Some applications depend heavily on fine-grained atomic updates. The standard way to perform such updates is via a single atomic instruction (e.g., a LOCK-prefixed instruction such as “LOCK ADD”). For certain applications and/or inputs, these atomic updates may be highly contended, i.e., multiple cores may attempt to perform atomic updates on the same memory location simultaneously. Performance and energy efficiency in this situation are poor on certain processor architectures: each atomic operation requires a read-for-ownership operation to bring the cache line into the requesting core's L1 cache, which includes invalidating the cache line in the caches of all other cores. The latency of each atomic update may be significant, especially with larger numbers of cores, and because updates to a contended location are serialized, these latencies accumulate.
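A contended atomic update of the kind described above may be sketched as follows. The function name contended_add_demo is hypothetical. On x86 targets, compilers commonly lower a fetch_add on std::atomic to a LOCK-prefixed instruction, although the exact instruction selected depends on the compiler and target and is an assumption here.

```cpp
#include <atomic>
#include <thread>
#include <vector>

// Several threads repeatedly update the same memory location with a single
// atomic read-modify-write. Every thread targets the same cache line, so
// ownership of that line may move between cores on each update.
long contended_add_demo(int num_threads, int adds_per_thread) {
    std::atomic<long> counter{0};
    std::vector<std::thread> workers;
    for (int t = 0; t < num_threads; ++t) {
        workers.emplace_back([&] {
            for (int i = 0; i < adds_per_thread; ++i) {
                // One atomic update per iteration; no separate lock variable.
                counter.fetch_add(1, std::memory_order_relaxed);
            }
        });
    }
    for (auto& w : workers) {
        w.join();
    }
    return counter.load();
}
```

The result is always correct, but the time taken per update would be expected to grow as more threads contend for the same location.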
A previously proposed solution is “in-memory atomics,” in which an arithmetic logic unit (ALU) is added to the cache controller or tag directory controller external to the cores. Performing the update at that location eliminates data movement from one cache to another, and thus provides better performance and lower energy consumption. However, there are numerous issues (hardware and software) with placing an ALU outside of a core, such as how to handle exceptions.
An alternative is “remote atomics,” in which a thread requests that another core perform an atomic operation on its behalf. The idea is to send the request to a core that is believed to currently hold, in its cache, the cache line containing the address in question. This can provide the same benefits as in-memory atomics, with some simplification since existing ALUs may be used.
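Remote atomics are a hardware mechanism, but the underlying idea of forwarding a request to the agent that already holds the data may be illustrated with a loose software analogy. In the sketch below, the class OwnerCore and the functions remote_add and remote_add_demo are hypothetical names introduced for illustration only; a thread stands in for the core holding the cache line, and a queue stands in for the request path.

```cpp
#include <condition_variable>
#include <future>
#include <mutex>
#include <queue>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical software analogy: one "owner" thread keeps the data local
// and applies updates that other threads send to it.
class OwnerCore {
    struct Request {
        long delta;
        std::promise<long> reply;  // carries the pre-update value back
    };

public:
    OwnerCore() : worker_([this] { run(); }) {}

    ~OwnerCore() {
        {
            std::lock_guard<std::mutex> guard(mutex_);
            done_ = true;
        }
        cv_.notify_one();
        worker_.join();
    }

    // Ask the owner to add `delta`; the future yields the value before the add.
    std::future<long> remote_add(long delta) {
        std::promise<long> reply;
        std::future<long> result = reply.get_future();
        {
            std::lock_guard<std::mutex> guard(mutex_);
            requests_.push(Request{delta, std::move(reply)});
        }
        cv_.notify_one();
        return result;
    }

private:
    void run() {
        std::unique_lock<std::mutex> lk(mutex_);
        for (;;) {
            cv_.wait(lk, [this] { return done_ || !requests_.empty(); });
            if (requests_.empty()) {
                return;  // shutdown requested and queue drained
            }
            Request req = std::move(requests_.front());
            requests_.pop();
            lk.unlock();
            // Only the owner thread touches value_, so the data never moves.
            long old_value = value_;
            value_ += req.delta;
            req.reply.set_value(old_value);
            lk.lock();
        }
    }

    long value_ = 0;
    std::mutex mutex_;
    std::condition_variable cv_;
    std::queue<Request> requests_;
    bool done_ = false;
    std::thread worker_;  // declared last so it starts after the other members
};

// Several requester threads delegate their updates to a single owner.
long remote_add_demo(int num_threads, int adds_per_thread) {
    OwnerCore owner;
    std::vector<std::thread> requesters;
    for (int t = 0; t < num_threads; ++t) {
        requesters.emplace_back([&] {
            for (int i = 0; i < adds_per_thread; ++i) {
                owner.remote_add(1).get();
            }
        });
    }
    for (auto& r : requesters) {
        r.join();
    }
    return owner.remote_add(0).get();  // adding zero reads the final value
}
```

The analogy captures only the delegation pattern; it says nothing about how a hardware implementation would locate the owning core or route the request.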
However, using in-memory atomics or remote atomics indiscriminately may have drawbacks. For example, in non-contended cases where a cache line is re-used frequently by the same core, the conventional approach of performing the atomic operation in that core's own cache may be more efficient.