In data processing systems that have multiple processors and a shared memory system, some operations performed on data items stored in the memory system may be required to be performed atomically. The sequence of reading the memory location, performing the operation and then writing the result back to the same memory location is then performed such that it appears either not to have been performed at all or to have been performed in its entirety. In other words, the sequence is executed as one indivisible unit, hence the name atomic. This means that other operations that might be accessing the same memory location cannot access the data during the operation. Indeed, in some systems, where the system cannot determine whether an access is to the same memory location, all memory accesses are blocked during the operation. Generally, atomic operations are simple arithmetic operations that are performed on the data value. These arithmetic operations are often both commutative and associative, such that a change in the order of the operations does not change the final result. These operations typically come in two variants, “Return” and “NoReturn”: the Return variant returns to the program the original data value that was present prior to the operation being performed, and the NoReturn variant does not return a data value. In some modern languages, such as OpenCL, which is becoming popular for parallel processing, these operations do not come in two variants; only the Return variant exists. This is possible because, where the value is not required, the returned value is simply not used.
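By way of illustration only, the two variants described above may be sketched in C++ using `std::atomic`. The helper names `atomic_add_return`, `atomic_add_noreturn` and `demo_variants` are hypothetical and are not taken from any particular system; the sketch assumes an integer add as the arithmetic operation. As in the OpenCL case described above, the NoReturn form here is simply the Return form with the returned value left unused.

```cpp
#include <atomic>
#include <cassert>

// "Return" variant (hypothetical helper): performs the read-modify-write as one
// indivisible unit and hands back the value that was present before the add.
int atomic_add_return(std::atomic<int>& location, int operand) {
    return location.fetch_add(operand);
}

// "NoReturn" variant (hypothetical helper): the same indivisible read-modify-write,
// but the original value is discarded rather than returned to the caller.
void atomic_add_noreturn(std::atomic<int>& location, int operand) {
    (void)location.fetch_add(operand);
}

// Hypothetical demonstration: applies both variants to one location.
int demo_variants() {
    std::atomic<int> value{10};
    int before = atomic_add_return(value, 5);  // before holds 10, value now holds 15
    atomic_add_noreturn(value, 7);             // value now holds 22
    assert(before == 10);
    return value.load();
}
```

Calling `demo_variants()` yields 22. Because addition is commutative and associative, swapping the order of the two adds would leave that final value unchanged, although the value handed back by the Return variant would differ.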
A problem that arises with atomics is that if multiple atomics are performed on the same memory location, they need to be serialised such that the observable semantic, namely that each has been performed in full or not at all, is maintained. For systems with one or two cores this is not a significant problem. However, with modern systems having ever-increasing numbers of cores, serialising the operation of multiple cores reduces their performance to that of a single core, or in some cases to worse than this. This is because, where different cores are accessing the same memory location, the data needs to be loaded from memory into the local cache and then written back to memory for each core. When doing computations on a graphics processor, for example, an atomic instruction in a program is typically started for all the threads in the different cores at the same or similar times. This can lead to all of the atomics, or at least some of them, requiring access to the same address at similar times. In such cases the accessed cache line needs to be ping-ponged between the cores. This is slow and uses a lot of power. Given that all the threads are likely to be performing atomics at the same time, the cores will stall.
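The access pattern that gives rise to this problem may be sketched as follows, purely as an illustration. The sketch assumes ordinary CPU threads standing in for the threads of a graphics processor, and the helper name `contended_increment` and its parameters are hypothetical. Every thread performs its atomic add on the same location, so the hardware must serialise the adds and move the cache line holding that location between the cores.

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

// Hypothetical helper: launches num_threads threads that all perform atomic
// increments on one shared location and returns the final value.
long contended_increment(unsigned num_threads, long adds_per_thread) {
    std::atomic<long> shared_counter{0};  // the single contended memory location
    std::vector<std::thread> workers;
    workers.reserve(num_threads);
    for (unsigned t = 0; t < num_threads; ++t) {
        workers.emplace_back([&shared_counter, adds_per_thread] {
            for (long i = 0; i < adds_per_thread; ++i) {
                // Each add is indivisible; adds from different threads to this
                // location cannot overlap and are therefore serialised.
                shared_counter.fetch_add(1, std::memory_order_relaxed);
            }
        });
    }
    for (auto& worker : workers) {
        worker.join();
    }
    return shared_counter.load();
}
```

For example, `contended_increment(4, 10000)` returns 40000 whatever order the threads happen to run in, since each add is performed in full and addition is commutative. The sketch shows only that correctness is preserved; the cost of the serialisation depends on the hardware and is not measured here.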
It would be desirable to improve the performance of multiple cores performing operations, in particular atomic operations, on the same memory locations.