Modern computer systems can split an executing program into two or more threads of operation (referred to as ‘threads’) for more efficient processing. Single-processor systems execute multiple threads by periodically switching between them, a technique called ‘multithreading.’ Systems with multiple processors or processor cores (henceforth referred to collectively as ‘processing elements’) can execute multiple threads simultaneously on different processing elements. Such functionality, which is called simultaneous multithreading (SMT), is gaining popularity as the computer industry turns to multi-processor or multi-core systems for improved performance.
SMT complicates memory access because significant increases in memory bandwidth and more-efficient data-sharing techniques are often needed to support coherence and atomicity. Efforts to overcome this problem include cache-coherent non-uniform memory access (ccNUMA), which safely coordinates data accesses in systems with multiple SMT processors. These efforts fall short because of significant data-transfer inefficiencies and latency overheads. For example, passing shared data between multiple processors can consume a large percentage of the multi-processing bandwidth, especially when highly contended data is forwarded to a processor that is attempting to execute an operation involving the data. Therefore, there is a need for techniques and systems that support improved protocols for coherence and atomic operations.
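The contention described above can be illustrated with a minimal software sketch, offered only as an assumption-laden example and not as part of any described system. Several threads perform atomic increments on one shared counter; on typical cache-coherent hardware, the data holding that counter is passed between the processing elements each time a different thread updates it. The function name `contended_increment` and its parameters are hypothetical and chosen for illustration.

```cpp
#include <atomic>
#include <cstdint>
#include <thread>
#include <vector>

// Hypothetical helper (not from the source text): every thread performs
// atomic increments on ONE shared counter, so the counter is highly
// contended data shared by all threads.
std::uint64_t contended_increment(unsigned num_threads,
                                  std::uint64_t iters_per_thread) {
    std::atomic<std::uint64_t> counter{0};
    std::vector<std::thread> workers;
    workers.reserve(num_threads);

    for (unsigned t = 0; t < num_threads; ++t) {
        workers.emplace_back([&counter, iters_per_thread] {
            for (std::uint64_t i = 0; i < iters_per_thread; ++i) {
                // Atomic read-modify-write: no increment is lost, even
                // when several threads update the counter concurrently.
                counter.fetch_add(1, std::memory_order_relaxed);
            }
        });
    }

    // Wait for all threads to finish before reading the final value.
    for (auto& worker : workers) {
        worker.join();
    }
    return counter.load();
}
```

Calling `contended_increment(4, 100000)` returns 400000 regardless of how the threads interleave, which is the atomicity guarantee. The sketch does not measure the data-transfer cost of that guarantee; it only shows the access pattern in which shared data must be repeatedly forwarded among processing elements.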