High-performance computer systems may utilize multiple processors to increase processing power. Processing workloads may be divided and distributed among the processors, thereby reducing execution time and increasing performance. One architectural model for high-performance multiprocessor systems is the cache coherent Non-Uniform Memory Access (ccNUMA) model. Under the ccNUMA model, system resources such as processors and random access memory may be segmented into groups referred to as Locality Domains, also referred to as “nodes” or “cells”. Each node may comprise one or more processors and physical memory. A processor in a node may access the memory in its own node, referred to as local memory, as well as memory in other nodes, referred to as remote memory.
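The grouping of processors and memory into nodes, and the resulting distinction between local and remote memory, can be illustrated with a minimal sketch. The names `Node`, `classify_access`, and `NODES`, and the two-node topology with 1 GiB of memory per node, are hypothetical and chosen only for illustration; they do not describe any particular system.

```python
from dataclasses import dataclass

GIB = 1 << 30  # bytes in one gibibyte


@dataclass(frozen=True)
class Node:
    """One locality domain: a set of processors plus a range of physical memory."""
    node_id: int
    cpus: frozenset   # processor IDs belonging to this node
    mem_start: int    # first physical address owned by this node (inclusive)
    mem_end: int      # end of this node's address range (exclusive)

    def owns(self, address: int) -> bool:
        """Return True if the address falls inside this node's memory."""
        return self.mem_start <= address < self.mem_end


def node_of_cpu(nodes, cpu):
    """Find the node that contains the given processor."""
    for node in nodes:
        if cpu in node.cpus:
            return node
    raise ValueError(f"processor {cpu} is not in any node")


def classify_access(nodes, cpu, address):
    """Classify a memory access as 'local' or 'remote' for the given processor."""
    home = node_of_cpu(nodes, cpu)
    if home.owns(address):
        return "local"
    if any(node.owns(address) for node in nodes):
        return "remote"
    raise ValueError(f"address {address:#x} is not backed by any node")


# Hypothetical topology: two nodes, two processors and 1 GiB of memory each.
NODES = [
    Node(node_id=0, cpus=frozenset({0, 1}), mem_start=0, mem_end=GIB),
    Node(node_id=1, cpus=frozenset({2, 3}), mem_start=GIB, mem_end=2 * GIB),
]

if __name__ == "__main__":
    print(classify_access(NODES, cpu=0, address=0x1000))        # memory in node 0
    print(classify_access(NODES, cpu=0, address=GIB + 0x1000))  # memory in node 1
```

In this sketch, processor 0 sees the first address as local because both belong to node 0, and the second as remote because it is backed by node 1.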
In ccNUMA systems, there may be performance penalties for accessing remote memory, and there may also be latencies when multiple programs or instruction streams attempt to update the same memory locations simultaneously. These latencies may derive from waiting for other programs or instruction streams to complete their updates, or from the overhead of the coherence protocols that keep the memory consistent.
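The two costs described above can be made concrete with a toy model. This is a rough sketch only: the function names are hypothetical, the latency figures in the example are arbitrary placeholders rather than measurements, and the contention estimate assumes that updates to one location are strictly serialized, which is a simplification of real coherence protocols.

```python
def expected_access_latency(remote_fraction, local_ns, remote_ns):
    """Average latency per access when a given fraction of accesses is remote.

    remote_fraction: share of accesses that go to remote memory, in [0, 1].
    local_ns, remote_ns: assumed per-access latencies (placeholder values).
    """
    if not 0.0 <= remote_fraction <= 1.0:
        raise ValueError("remote_fraction must be between 0 and 1")
    return (1.0 - remote_fraction) * local_ns + remote_fraction * remote_ns


def worst_case_update_wait(writers, update_ns):
    """Longest wait for one writer if updates to a location are serialized.

    Assumes each of the other writers completes one update first, so the
    last writer waits for (writers - 1) updates before its own begins.
    """
    if writers < 1:
        raise ValueError("writers must be at least 1")
    return (writers - 1) * update_ns


if __name__ == "__main__":
    # Placeholder figures: 100 ns local, 300 ns remote, 25% remote accesses.
    print(expected_access_latency(0.25, local_ns=100, remote_ns=300))
    # Four instruction streams updating one location, 50 ns per update.
    print(worst_case_update_wait(writers=4, update_ns=50))
```

Under these assumptions, the average access cost grows linearly with the fraction of remote accesses, and the wait caused by contention grows with the number of instruction streams updating the same location.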