Business applications like transaction processing require multiprocessor systems which can execute a large number of relatively independent threads. Computer systems using multiple processors have existed for decades in various forms, the most common of which have been multiprocessing servers and mainframes. The advent of inexpensive, high-performance processors has provided impetus to the development of multiprocessor designs.
A common architecture in the art has been referred to as Symmetrical Multiprocessing (SMP). Multiple processors are, by definition, “symmetrical” if any of them can execute any given function. In simple SMP systems, each processor has equal access to all of the system memory via a centralized, shared memory controller. The “cost” of a memory access is statistically uniform across the SMP address space, since the average memory-access latency is substantially the same for each processor.
Because each processor also maintains its own on-board data cache, frequent data exchanges between processors are required to keep the caches and main memory synchronized. These housekeeping transactions consume processor cycles, which is one reason that SMP performance does not scale linearly with the number of processors. Another reason is that all data fetched from memory must travel to the processors via a single memory bus. With only one bus to serve the data needs of multiple processors, the memory bus can become a serious bottleneck as the number of processors increases.
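The housekeeping traffic described above can be illustrated with a toy model. The sketch below assumes a simplified invalidate-on-write scheme (not any specific real coherence protocol), and all names and counts are illustrative: a write by one processor invalidates every other cached copy of the line, and each invalidation is a bus message that consumes cycles without doing useful work.

```python
# Toy model of cache-housekeeping traffic under an assumed
# invalidate-on-write scheme; names and structure are illustrative.

def make_system(n_procs):
    """Track which processors hold a cached copy of one shared line."""
    return {"cached": [False] * n_procs, "invalidations": 0}

def read_line(sys, p):
    """A read pulls the shared line into processor p's cache."""
    sys["cached"][p] = True

def write_line(sys, p):
    """A write invalidates every other copy, one bus message each --
    cycles spent on housekeeping rather than application work."""
    for q, has_copy in enumerate(sys["cached"]):
        if q != p and has_copy:
            sys["cached"][q] = False
            sys["invalidations"] += 1
    sys["cached"][p] = True

s = make_system(2)
read_line(s, 0)   # processor 0 caches the line
read_line(s, 1)   # processor 1 caches the same line
write_line(s, 0)  # write by 0: processor 1's copy must be invalidated
write_line(s, 1)  # write by 1: processor 0's copy must be invalidated
print(s["invalidations"])  # -> 2
```

Even in this two-processor toy, every alternating write generates a bus message; with more processors sharing more lines, the volume of such messages grows and contributes to the non-linear scaling noted above.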
Designers in the mid-1990s developed a Non-Uniform Memory Access (NUMA) scheme. In this model, each processor is provided direct access to a private area of main memory. A processor can access its private “local” memory via a dedicated memory controller without using the system bus, whereas other processors must use the bus to access that memory. The global memory space is divided into constituent memory domains, and the latency to local memory is much lower than the latency to memory residing on another processor. This scheme is “non-uniform” because the cost of a memory access depends on where the accessed memory is located.
Since NUMA processors can access their local data directly, the number of processors a system can support without a serious memory bottleneck is substantially greater. In addition, because the processors still share a single global memory space, the system appears to user applications as one homogeneous memory area.
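The division of the global space into per-node domains can be sketched as follows. The assumptions are a flat address space split evenly across nodes and illustrative latency figures (real local-to-remote ratios vary by system); the constants and function names are hypothetical.

```python
# Sketch of non-uniform access cost in a NUMA-style layout.
# Assumption: the global space is split into equal contiguous
# per-node domains; latency figures are illustrative only.

NODE_MEM = 1024          # bytes of local memory per node (toy size)
LOCAL_LATENCY = 100      # ns, via the node's dedicated controller
REMOTE_LATENCY = 300     # ns, via the shared system bus

def home_node(addr):
    """Map a global address to the node whose domain contains it."""
    return addr // NODE_MEM

def access_latency(node, addr):
    """Local accesses bypass the system bus; remote ones cannot."""
    return LOCAL_LATENCY if home_node(addr) == node else REMOTE_LATENCY

print(access_latency(0, 512))    # node 0 -> its own domain: 100
print(access_latency(0, 1500))   # node 0 -> node 1's domain: 300
```

Note that both calls address the same flat space: software sees one homogeneous memory area, and only the latency reveals which domain an address belongs to.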
While the “cost” of a memory access with respect to the execution pipeline in a NUMA system is non-uniform, conventional replacement policies continue to be employed in the caches of the individual processors.
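One conventional replacement policy of the kind referred to above is least-recently-used (LRU); the choice of LRU here is illustrative, as the text does not name a specific policy. The sketch below highlights the point of the passage: the policy ranks cache lines purely by recency of use and is blind to whether a line's home memory is local or remote.

```python
# Minimal sketch of a conventional (uniform-cost) replacement
# policy: LRU. Illustrative only -- note that eviction decisions
# ignore whether the evicted line's home memory is local or remote.
from collections import OrderedDict

class LRUCache:
    def __init__(self, capacity):
        self.capacity = capacity
        self.lines = OrderedDict()   # address -> data, oldest first

    def access(self, addr, data=None):
        if addr in self.lines:
            self.lines.move_to_end(addr)        # hit: mark most recent
        else:
            if len(self.lines) >= self.capacity:
                self.lines.popitem(last=False)  # evict least recent
            self.lines[addr] = data             # fill from memory

c = LRUCache(capacity=2)
c.access(0xA)
c.access(0xB)
c.access(0xA)            # touch 0xA so 0xB becomes least recent
c.access(0xC)            # cache full: evicts 0xB, keeps 0xA
print(list(c.lines))     # -> [10, 12], i.e. 0xA and 0xC remain
```

Because such a policy weighs every line equally, it may evict a line whose refill must cross the system bus while retaining one that could be refetched cheaply from local memory, even though those two misses have very different costs in a NUMA system.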
The figures are not drawn to scale.