An almost universal goal of computer scientists and engineers is to increase processing speed. One way to do this is to have more processors at work simultaneously; hence developments such as parallel and multi-core architectures, for example SMP (Symmetric Multi-Processing). FIG. 1 illustrates a simplified schematic of a quad-core SMP system on a single socket 10, in which four processor cores 20a, . . . , 20d, share a set of memory devices 40i, . . . , 40iv, with memory access being coordinated by a memory controller 30. As FIG. 1 illustrates, more than one processor (or, equivalently, processor core) may thus contend for access to the shared resource, in this case, memory (RAM).
Since the switch to ubiquitous multi-core architectures, it has become clear that scalability lies in multithreaded programming. It is not uncommon, for example, for workloads to run dozens of threads executing in parallel. At the operating system level, there may be hundreds of processes executing at the same time, taking advantage of the multiple cores available on the CPU (or multiple CPUs in such architectures), and of technology such as Hyper-Threading, which allows a single physical core to expose multiple logical cores to the system to maximize its utilization. In recent years, however, the single memory bus available in traditional SMP systems has increasingly been regarded as a major performance bottleneck. In other words, contention for the single shared resource has caused performance to suffer.
One attempt to alleviate the memory bottleneck involves complex cache hierarchies in hardware. Despite this, many workloads still depend heavily on memory, access to which remains the main cause of execution slow-down. As a result of high access latency, a CPU can thus become “starved for memory,” that is, no further instructions can be executed until data has been retrieved from memory. While already a concern on single-core CPUs, this problem is all the worse in multi-core systems, in which not one but many cores can stall at once waiting for memory access due to access latency or the limited bandwidth available on the memory bus. This issue undermines the benefits of concurrent execution, and worsens as the number of cores on a CPU increases. Therefore, a new, more scalable architecture was necessary to extract the full benefits of multi-core parallelism.
This need led to the rise of non-uniform memory access (NUMA) architectures. These systems are more scalable, as they consist of multiple sockets or “nodes,” each of which has a possibly multi-core CPU, a local memory controller and local RAM. Nodes are linked through high-speed interconnects. FIG. 2 illustrates a simplified four-node (Socket 0, Socket 1, Socket 2, Socket 3) NUMA system, in which each node has the general structure of the single node shown in FIG. 1, and in which the different socket pairs are linked via respective high-speed interconnects 200a, . . . , 200d. 
To understand the concept of NUMA, imagine students studying at respective tables in a library, where each table may have room for more than one student: If the books on each table are the ones that the students sitting there most need to read, then there will be less need to walk around to get them. All books will be available, but a student might need to walk to some other table to get a book that isn't at his own table. It will be faster to get books from adjacent tables, and will take longer if he must walk to tables farther away. Depending on the library, he might even need to go to the general stacks to get still other books, or request assistance from a librarian.
Similarly, the general idea behind NUMA systems is that memory assigned to each node should ideally contain the information most needed by the processor cores in that node; thus, the most needed memory contents will be “local” for those nodes and can be accessed faster, using a bus associated with each respective socket/node. Information stored in the memory associated with other nodes is “remote”—it can be accessed, but more slowly. If a given node is connected to another by one of the high-speed interconnects, then information can be transferred between the memory associated with the respective nodes faster than otherwise, but still not as fast as within a node. In some cases, a core in one node needs access to memory associated with a node with which its node does not have a direct high-speed interconnect. If no general bus is included, then a “hop” will be required via nodes that are interconnected. For example, Socket 0 in FIG. 2 could get data from the RAM associated with Socket 3 by hopping via Socket 2. In short, in a NUMA system, processors can access the memory local to their own respective nodes faster than memory local to another processor or memory shared between processors. Despite being distributed throughout the system, memory in NUMA is thus still typically presented to the programmer as a global, shared address space: Any memory location can be accessed by any CPU, although some accesses (local ones) can complete faster than others.
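The “hop” behavior described above can be pictured as a shortest-path search over the interconnect graph. The following Python sketch assumes, based only on the example of Socket 0 reaching Socket 3 via Socket 2, that the four interconnects of FIG. 2 form a ring joining Sockets 0-1, 1-3, 3-2, and 2-0; this pairing, and the names `INTERCONNECTS`, `route`, and `hops`, are illustrative assumptions rather than details taken from the figure.

```python
from collections import deque

# Assumed ring topology for the four sockets of FIG. 2 (an inference from
# the text, not a statement of the actual figure): each pair listed here
# is joined by one high-speed interconnect.
INTERCONNECTS = [(0, 1), (1, 3), (3, 2), (2, 0)]


def build_neighbors(links):
    """Return a dict mapping each node to the set of directly linked nodes."""
    neighbors = {}
    for a, b in links:
        neighbors.setdefault(a, set()).add(b)
        neighbors.setdefault(b, set()).add(a)
    return neighbors


def route(links, source, target):
    """Breadth-first search for a shortest node path from source to target.

    Returns the list of nodes visited, including both endpoints, or None
    if the target cannot be reached over the given links.
    """
    neighbors = build_neighbors(links)
    queue = deque([[source]])
    seen = {source}
    while queue:
        path = queue.popleft()
        if path[-1] == target:
            return path
        for nxt in sorted(neighbors.get(path[-1], ())):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None


def hops(links, source, target):
    """Number of interconnect traversals on a shortest route (0 = local)."""
    path = route(links, source, target)
    return None if path is None else len(path) - 1
```

Under the assumed ring, `hops(INTERCONNECTS, 0, 3)` evaluates to 2, since Socket 0 must pass through an intermediate socket, while `hops(INTERCONNECTS, 0, 1)` evaluates to 1 and a local access costs 0 hops.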
Note that high-speed interconnects could also be implemented for each diagonal pair of sockets, that is, connecting Socket 0 with Socket 3, and Socket 1 with Socket 2. This would eliminate the “hop” (and the associated performance degradation) between otherwise non-interconnected nodes. Interconnects are hardware structures, however, so each such interconnect complicates the architecture. In order to extract performance benefits from the non-uniform memory layout, it is therefore important to maximize memory locality on such systems: high numbers of remote accesses can severely degrade performance, in comparison to traditional SMP systems.
As with regular SMP, memory performance in a NUMA system may be improved by the use of a hierarchy of caches at each node. Note that initial NUMA designs did not implement cache coherence across nodes, which meant that a processor was not guaranteed to retrieve the latest data if the memory reference it was accessing was found in its local cache but had already been modified on another node. Although easier to design and manufacture, this model was found to make programming for such systems prohibitively complex. As a result, NUMA machines today are typically (but not necessarily) assumed to be ccNUMA (cache-coherent NUMA).
Under NUMA, memory references from a CPU's point of view can be divided into remote ones, which reside on other nodes, and local ones, which are stored in the CPU's local bank. When a CPU accesses memory, it first queries its local caches. If no level in the hierarchy contains the required data and the address is local, it will be retrieved from the local RAM. On the other hand, if it is remote, the CPU has to stall while memory is accessed over the high-speed interconnect. (Note that a CPU might also stall even for local memory accesses.) The non-uniform characteristics for NUMA systems are due to the increased latency penalty incurred when going over the interconnect.
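The lookup order just described can be summarized in a short sketch. Everything here is a simplified model with hypothetical names (`classify_access`, `cache_levels`, `home_node`); real hardware resolves these steps in the cache and memory controllers, not in software.

```python
def classify_access(address, cpu_node, cache_levels, home_node):
    """Classify where a memory access is satisfied in a toy NUMA model.

    address      -- the memory address being read
    cpu_node     -- node (socket) of the CPU issuing the access
    cache_levels -- list of sets of cached addresses, nearest level first
    home_node    -- dict mapping each address to the node whose RAM holds it
    """
    # Step 1: query the local cache hierarchy, nearest level first.
    for level, cached in enumerate(cache_levels, start=1):
        if address in cached:
            return f"cache L{level}"
    # Step 2: on a miss at every level, a local address comes from local RAM.
    if home_node[address] == cpu_node:
        return "local RAM"
    # Step 3: otherwise the access must go over the interconnect.
    return "remote RAM"
```

Only the third outcome incurs the interconnect latency penalty that gives NUMA its non-uniform character.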
Clearly, NUMA will favor some types of workloads over others. For example, workloads with small working sets that can be mostly contained in caches should generally not experience substantial slowdowns due to the distributed nature of the system. For memory-intensive workloads, however, good performance can typically be achieved only if the data can be spread across the system such that each processor can load data only (or at least predominantly) from its local bank and thus avoid expensive (time-consuming) remote accesses. Unfortunately, due to the dynamics of CPU scheduling, load-balancing, memory allocations, and several other factors, achieving sufficient locality of accesses in the general case is difficult. Different operating systems have taken different approaches.
With NUMA, the proper “positioning” of data and code in the overall memory system thus becomes essential. In particular, the number of remote accesses by each processor should ideally be minimized, or else not only would any potential advantages of NUMA be negated, but performance might suffer even further than on a symmetric architecture due to the high interconnect latency. This “locality” problem can be addressed in a variety of ways, none of which are mutually exclusive.
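To make this trade-off concrete, average memory latency can be modeled as a weighted mix of local and remote latencies. The sketch below is a first-order model only: the function names are hypothetical, any latency figures used with it are illustrative rather than measured, and it ignores caches, bandwidth limits, and multi-hop routes.

```python
def average_latency(local_ns, remote_ns, remote_fraction):
    """Expected latency when a given fraction of accesses is remote."""
    if not 0.0 <= remote_fraction <= 1.0:
        raise ValueError("remote_fraction must be between 0 and 1")
    return (1.0 - remote_fraction) * local_ns + remote_fraction * remote_ns


def break_even_remote_fraction(local_ns, remote_ns, uniform_ns):
    """Remote fraction at which the NUMA average equals a uniform latency.

    Solves (1 - r) * local_ns + r * remote_ns == uniform_ns for r.
    Above the returned fraction, the modeled NUMA system is slower than
    the uniform-access system it is compared against.
    """
    if remote_ns <= local_ns:
        raise ValueError("model assumes remote accesses are slower than local")
    return (uniform_ns - local_ns) / (remote_ns - local_ns)
```

With purely hypothetical figures of 80 ns local, 160 ns remote, and 100 ns for a uniform-access system, the break-even point is a remote fraction of 0.25; in this model, a workload that goes remote more often than that would run slower than on the symmetric architecture.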
A third approach would be to include optimizations at the operating system level. This is a particularly attractive option, as the OS controls every layer of execution and has full knowledge of the topology it runs on, as well as the current state of the system. In traditional SMP systems (the most common form of UMA, or Uniform Memory Access architecture), all processors (or cores) share one memory bus, and therefore have uniform access time to all of memory. The main focus of modern operating systems' memory management modules is their paging policy: which pages to fetch into memory, which frame to load them into, and which pages to swap to disk in order to make room for new ones. The most attention is typically given to the algorithm for selecting pages to swap in and out, so as to reduce the occurrence of problems such as thrashing, in which the same pages are repeatedly pushed to disk and then accessed again soon afterwards, causing a heavy performance hit.
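The thrashing effect mentioned above can be reproduced with a small page-replacement simulation. The sketch uses least-recently-used (LRU) eviction purely for illustration; the text does not name a particular policy, and the function name `count_page_faults` and the access pattern are hypothetical.

```python
from collections import OrderedDict


def count_page_faults(accesses, frame_count):
    """Count page faults for a sequence of page numbers under LRU eviction."""
    if frame_count < 1:
        raise ValueError("frame_count must be at least 1")
    frames = OrderedDict()  # keys are resident pages, oldest use first
    faults = 0
    for page in accesses:
        if page in frames:
            frames.move_to_end(page)  # mark as most recently used
            continue
        faults += 1
        if len(frames) >= frame_count:
            frames.popitem(last=False)  # evict the least recently used page
        frames[page] = True
    return faults


# A loop over four pages with only three frames: each page is evicted
# shortly before it is needed again, so every access faults.
cyclic = [0, 1, 2, 3] * 3
```

In this toy case, `count_page_faults(cyclic, 3)` reports a fault on all 12 accesses, whereas a single additional frame (`frame_count=4`) reduces this to the 4 unavoidable first-touch faults.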
With the advent of NUMA, new aspects need to be considered. For example, the importance of memory placement has risen dramatically, so that where in memory pages are loaded now matters just as much as which pages are fetched. What is more, it is no longer enough to fetch a page and keep it in memory if it is accessed frequently. Often, processes will be scheduled to run on various nodes rather than stick to a single one, depending on the load distribution in the system; consequently, memory that was once local to a process may suddenly become remote. Dynamic detection of changes in locality and proactive migration of pages, as well as locality-aware scheduling, are therefore needed to keep performance high.
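One way to picture dynamic locality detection is a periodic pass over per-page access counters that proposes moving a page to the node that now uses it most. The sketch below is an illustrative heuristic under assumed data structures, not a description of any particular operating system's policy; `plan_migrations`, the counter layout, and the `min_ratio` threshold are all hypothetical.

```python
def plan_migrations(access_counts, home_node, min_ratio=2.0):
    """Propose page migrations toward the node that accesses each page most.

    access_counts -- dict: page -> dict of node -> accesses in the last interval
    home_node     -- dict: page -> node whose RAM currently holds the page
    min_ratio     -- a page moves only if its busiest node made at least
                     min_ratio times as many accesses as the current home
                     node, which damps ping-ponging between nodes
    Returns a dict: page -> proposed new home node.
    """
    plan = {}
    for page, counts in access_counts.items():
        if not counts:
            continue
        current = home_node[page]
        # Busiest node; ties are broken toward the lower node number.
        busiest = min(counts, key=lambda node: (-counts[node], node))
        if busiest == current:
            continue
        local_hits = counts.get(current, 0)
        if counts[busiest] > 0 and counts[busiest] >= min_ratio * local_hits:
            plan[page] = busiest
    return plan
```

For example, a page homed on node 0 that saw 5 accesses from node 0 and 50 from node 1 would be proposed for migration to node 1, while a page with 30 and 40 accesses respectively would stay put under the default threshold, since the imbalance is too small to justify the cost of a move.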