Modern servers typically have a Non-Uniform Memory Access (NUMA) architecture, which places subsets of system memory near subsets of the system's CPU cores. NUMA is a memory design used in multiprocessor systems in which memory access time depends on the location of the memory relative to the processor. Under NUMA, a processor can access its own local memory faster than non-local memory, that is, memory local to another processor or memory shared between processors. NUMA attempts to address the problem of processors being starved for data while waiting for memory accesses to complete. It provides a separate memory allocation for each processor (or group of processors) in a multiprocessor system, thereby avoiding the performance degradation that occurs when several processors attempt to address the same memory. Each set of CPU cores, together with its associated local memory, is known as a NUMA “node.”
Usually, each CPU core in the system has some memory that is local to it. Because memory is attached to particular CPU cores, the system topology may become more complex, for example by having non-trivial depth. A “flat” or unstructured system topology may have uniform memory access times, regardless of the CPU core in use. However, it is increasingly difficult to scale flat system architectures to the capacities expected of modern systems. A NUMA system usually has multiple memory latencies, reflecting the two or more levels in its system topology. On a NUMA system, applications that do not use memory local to their CPU cores may experience sub-optimal memory latency when accessing data on remote NUMA nodes.
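The effect of remote accesses on memory latency can be illustrated with a simple weighted average. The following is a minimal sketch, not a measurement: the function name `average_latency_ns` and the latency figures in the usage note are hypothetical, and the model assumes only two latency levels (local and remote).

```python
def average_latency_ns(local_ns, remote_ns, local_fraction):
    """Return the expected latency of one memory access.

    local_fraction is the share of accesses served from the local NUMA
    node; the remaining accesses are assumed to go to a remote node.
    """
    if not 0.0 <= local_fraction <= 1.0:
        raise ValueError("local_fraction must be between 0 and 1")
    return local_fraction * local_ns + (1.0 - local_fraction) * remote_ns
```

For example, with an assumed local latency of 100 ns and an assumed remote latency of 200 ns, `average_latency_ns(100, 200, 0.5)` gives 150.0, while an application that keeps all of its accesses local stays at 100.0.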
Operating system process schedulers typically optimize for CPU utilization, rather than for CPU core and memory affinity. This means that application jobs are frequently scheduled across various CPU cores on a NUMA system, without sufficient regard for the impact on application memory latency.
Currently, knowledgeable performance experts can manually place application processes on, and bind them to, the system resources best suited to the specific server topology, optimizing CPU core and memory affinity and thus improving system and application performance. Some users might study documentation and performance tuning white papers to learn how to do expert NUMA tuning. Other users might hire performance consultants to do custom tuning and application load balancing analysis for their production environments.
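As an illustration of the manual binding described above, the following sketch restricts a process to the CPU cores of one NUMA node. It assumes a Linux system that exposes NUMA topology under `/sys/devices/system/node`; the helper names `parse_cpulist`, `cpus_of_node`, and `bind_to_node` are hypothetical. It sets CPU affinity only and does not set an explicit memory placement policy.

```python
import os
from pathlib import Path


def parse_cpulist(text):
    """Expand a Linux cpulist string such as "0-3,8-11" into a set of CPU ids."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue  # an empty list means the node has no CPUs
        if "-" in part:
            first, last = part.split("-")
            cpus.update(range(int(first), int(last) + 1))
        else:
            cpus.add(int(part))
    return cpus


def cpus_of_node(node, sysfs_root="/sys/devices/system/node"):
    """Read the CPU ids of one NUMA node from sysfs (assumes Linux)."""
    path = Path(sysfs_root) / f"node{node}" / "cpulist"
    return parse_cpulist(path.read_text())


def bind_to_node(node, pid=0):
    """Restrict a process (pid 0 means the caller) to the cores of one node."""
    cpus = cpus_of_node(node)
    if not cpus:
        raise RuntimeError(f"NUMA node {node} has no CPUs")
    os.sched_setaffinity(pid, cpus)  # available on Linux only
    return cpus
```

Calling `bind_to_node(0)` early in a program would keep its threads on node 0's cores. Choosing which node suits which process is the part that still requires knowledge of the server topology and the workload.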