Modern servers with two or more processors employ architectures with multiple sockets, each containing processor cores, memory, and other resources, on a single motherboard. Some multi-socket architectures use non-uniform memory access (NUMA) for memory access by the processors of the multiple sockets. NUMA gives the processors on each socket access to memory local to that socket, while also providing access to a shared pool of memory (e.g., the local memory of the other sockets). Memory access times for the processor cores of the different sockets vary depending on the location of the memory relative to the socket: accessing memory directly attached to the socket is faster than accessing memory attached to a remote socket, because remote accesses incur a performance penalty for traversing inter-CPU links (e.g., Intel QuickPath Interconnect (QPI)).
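The local-versus-remote asymmetry above can be illustrated with a minimal cost model; the latency figures and the flat single-hop penalty are hypothetical assumptions, not measurements of any particular platform.

```python
# Toy NUMA access-cost model (all latency numbers are hypothetical).
LOCAL_ACCESS_NS = 80    # assumed latency for socket-local DRAM
REMOTE_PENALTY_NS = 60  # assumed extra cost of traversing one inter-CPU link

def access_latency_ns(core_socket: int, memory_socket: int) -> int:
    """Return the modeled latency for a core on `core_socket` accessing
    memory attached to `memory_socket`."""
    if core_socket == memory_socket:
        return LOCAL_ACCESS_NS  # local access: no link traversal
    # Remote access: pay the QPI-style inter-socket hop penalty.
    return LOCAL_ACCESS_NS + REMOTE_PENALTY_NS

print(access_latency_ns(0, 0))  # local access  -> 80
print(access_latency_ns(0, 1))  # remote access -> 140
```

Real platforms expose the actual topology (e.g., via ACPI SRAT/SLIT tables or `/sys/devices/system/node` on Linux); the sketch only captures the qualitative local/remote distinction.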
In addition to local and remote memories, other devices (e.g., network interface controllers (NICs), Peripheral Component Interconnect Express (PCIe) devices, etc.) also have a locality with respect to the sockets. In some cases, teaming is implemented for the devices of the multiple sockets, in which a group of the devices operates as a single logical element. For example, NIC teaming (or link aggregation) allows multiple NICs to operate as a single logical NIC, providing benefits such as bandwidth aggregation, link redundancy, and/or load balancing. NIC teaming can be implemented by physical switches, operating systems, or hypervisors (e.g., VMware's ESX hypervisor).
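The teaming behavior described above can be sketched as a toy model in which several physical NICs present themselves as one logical NIC; the class names, interface names, and link speeds are illustrative assumptions, not any vendor's API.

```python
# Minimal sketch of NIC teaming: bandwidth aggregation, load balancing,
# and link redundancy behind a single logical interface.
class Nic:
    def __init__(self, name: str, gbps: int, up: bool = True):
        self.name, self.gbps, self.up = name, gbps, up

class NicTeam:
    """A group of physical NICs operating as one logical NIC."""
    def __init__(self, nics: list):
        self.nics = nics

    def bandwidth_gbps(self) -> int:
        # Bandwidth aggregation: capacity is the sum of healthy members.
        return sum(n.gbps for n in self.nics if n.up)

    def transmit(self, packet_id: int) -> str:
        # Modulo-based load balancing; failed links are skipped,
        # which provides the redundancy property.
        healthy = [n for n in self.nics if n.up]
        return healthy[packet_id % len(healthy)].name

team = NicTeam([Nic("eth0", 10), Nic("eth1", 10)])
print(team.bandwidth_gbps())  # aggregated capacity -> 20
team.nics[0].up = False       # simulate a link failure on eth0
print(team.transmit(0))       # traffic fails over -> eth1
```

Production implementations (e.g., the Linux bonding driver or a hypervisor virtual switch) add hashing policies, LACP negotiation, and failback logic that this sketch omits.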
The locality of NICs (or other devices) to the sockets of a multi-socket architecture is an important characteristic to consider when configuring NIC teams for high performance. For example, in a network input/output (I/O) context, placing packets in memory attached to a local socket, processing them on local processor cores, and transmitting them on local NICs is more efficient than a workload placement that involves cross-socket memory access. Existing load-balancing and scheduling algorithms, however, are not optimized for multi-socket architectures.
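A locality-aware team scheduler along these lines can be sketched as follows; the NIC names and the NIC-to-socket mapping are hypothetical, and the fallback policy (any team member when no local NIC exists) is one of several reasonable choices.

```python
# Sketch of socket-aware NIC selection within a team.
# The mapping of NICs to sockets is an illustrative assumption.
NIC_SOCKET = {"eth0": 0, "eth1": 0, "eth2": 1, "eth3": 1}

def pick_nic(buffer_socket: int, team: list) -> str:
    """Prefer a team member attached to the socket holding the packet
    buffer; fall back to any member if no local NIC is in the team."""
    local = [nic for nic in team if NIC_SOCKET[nic] == buffer_socket]
    # Preferring a local NIC keeps the buffer, the processing core, and
    # the transmitting device on one socket, avoiding inter-CPU links.
    candidates = local if local else team
    return candidates[0]

print(pick_nic(1, ["eth0", "eth2"]))  # eth2: local to socket 1
print(pick_nic(0, ["eth2", "eth3"]))  # eth2: no socket-0 member, fall back
```

A fuller scheduler would also balance load among the local candidates rather than always taking the first, but the example shows the locality preference that generic round-robin or hash-based policies lack.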