Multiprocessor computer systems can generally be divided into two categories: systems in which multiple processors share a common memory and peripherals, and systems in which the processors have distributed memories and peripherals. Systems organized so that all processors have equal access to the peripheral devices and memories are known as symmetric multiprocessing (SMP) systems. The processors within an SMP system are connected to the shared memory and to each other via a common bus. A bus hierarchy may be used to connect the peripheral devices.
In a non-uniform memory access (“NUMA”) computer architecture, memory access latencies are allowed to differ depending on processor and memory locations. All processors in a NUMA computer system continue to share system memory, but the time required to access memory varies, i.e., is non-uniform, based on the processor and memory location. The main advantage of NUMA SMP designs over alternatives such as uniform memory access (UMA) SMP designs is scalability. Further, programming on NUMA SMPs is as simple as programming on traditional shared-memory SMP systems. As a result, NUMA computer systems can run existing SMP applications without modification.
In a NUMA computer system where processors and system memory are organized into two or more clusters, or locality domains, each locality domain can include one or more processors that communicate with local memory by means of a local bus. Each locality domain also includes a bridge that interconnects it with other locality domains over a communication channel, forming a network of intercommunicating locality domains. In such a multinode multiprocessor computer system, a processor performs best when it accesses memory in its own locality domain rather than in a remote locality domain, because a local access requires only the local bus.
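The difference between local and remote access described above can be illustrated with a minimal sketch. The `LocalityDomain` class, the helper functions, the two-domain layout, and both latency constants are hypothetical and chosen only for illustration; real latencies depend on the hardware and interconnect.

```python
from dataclasses import dataclass

# Hypothetical latencies in nanoseconds, chosen only for illustration.
LOCAL_BUS_NS = 100      # cost of an access over the local bus
INTERCONNECT_NS = 200   # extra cost of crossing the bridge to another domain


@dataclass(frozen=True)
class LocalityDomain:
    domain_id: int
    cpus: frozenset
    mem_start: int  # first address owned by the domain (inclusive)
    mem_end: int    # end of the domain's memory range (exclusive)


def domain_of_cpu(domains, cpu):
    """Return the locality domain that contains the given processor."""
    for domain in domains:
        if cpu in domain.cpus:
            return domain
    raise ValueError(f"CPU {cpu} is not in any locality domain")


def domain_of_address(domains, address):
    """Return the locality domain whose memory range holds the address."""
    for domain in domains:
        if domain.mem_start <= address < domain.mem_end:
            return domain
    raise ValueError(f"address {address:#x} is not in any locality domain")


def access_latency_ns(domains, cpu, address):
    """Model the latency of one access: local bus only, or bus plus bridge."""
    source = domain_of_cpu(domains, cpu)
    target = domain_of_address(domains, address)
    if source.domain_id == target.domain_id:
        return LOCAL_BUS_NS
    return LOCAL_BUS_NS + INTERCONNECT_NS


# Assumed example system: two domains, two processors and 1 GiB each.
DOMAINS = [
    LocalityDomain(0, frozenset({0, 1}), 0x0000_0000, 0x4000_0000),
    LocalityDomain(1, frozenset({2, 3}), 0x4000_0000, 0x8000_0000),
]
```

Under these assumed numbers, `access_latency_ns(DOMAINS, 0, 0x1000)` models a local access at 100 ns, while `access_latency_ns(DOMAINS, 0, 0x4000_1000)` models a remote access at 300 ns.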
Determining the underlying architecture and memory access patterns of all locality domains in a multinode multiprocessor computer system, and exploiting that knowledge to place programs and data optimally on a NUMA machine, can lead to significant performance gains. During system reboot, the system firmware generally contains topology information for all the processors and memories present in the multiprocessor environment. Such topology information identifies the locality domains, that is, the groups of processors and associated memories in the system. This enables a tight coupling between the processors and the memory ranges in a locality domain, and the operating system can use such affinity information to determine the allocation of memory resources and the scheduling of software threads to improve system performance.
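As a rough sketch of how an operating system might consume such topology information, the example below turns a flat list of firmware-style records into affinity maps and orders memory ranges local-first for a thread. The record layout, `TOPOLOGY_RECORDS`, `build_affinity`, and `preferred_ranges` are all hypothetical names introduced for illustration and do not correspond to any particular firmware format.

```python
from collections import defaultdict

# Hypothetical firmware-style records: (kind, resource, domain_id).
# A "cpu" resource is a processor number; a "mem" resource is a
# (start, end) address range with an exclusive end.
TOPOLOGY_RECORDS = [
    ("cpu", 0, 0), ("cpu", 1, 0),
    ("cpu", 2, 1), ("cpu", 3, 1),
    ("mem", (0x0000_0000, 0x4000_0000), 0),
    ("mem", (0x4000_0000, 0x8000_0000), 1),
]


def build_affinity(records):
    """Group processors and memory ranges by locality domain."""
    cpu_to_domain = {}
    domain_to_ranges = defaultdict(list)
    for kind, resource, domain_id in records:
        if kind == "cpu":
            cpu_to_domain[resource] = domain_id
        elif kind == "mem":
            domain_to_ranges[domain_id].append(resource)
        else:
            raise ValueError(f"unknown record kind: {kind!r}")
    return cpu_to_domain, dict(domain_to_ranges)


def preferred_ranges(cpu, cpu_to_domain, domain_to_ranges):
    """Return memory ranges for a thread on `cpu`, local domain first."""
    local = cpu_to_domain[cpu]
    ordered = list(domain_to_ranges.get(local, []))
    # Remote domains follow as fallbacks, in ascending domain order.
    for domain_id in sorted(domain_to_ranges):
        if domain_id != local:
            ordered.extend(domain_to_ranges[domain_id])
    return ordered
```

With the assumed records, a thread scheduled on processor 2 would be offered the range starting at `0x4000_0000` first and would fall back to the range of domain 0 only if local memory were exhausted.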
Current optimization techniques use such affinity information to make better use of locality domains and thereby reduce memory access latency. For example, most operating systems provide a way to lock an entire process within a locality domain so that all threads of the process share a common pool of memory with substantially low latency. If a process must span locality domains, current techniques improve memory access for its threads by splitting the locality domains accessed by the threads into local domain memory segments. While these techniques address data handling, they do not address instruction handling. In addition, current techniques do not use such affinity information to partition the code buffer based on locality domain and/or thread affinity in NUMA computer systems.
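One way an operating system exposes this kind of process locking can be sketched with the Linux-only calls `os.sched_getaffinity` and `os.sched_setaffinity` from the Python standard library. The function name `pin_process_to_domain` and the idea of passing in a domain's processor set are assumptions made for illustration. The sketch restricts only where the process runs; it does not by itself bind memory to the domain.

```python
import os


def pin_process_to_domain(domain_cpus):
    """Restrict the calling process to the processors of one locality domain.

    `domain_cpus` is an assumed, caller-supplied set of processor numbers
    belonging to the domain. Returns the resulting processor set, or None
    on platforms that do not provide the affinity calls.
    """
    if not hasattr(os, "sched_setaffinity"):
        return None
    # Only processors the process is already allowed to use can be chosen.
    allowed = os.sched_getaffinity(0)
    target = allowed & set(domain_cpus)
    if not target:
        raise ValueError("no processor of this domain is available to the process")
    os.sched_setaffinity(0, target)  # pid 0 means the calling process
    return os.sched_getaffinity(0)
```

Because affinity applies to the whole process here, every thread created afterwards inherits the same processor set, which mirrors the "lock an entire process within a locality domain" behavior described above.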