Computer technology continues to advance at a remarkable pace, with numerous improvements being made to the performance of both processing units—the “brains” of a computing system—and the memory that stores the data processed by a computing system.
In general, a processing unit (“CPU”) is a microprocessor or other integrated circuit that operates by executing a sequence of instructions that form a computer program. The instructions are typically stored in a memory system having a plurality of storage locations identified by unique memory addresses. The memory addresses collectively define a “memory address space,” representing an addressable range of memory regions that can be accessed by a microprocessor.
Parallel processing computing systems often include a plurality of nodes, where each node includes at least one CPU, and the plurality of nodes are interconnected such that the computing nodes may transmit and receive data therebetween and also access memory connected to various nodes in the system. In a computing system with a plurality of CPUs and/or a plurality of nodes, a non-uniform memory access (“NUMA”) configuration may be utilized to effectively distribute the main memory across multiple computing nodes. In a typical NUMA configuration, at least one CPU, one or more CPU caches, and a portion of the main memory (e.g., a set of dynamic random access memory (“DRAM”) devices) are connected to a memory bus to form a node. Typically, a plurality of nodes are connected by means of a high speed interconnect to form a NUMA configuration. The portion of the main memory resident on the same node as a CPU is typically considered to be the “local memory” for the CPU, while portions of main memory resident on other nodes are typically referred to as “remote memories” relative to the CPU. In a computer system with a NUMA configuration (a “NUMA system”), a data access by a CPU that is satisfied by the contents of a local CPU cache or a local memory is referred to as a “local node” access. Accordingly, a “remote node” access is typically an access satisfied by accessing data that is stored on a remote node. Data accesses to remote nodes are associated with substantially higher latency than local node accesses.
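By way of illustration only, the distinction between a “local node” access and a “remote node” access may be sketched as follows. The two-node layout, the address ranges, and the names `Node`, `NODES`, `node_of_cpu`, `home_node`, and `classify_access` are hypothetical and are not taken from any particular system. For simplicity, the sketch ignores CPU caches and classifies an access solely by the node whose portion of main memory contains the address.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    node_id: int
    cpu_ids: tuple[int, ...]  # CPUs attached to this node's memory bus
    mem_start: int            # first address of this node's portion of main memory
    mem_end: int              # one past the last address of that portion


# Hypothetical two-node NUMA system; the address ranges are illustrative only.
NODES = (
    Node(node_id=0, cpu_ids=(0, 1), mem_start=0x0000, mem_end=0x8000),
    Node(node_id=1, cpu_ids=(2, 3), mem_start=0x8000, mem_end=0x10000),
)


def node_of_cpu(cpu_id: int) -> int:
    """Return the id of the node on which the given CPU resides."""
    for node in NODES:
        if cpu_id in node.cpu_ids:
            return node.node_id
    raise ValueError(f"unknown CPU {cpu_id}")


def home_node(address: int) -> int:
    """Return the id of the node whose local memory holds the address."""
    for node in NODES:
        if node.mem_start <= address < node.mem_end:
            return node.node_id
    raise ValueError(f"address {address:#x} is outside the memory address space")


def classify_access(cpu_id: int, address: int) -> str:
    """Classify an access as 'local node' or 'remote node' (caches ignored)."""
    if node_of_cpu(cpu_id) == home_node(address):
        return "local node"
    return "remote node"
```

In this sketch, an access by CPU 0 to address 0x1000 is a local node access, whereas an access by the same CPU to address 0x9000 is a remote node access.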
Typically, when a process is executed in a NUMA system, the CPU executing the process accesses one or more memory locations to retrieve data required by the process. In NUMA systems, the process typically executes faster if it is configured to perform its operations on a node that holds the required data in a local memory. Likewise, a process executed in a NUMA system may execute slower if the process is configured to perform its operations on a node while the required data resides in a remote memory, due to the increased latency associated with accessing the remote node. Moreover, in highly distributed NUMA systems (i.e., NUMA systems with large numbers of interconnected nodes), the latency associated with a node remotely accessing a first memory in a first remote node may differ from the latency associated with remotely accessing a second memory in a second remote node due to the transmission path length between the node and the respective remote nodes, the system resources configured on each respective remote node, the processes executing on each remote node at the time of remote access, other processes also attempting to remotely access each node at the time of remote access, and/or other such reasons.
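The effect of non-uniform remote latencies on process placement may be illustrated with a minimal sketch. The latency values, the three-node layout, and the names `LATENCY_NS`, `access_cost`, and `best_node` are assumptions made for illustration only. The sketch also assumes a fixed latency between each pair of nodes, whereas, as noted above, actual latency may additionally vary with the processes executing on or accessing each remote node at the time of access.

```python
# Hypothetical latencies in nanoseconds. LATENCY_NS[i][j] is the assumed cost
# for a CPU on node i to access memory resident on node j. The diagonal holds
# local node latencies; remote latencies differ with path length.
LATENCY_NS = [
    [100, 250, 400],
    [250, 100, 300],
    [400, 300, 100],
]


def access_cost(cpu_node: int, accesses_per_node: dict[int, int]) -> int:
    """Total latency if the process runs on cpu_node.

    accesses_per_node maps a memory node id to the number of accesses the
    process makes to memory resident on that node.
    """
    return sum(
        LATENCY_NS[cpu_node][mem_node] * count
        for mem_node, count in accesses_per_node.items()
    )


def best_node(accesses_per_node: dict[int, int]) -> int:
    """Return the node on which the process incurs the lowest total latency."""
    return min(
        range(len(LATENCY_NS)),
        key=lambda node: access_cost(node, accesses_per_node),
    )


# A process that makes 10 accesses to memory on node 0 and 90 on node 2.
accesses = {0: 10, 2: 90}
costs = [access_cost(node, accesses) for node in range(len(LATENCY_NS))]
# costs == [37000, 29500, 13000], so best_node(accesses) == 2
```

Under these assumed values, executing the process on node 2, where most of its data is local, yields less than half the total latency of executing it on either of the other nodes, and the two remote placements themselves differ in cost.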
As such, in distributed systems, including for example NUMA systems, not all remote memory locations have equal latency for all processors. In fact, the physical and virtual location of each processor in a particular node creates differences in how efficiently that processor can access different areas of memory, including when transferring data between caches associated with a specific processor or node. If two processors attempting to perform significant operations on shared memory segments are relatively “distant” from one another, the operations by which they share and access these segments may be significantly less efficient.
Consequently, a need continues to exist for a manner of optimizing the performance of a shared memory computer system that reduces memory access latency.