1. Field of the Invention
The present invention relates generally to allocation of memory in a computer system with distributed memory, and more particularly to a method for representing the locality of memory for a multi-processor non-uniform memory access (NUMA) computer system.
2. Related Art
A distributed memory computer system typically includes a plurality of physically distinct and separated processing nodes. Each node has one or more processors, input/output (I/O) devices and main memory that can be accessed by any of the processors. The main memory is physically distributed among the processing nodes. In other words, each processing node includes a portion of the main memory. Thus, each processor has access to "local" main memory (i.e., the portion of main memory that resides in the same processing node as the processor) and "remote" main memory (i.e., the portion of main memory that resides in other processing nodes).
For each processor, the latency associated with accessing local main memory is significantly less than the latency associated with accessing remote main memory. Further, in many such systems, the latency associated with accessing remote memory increases as the topological distance between the node making a memory request (requesting node) and the node servicing the memory request (servicing node) increases. Accordingly, distributed memory computer systems as just described are referred to as non-uniform memory access (NUMA) computer systems.
In NUMA computer systems, it is desirable to store data in the portion of main memory that exists in the same processing node as the processor that most frequently accesses the data (or as close as possible to the processor that most frequently accesses the data). Accordingly, it is desirable to allocate memory as close as possible to the processing node that will be accessing the memory. By doing this, memory access latency is reduced and overall system performance is increased.
Therefore, memory management is essential in multi-processor systems employing NUMA architectures. In conventional systems, the operating system typically controls memory management functions on behalf of application programs. This is typically accomplished through the use of predetermined memory management procedures designed to produce a certain level of locality. For example, such procedures include program code to accomplish page migration and page replication. In this fashion, data is dynamically moved and/or replicated to different nodes depending on the current system state. However, such predetermined operating system procedures may not be optimal for all types of application programs.
Thus, what is needed is a system and method for producing a high degree of locality in a NUMA system that works well with a variety of different types of application programs.
Accordingly, the present invention is directed toward a memory management and control system that is selectable at the application level by an application programmer (also referred to herein as "user"). The memory management and control system is based on the use of policy modules (PMs). PMs are used to specify and control different aspects of memory operations in NUMA computer systems. Policy modules are used to specify how memory is managed for processes (or "threads") running in NUMA computer systems.
Preferably, each PM comprises a plurality of methods that are used to control a variety of memory operations. Such memory operations typically include initial memory placement, memory page size, a migration policy, a replication policy and a paging policy. In one example of an implementation of the present invention, different PMs are specified for particular sections of an application's virtual address space.
In this manner, when a NUMA system needs to execute an operation to manage a particular section of an application's virtual address space, it uses the methods provided by the policies specified by the PM that is currently connected (or attached) to the particular section of virtual address space.
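The attachment of PMs to sections of a virtual address space can be illustrated with a minimal sketch. It is not the implementation of the invention: the class names `PolicyModule` and `AddressSpace`, the policy names such as "first_touch", and the 4096-byte page size are all assumptions introduced for illustration only.

```python
from dataclasses import dataclass
from typing import Dict

PAGE_SIZE = 4096  # assumed page size in bytes (illustrative only)


@dataclass(frozen=True)
class PolicyModule:
    """Hypothetical PM: one named policy per memory operation."""
    placement: str = "first_touch"   # hypothetical policy name
    page_size: int = PAGE_SIZE
    migration: str = "off"           # hypothetical policy name
    replication: str = "off"         # hypothetical policy name
    paging: str = "default"          # hypothetical policy name


class AddressSpace:
    """Maps each virtual page to the PM attached to it, if any."""

    def __init__(self, default_pm: PolicyModule) -> None:
        self.default_pm = default_pm
        self._attached: Dict[int, PolicyModule] = {}  # page number -> PM

    def attach(self, start: int, length: int, pm: PolicyModule) -> None:
        """Attach pm to every page overlapping [start, start + length)."""
        if length <= 0:
            raise ValueError("length must be positive")
        first = start // PAGE_SIZE
        last = (start + length - 1) // PAGE_SIZE
        for page in range(first, last + 1):
            self._attached[page] = pm

    def policy_for(self, addr: int) -> PolicyModule:
        """Return the PM governing addr, falling back to the default PM."""
        return self._attached.get(addr // PAGE_SIZE, self.default_pm)
```

In this sketch, a memory operation on a given address would first call `policy_for` and then apply the policies named by the returned PM. Tracking attachments per page reflects the single-page granularity described for the preferred embodiment; a production system would more likely use a range-based structure.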
In a preferred embodiment, the memory management and control system of the present invention provides application programmers with the ability to select different policies for different sections of the virtual address space down to the granularity of a single memory page. In one implementation, default policies are used each time a thread begins execution. The application programmer has the option to continue using the default policies or to specify different PMs comprising different methods.
One method typically contained in PMs is an initial placement policy ("placement policy"). The placement policy defines algorithms used by a physical memory allocator ("memory scheduler") to determine what memory source is to be used for allocating memory pages. The goal of the placement policy is to place memory in such a way that local accesses are maximized.
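One simple placement policy can be sketched as follows. It is an assumed example, not an algorithm taken from the invention: the function `place_page`, the `free_pages` table and the `hop_distance` table are hypothetical names. The sketch chooses the node with free memory that is nearest to the requesting node, so that the local node is used whenever it has free pages.

```python
from typing import Dict, Optional


def place_page(requesting_node: int,
               free_pages: Dict[int, int],
               hop_distance: Dict[int, Dict[int, int]]) -> Optional[int]:
    """Return the node with free memory nearest to requesting_node.

    free_pages maps node id -> number of free pages on that node.
    hop_distance[a][b] is the assumed distance from node a to node b.
    Returns None when no node has free memory.
    """
    candidates = [node for node, free in free_pages.items() if free > 0]
    if not candidates:
        return None
    # Prefer the smallest distance; break ties by lowest node id.
    return min(candidates,
               key=lambda node: (hop_distance[requesting_node][node], node))
```

Breaking ties by node id keeps the result deterministic; a real allocator might instead balance load across equally distant nodes.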
In a preferred embodiment of the present invention, placement policies are based on two abstractions of physical memory nodes. These two abstractions are referred to herein as "Memory Locality Domains" (MLDs) and "Memory Locality Domain Sets" (MLDSETs). One advantage to using MLDs and MLDSETs is that they facilitate the portability of application programs. That is, by specifying MLDs and MLDSETs, rather than physical memory nodes, application programs can be executed on different computer systems regardless of the particular node configuration and physical node topology employed by the system. Further, such application programs can be run on different machines without the need for code modification and/or re-compiling.
MLDs are specified as having a center node and a particular radius. Thus, a particular MLD with a center C and a radius R is a source of physical memory comprising all memory nodes within a "hop distance" (described below) of R from a center node located at C. Generally, an application programmer defining MLDs specifies the MLD radius and lets the operating system determine the center node. The operating system typically selects the center node based on a number of factors, including additional parameters that are specified by the application programmer. Such additional parameters include configuration topology and input/output (I/O) device affinity.
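The set of nodes belonging to an MLD with center C and radius R can be computed with a breadth-first search, as in the following sketch. It assumes that the hop distance between two nodes is the number of links on the shortest path between them and that the topology is given as an adjacency list; the function name `mld_nodes` is hypothetical.

```python
from collections import deque
from typing import Dict, List, Set


def mld_nodes(topology: Dict[int, List[int]],
              center: int,
              radius: int) -> Set[int]:
    """Return all nodes within `radius` hops of `center`.

    topology maps each node id to the ids of its directly linked nodes.
    """
    if radius < 0:
        raise ValueError("radius must be non-negative")
    seen = {center}
    frontier = deque([(center, 0)])
    while frontier:
        node, dist = frontier.popleft()
        if dist == radius:
            continue  # do not expand beyond the MLD radius
        for neighbor in topology[node]:
            if neighbor not in seen:
                seen.add(neighbor)
                frontier.append((neighbor, dist + 1))
    return seen
```

Under these assumptions, a radius of zero yields an MLD containing only the center node, and increasing the radius by one adds every node directly linked to the current members.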
For example, MLDSETs allow an application programmer to specify a device affinity for one or more MLDs. Thus, if an application program is associated with a particular I/O device, such device is specified when creating an MLDSET that will be associated with that section of a thread's virtual address space comprising code that interacts with the particular I/O device. In this fashion, the operating system automatically places the application code associated with the I/O device as close as possible to the node containing the I/O device.
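Device affinity can be illustrated with a sketch in which the operating system chooses center nodes for the MLDs of an MLDSET by ranking nodes according to their hop distance from the node containing the I/O device. This is an assumed strategy for illustration, not the placement procedure of the invention; `hop_distances` and `place_mldset` are hypothetical names, and hop distance is again assumed to be the shortest-path link count.

```python
from collections import deque
from typing import Dict, List


def hop_distances(topology: Dict[int, List[int]], source: int) -> Dict[int, int]:
    """Breadth-first search giving the hop count from source to each node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbor in topology[node]:
            if neighbor not in dist:
                dist[neighbor] = dist[node] + 1
                queue.append(neighbor)
    return dist


def place_mldset(topology: Dict[int, List[int]],
                 device_node: int,
                 mld_count: int) -> List[int]:
    """Pick one center node per MLD, nearest to the device node first."""
    dist = hop_distances(topology, device_node)
    if mld_count > len(dist):
        raise ValueError("more MLDs requested than reachable nodes")
    # Rank by distance to the device; break ties by lowest node id.
    ranked = sorted(dist, key=lambda node: (dist[node], node))
    return ranked[:mld_count]
```

The first MLD lands on the device node itself and the remaining MLDs on its nearest neighbors. The sketch ignores free memory and any requested topology, both of which a real placement would also need to consider.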
In addition, MLDSETs allow an application programmer to specify a specific topology for MLDs. For example, an application programmer can specify that the MLDs comprising a particular MLDSET should be arranged in a cube or a cluster topology. In addition, application programmers also have the option to specify physical nodes for the placement of MLDSETs.
Further features and advantages of the present invention, as well as the structure and operation of various embodiments of the present invention, are described in detail below with reference to the accompanying drawings. In the drawings, like reference numbers indicate identical or functionally similar elements.