This invention relates generally to multiprocessor computer systems having shared memory distributed among a multiplicity of nodes. More specifically, this invention relates to the physical placement of memory among the nodes of a multinode computer system, when allocated in response to processor faults. By controlling the physical placement of the memory so allocated among the nodes of the system, a greater proportion of memory references can be made local to each node than would otherwise be the case using a naive or other physical memory placement policy. The increased memory locality in turn yields a commensurate improvement in the overall performance of the system.
Multiprocessor computers by definition contain multiple processors that can execute multiple parts of a computer program and/or multiple distinct programs simultaneously, in a manner known as parallel computing. In general, multiprocessor computers execute multithreaded programs and/or single-threaded programs faster than conventional single-processor computers, such as personal computers (PCs), that must execute programs sequentially. The actual performance advantage is a function of a number of factors, including the degree to which parts of a multithreaded program and/or multiple distinct programs can be executed in parallel and the architecture of the particular multiprocessor computer at hand.
Multiprocessor computers may be classified by how they share information among the processors. Shared-memory multiprocessor computers offer a common physical memory address space that all processors can access. Multiple processes and/or multiple threads within the same process can communicate through shared variables in memory that allow them to read or write to the same memory location in the computer. Message passing multiprocessor computers, in contrast, have a separate memory space for each processor, requiring processes in such a system to communicate through explicit messages to each other.
Shared-memory multiprocessor computers may further be classified by how the memory is physically organized. In distributed shared-memory computers, the memory is divided into modules physically placed near each processor. Although all of the memory modules are globally accessible, a processor can access memory placed nearby faster than memory placed remotely. Because the memory access time differs based on memory location, distributed shared-memory systems are often called non-uniform memory access (NUMA) machines. By contrast, in centralized shared-memory computers, the memory is physically in one location. Centralized shared-memory computers are called uniform memory access (UMA) machines because the memory is equidistant in time from each of the processors. Both forms of memory organization typically use high-speed cache memory in conjunction with main memory to reduce execution time.
Multiprocessor computers with distributed shared memory are often organized into multiple nodes with one or more processors per node. The nodes interface with each other through a memory-interconnect network by using a protocol, such as the protocol described in the Scalable Coherent Interface (SCI) (IEEE 1596). UMA machines typically use a bus for interconnecting all of the processors.
Further information on multiprocessor computer systems in general and NUMA machines in particular can be found in a number of works including Computer Architecture: A Quantitative Approach (2nd Ed. 1996), by D. Patterson and J. Hennessy, which is hereby incorporated by reference.
While NUMA machines offer significant advantages over UMA machines in terms of bandwidth, they face the prospect of increased delay in some instances if their operating systems do not take into account the physical division of memory. For example, in responding to a system call by a process (a part of a computer program in execution) for allocating physical memory, conventional operating systems do not consider the node location of the process, the amount of free memory on each node, or a possible preference by the process for memory on a specific node in responding to the request. The operating system simply allocates memory for the shared memory object from its global free list of memory. This can result in the process making multiple accesses to remote nodes if the memory is not allocated on the process's node. Or it can result in continual process faults such as page faults and movement of processes into and out of memory ("swapping") if the memory is allocated on a node that has little free memory.
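The effect of allocating from a global free list without regard to node location can be illustrated with a small simulation. This is only a sketch under stated assumptions: the function names `allocate_naively` and `remote_fraction` are hypothetical, the free list is modeled as a queue of node numbers (one entry per free page), and the interleaved ordering of the list is chosen purely for illustration.

```python
from collections import deque
from typing import Deque, List


def allocate_naively(global_free_list: Deque[int], num_pages: int) -> List[int]:
    """Take pages from the head of the global free list, ignoring node location.

    Each entry in the list is the number of the node that physically holds
    that free page (an assumed, simplified representation).
    """
    return [global_free_list.popleft() for _ in range(num_pages)]


def remote_fraction(page_nodes: List[int], process_node: int) -> float:
    """Fraction of the allocated pages that are not on the process's own node."""
    remote = sum(1 for node in page_nodes if node != process_node)
    return remote / len(page_nodes)


# Assumed four-node system whose free pages happen to be interleaved by node.
free_list = deque([0, 1, 2, 3, 0, 1, 2, 3])
pages = allocate_naively(free_list, 4)
print(pages)                                   # [0, 1, 2, 3]
print(remote_fraction(pages, process_node=2))  # 0.75
```

In this illustrative case, a process running on node 2 finds three of its four pages on remote nodes, which is the situation the paragraph above describes.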
An objective of the invention, therefore, is to provide a method for allocating memory in a multinode multiprocessor system which responds to the communicated physical placement needs of the application program requesting the memory. The program is created by a user such as a computer programmer, and it is believed that the user in many situations knows best how the program should run in the system, and where the physical memory used by the program should be placed.
A method according to the invention enables an application program (i.e., a user process) to specify a policy for allocating physical memory on a node of a multinode multiprocessor computer system for the program. The memory is then dynamically allocated, when needed, in accordance with the specified policy.
According to the invention, the computer operating system receives a request from an application program to create, or reserve, a portion of virtual address space and to allocate, in accordance with a policy specified by the program, physical memory on a node as a result of a subsequent reference to the virtual address space portion. In response to the request, the operating system creates the virtual address space portion. In response to a subsequent reference to the virtual address space portion by an application program, the physical memory is allocated on a node in accordance with the specified policy for association with the virtual address space portion. The set of nodes on which memory must be allocated in accordance with the policy may also be specified. Alternatively, the physical memory can be allocated at the time the operating system responds to the request.
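The two-step sequence described above (reserve the virtual address space portion first, allocate physical memory on a later reference) can be modeled as follows. This is a minimal sketch, not the operating system's actual implementation: `ToyKernel`, `Region`, `reserve`, `reference`, and `allocate_on_faulting_node` are hypothetical names, the page size and starting address are assumed values, and the policy is represented simply as a function that picks a node.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

PAGE_SIZE = 4096  # assumed page size in bytes

# A policy maps (faulting node, free pages per node) to the node to allocate on.
Policy = Callable[[int, List[int]], int]


@dataclass
class Region:
    """A reserved portion of virtual address space and its remembered policy."""
    start: int
    num_pages: int
    policy: Policy
    page_to_node: Dict[int, int] = field(default_factory=dict)


class ToyKernel:
    """Hypothetical stand-in for the operating system; not a real kernel API."""

    def __init__(self, free_pages_per_node: List[int]) -> None:
        self.free_pages = list(free_pages_per_node)
        self.next_start = 0x10000  # arbitrary first virtual address

    def reserve(self, num_pages: int, policy: Policy) -> Region:
        """Create the virtual address space portion; no physical memory yet."""
        region = Region(self.next_start, num_pages, policy)
        self.next_start += num_pages * PAGE_SIZE
        return region

    def reference(self, region: Region, addr: int, faulting_node: int) -> int:
        """Return the node backing addr, allocating a page on first reference."""
        page = (addr - region.start) // PAGE_SIZE
        if not 0 <= page < region.num_pages:
            raise ValueError("address is outside the reserved region")
        if page not in region.page_to_node:  # first reference: page fault
            node = region.policy(faulting_node, self.free_pages)
            if self.free_pages[node] == 0:
                raise MemoryError(f"node {node} has no free pages")
            self.free_pages[node] -= 1
            region.page_to_node[page] = node
        return region.page_to_node[page]


def allocate_on_faulting_node(faulting_node: int, free_pages: List[int]) -> int:
    """Example policy: place the page on the node where the fault occurred."""
    return faulting_node


kernel = ToyKernel([4, 4])
region = kernel.reserve(2, allocate_on_faulting_node)
print(region.page_to_node)                                          # {}
print(kernel.reference(region, region.start, faulting_node=1))      # 1
print(kernel.reference(region, region.start + 8, faulting_node=0))  # 1
print(kernel.free_pages)                                            # [4, 3]
```

The second reference falls in the same page as the first, so no new allocation occurs and the page remains on node 1 even though the reference comes from node 0. The alternative mentioned above, allocating at request time, would correspond to filling `page_to_node` inside `reserve` instead.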
Related to the method is a data structure for controlling the allocation of memory in accordance with an allocation policy specified by the application program.
In a preferred embodiment of the invention, an application program can specify through means such as a system call to the operating system that physical pages of memory for an application-specified portion of virtual address space are to be physically allocated upon a specified set of nodes within the multinode computer system. This allocation is subject to the additional selection criteria that the pages are to be allocated at first reference upon: 1) the node upon which the reference first occurs; 2) the node which has the most free memory; or 3) the nodes of the indicated set in such a way that the pages are evenly distributed across them. In effect, the operating system remembers the specified allocation policy and node set from which the physical pages can be subsequently allocated, as established by the system call. Subsequent use of the virtual address space for which the allocation policy is defined results in the memory being allocated accordingly. In this way, an application program can use memory with the memory locality most advantageous to it. Of course, other selection criteria than the above three may be used.
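The three selection criteria can be compared side by side with a short sketch. Everything here is assumed for illustration: the policy names, the `place_pages` function, the use of round-robin placement to approximate even distribution, the tie-breaking rule (the earliest node in the set wins), and the fallback to the most-free node when a first reference comes from a node outside the specified set. None of these details is stated in the text above.

```python
from typing import Dict, List, Sequence

# Hypothetical policy names for the three criteria described above.
FIRST_TOUCH = "first_touch"  # node on which the reference first occurs
MOST_FREE = "most_free"      # node in the set with the most free memory
DISTRIBUTE = "distribute"    # spread pages evenly across the node set


def place_pages(policy: str,
                node_set: Sequence[int],
                free_pages: Dict[int, int],
                faulting_nodes: Sequence[int]) -> List[int]:
    """Return the node chosen for each first-reference fault, in order.

    Toy model: free-page exhaustion is not handled.
    """
    free = dict(free_pages)  # work on a copy so the caller's counts are kept
    placements: List[int] = []
    next_index = 0           # round-robin position for DISTRIBUTE
    for faulting_node in faulting_nodes:
        if policy == FIRST_TOUCH and faulting_node in node_set:
            node = faulting_node
        elif policy == DISTRIBUTE:
            node = node_set[next_index % len(node_set)]
            next_index += 1
        elif policy in (FIRST_TOUCH, MOST_FREE):
            # MOST_FREE, or the assumed fallback for FIRST_TOUCH when the
            # faulting node is not in the set. Ties go to the earliest node.
            node = max(node_set, key=lambda n: free[n])
        else:
            raise ValueError(f"unknown policy: {policy}")
        free[node] -= 1
        placements.append(node)
    return placements


nodes = [0, 1, 2]
free = {0: 3, 1: 1, 2: 2}
faults = [2, 2, 0, 1]  # node on which each first reference occurs
print(place_pages(FIRST_TOUCH, nodes, free, faults))  # [2, 2, 0, 1]
print(place_pages(MOST_FREE, nodes, free, faults))    # [0, 0, 2, 0]
print(place_pages(DISTRIBUTE, nodes, free, faults))   # [0, 1, 2, 0]
```

The same sequence of faults yields three different placements, which shows why the choice is left to the application program: the first favors locality for the faulting process, the second favors nodes with spare capacity, and the third spreads the load across the set.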
The preferred embodiments of the invention include extensions to the mmap and shmget functions of UNIX-based operating systems.
The foregoing and other objects, features, and advantages of the invention will become more apparent from the following detailed description of a preferred embodiment which proceeds with reference to the accompanying drawings.