The core of a computer's operating system is known as the kernel. It performs a number of tasks central to the computer's operation including managing memory, files, and peripheral devices, launching application programs, and allocating system resources.
Programs interact with the kernel by invoking a well-defined set of system calls. The system calls invoke functions within the kernel to perform various operations for the calling program, such as displaying text or graphics or controlling a peripheral device. At a deeper level, kernel functions may themselves call other functions within the kernel. One such function in some UNIX-based operating systems is kmem_alloc, which the kernel calls to allocate memory needed for an operation the kernel is to perform. The kmem_alloc function, like the more familiar application-level malloc function, dynamically allocates memory for an executing process. The kmem_alloc function may be used, for example, to dynamically allocate memory for locks temporarily created by the operating system.
Memory allocation functions are useful for allocating memory in both single-processor and multiprocessor computers. By definition, multiprocessor computers contain multiple processors that can execute multiple parts of a computer program or multiple distinct programs simultaneously, in a manner known as parallel computing. In general, multiprocessor computers execute multithreaded programs or multiple single-threaded programs faster than conventional single-processor computers, such as personal computers (PCs), that must execute programs sequentially. The actual performance advantage is a function of a number of factors, including the degree to which parts of a multithreaded program or multiple distinct programs can be executed in parallel and the architecture of the particular multiprocessor computer at hand.
Multiprocessor computers may be classified by how they share information among the processors. Shared-memory multiprocessor computers offer a common physical memory address space that all processors can access. Multiple processes or multiple threads within the same process can communicate through shared variables in memory that allow them to read from or write to the same memory location in the computer. Message-passing multiprocessor computers, in contrast, have a separate memory space for each processor, requiring processes in such a system to communicate through explicit messages to each other.
Shared-memory multiprocessor computers may further be classified by how the memory is physically organized. In distributed shared-memory computers, the memory is divided into modules physically placed near each processor. Although all of the memory modules are globally accessible, a processor can access memory placed nearby faster than memory placed remotely. Because the memory access time differs based on memory location, distributed shared-memory systems are often called non-uniform memory access (NUMA) machines. By contrast, in centralized shared-memory computers, the memory is physically in one location. Centralized shared-memory computers are called uniform memory access (UMA) machines because the memory is equidistant in time from each of the processors. Both forms of memory organization typically use high-speed cache memory in conjunction with main memory to reduce execution time.
Multiprocessor computers with distributed shared memory are often organized into multiple nodes with one or more processors per node. The nodes interface with each other through a memory-interconnect network by using a protocol, such as the protocol described in the Scalable Coherent Interface (SCI) (IEEE 1596). UMA machines typically use a bus for interconnecting all of the processors.
Further information on multiprocessor computer systems in general and NUMA machines in particular can be found in a number of works including Computer Architecture: A Quantitative Approach (2nd Ed. 1996), by D. Patterson and J. Hennessy, which is hereby incorporated by reference.
In a NUMA machine, the memory on a node is physically closer to a processor on the same node than to a processor on another node. Consequently, a process runs faster if its memory is placed on the node containing the processor running that process, since the processor and memory then do not need to communicate between nodes. In a UMA machine, in contrast, the memory is substantially equidistant from all processors, and there is no performance advantage to placing a process's memory in any particular range of physical addresses.
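The locality effect described above can be illustrated with a toy cost model. The following is a minimal sketch only: the function names and the two latency figures are hypothetical values chosen for illustration, not measurements of any particular machine.

```c
#include <assert.h>

/* Hypothetical latencies in nanoseconds; real values depend on the machine. */
enum { LOCAL_NS = 100, REMOTE_NS = 400 };

/* Cost of one memory access issued by a processor on cpu_node
 * to memory that resides on mem_node. */
int access_cost_ns(int cpu_node, int mem_node)
{
    assert(cpu_node >= 0 && mem_node >= 0);
    return (cpu_node == mem_node) ? LOCAL_NS : REMOTE_NS;
}

/* Total cost of n_accesses accesses, ignoring caches entirely. */
long total_cost_ns(int cpu_node, int mem_node, long n_accesses)
{
    return n_accesses * (long)access_cost_ns(cpu_node, mem_node);
}
```

Under these assumed figures, a process on node 0 whose memory was placed on node 1 pays four times as much per access as the same process with node-local memory. In a UMA machine the two constants would be equal, and placement would make no difference.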
A single operating system typically controls the operation of a multinode multiprocessor computer with distributed shared memory. Examples of suitable operating systems include UNIX-based operating systems such as DYNIX/ptx, BSD, SVR4, UnixWare, or PC UNIX. For background information on such operating systems, see Bach, M. J., The Design of the UNIX Operating System, Prentice-Hall, 1986; Vahalia, U., Unix Internals: The New Frontier, Prentice-Hall, 1996; and McKusick, M., et al., The Design and Implementation of the 4.4 BSD Operating System, Addison-Wesley, 1996, which are all hereby incorporated by reference.
Conventional methods for kernel or application memory allocation in multiprocessor systems do not recognize the performance advantage inherent in NUMA systems. Memory is treated as a global resource, and these methods (implemented in kmem_alloc or equivalent functions) allocate memory without regard to where the memory is located within the multiprocessor system. As a result, the system as a whole operates more slowly than if physical memory location were taken into account.
A general objective of the invention, therefore, is to provide an efficient method and means for dynamically allocating memory among memory choices. More specifically, the objectives of the invention include:
1. Providing for allocation of memory on a specified node in a NUMA machine, such as the same node on which a process requiring the memory is running, to promote memory locality and low memory latency.
2. Providing for allocation of memory from a specific requested memory class. This allows drivers for devices with restricted DMA ranges to operate with dynamically allocated memory.
3. Providing for a default choice of node and memory class if none is explicitly specified.
4. Providing a new memory allocation function that is compatible with standard memory allocation functions so that the new memory allocation function may be used by software designed to operate on non-NUMA machines without changing that software.
5. Providing for limits on the amount of memory that may be consumed by a particular type of memory, without affecting the efficiency of common-case allocations.
6. Providing for the performance of lock-free common-case allocations and deallocations, while still allowing CPUs to extract memory from each other's pools in low-memory situations.
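Objectives 1, 3, and 4 can be pictured with a small user-space sketch. This is not the allocator of the invention: every name here (node_alloc, compat_alloc, node_free, NODE_ANY, CLASS_ANY, current_node) is hypothetical, the node and class counts are arbitrary, and the standard malloc stands in for real per-node memory pools. The sketch only shows how a node and memory class may be requested explicitly, how defaults apply when neither is given, and how a wrapper with a conventional signature lets unchanged callers use the new function.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define NODE_ANY    (-1)  /* hypothetical: "use the caller's node" */
#define CLASS_ANY   (-1)  /* hypothetical: "use the default memory class" */
#define NUM_NODES   4     /* arbitrary for this sketch */
#define NUM_CLASSES 2     /* e.g. 0 = general, 1 = DMA-restricted */

/* Header placed in front of each block so the sketch can report where a
 * block was notionally allocated. The union keeps the payload aligned. */
union blk_hdr {
    struct { int node; int mem_class; size_t size; } info;
    max_align_t align;
};

static int current_node = 0;  /* stand-in for "node of the running CPU" */
static size_t bytes_in_use[NUM_NODES][NUM_CLASSES];

/* Allocate size bytes, accounted to the given node and memory class. */
void *node_alloc(size_t size, int node, int mem_class)
{
    if (node == NODE_ANY)       node = current_node;  /* default node */
    if (mem_class == CLASS_ANY) mem_class = 0;        /* default class */
    assert(node >= 0 && node < NUM_NODES);
    assert(mem_class >= 0 && mem_class < NUM_CLASSES);

    union blk_hdr *h = malloc(sizeof *h + size);
    if (h == NULL)
        return NULL;
    h->info.node = node;
    h->info.mem_class = mem_class;
    h->info.size = size;
    bytes_in_use[node][mem_class] += size;
    return h + 1;  /* payload starts just past the header */
}

/* Conventional-looking entry point: callers that know nothing about
 * nodes or classes get the defaults. */
void *compat_alloc(size_t size)
{
    return node_alloc(size, NODE_ANY, CLASS_ANY);
}

/* Release a block and undo its accounting. */
void node_free(void *p)
{
    if (p == NULL)
        return;
    union blk_hdr *h = (union blk_hdr *)p - 1;
    bytes_in_use[h->info.node][h->info.mem_class] -= h->info.size;
    free(h);
}

/* Accessors used to inspect the sketch's bookkeeping. */
int node_of(const void *p)
{
    return ((const union blk_hdr *)p - 1)->info.node;
}

size_t node_bytes_in_use(int node, int mem_class)
{
    return bytes_in_use[node][mem_class];
}
```

A driver for a device with a restricted DMA range might call node_alloc(len, NODE_ANY, 1) to request the restricted class on its own node, while existing code keeps calling compat_alloc(len) unchanged. The per-CPU pools and consumption limits of objectives 5 and 6 are deliberately left out of this sketch.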
The foregoing and other objectives, features, and advantages of the invention will become more apparent from the following detailed description of a preferred embodiment which proceeds with reference to the accompanying drawings.