Multiprocessor computers, by definition, contain multiple processors that can execute multiple parts of a computer program and/or multiple distinct programs simultaneously, in a manner known as parallel computing. In general, multiprocessor computers execute multithreaded programs and/or single-threaded programs faster than conventional single-processor computers, such as personal computers (PCs), that must execute programs sequentially. The actual performance advantage is a function of a number of factors, including the degree to which parts of a multithreaded program and/or multiple distinct programs can be executed in parallel and the architecture of the particular multiprocessor computer at hand.
Multiprocessor computers may be classified by how the processors share information, and specifically by whether or not they share memory. Shared-memory multiprocessor computers offer a common physical memory address space that all processors can access. Multiple processes and/or multiple threads within the same process can communicate through shared variables in memory, reading from and writing to the same memory locations in the computer. Message-passing multiprocessor computers, in contrast, have a separate memory space for each processor or set of processors, so processes in such a system must communicate by sending explicit messages to each other.
Shared-memory multiprocessor computers may further be classified by whether all physical memory can be accessed by all CPUs in the same amount of time. The two classes of shared-memory multiprocessors, distinguished by memory access time, are called Uniform Memory Access (UMA) and Non-Uniform Memory Access (NUMA) machines. NUMA machines are often organized into multiple nodes with one or more processors per node. Although all of the memory is globally accessible in a NUMA machine, a processor can access memory in its local node faster than memory in a remote node. Depending on the architecture, multiple levels of nodes may exist.
In UMA machines, all of the processors can access all of the physical memory in the same amount of time. Both forms of memory organization typically use high-speed cache memory in conjunction with main memory to reduce execution time.
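By way of illustration only, the distinction between UMA and NUMA access times may be modeled with a small latency function. The sketch below is a toy model: the function name, the node identifiers and the two latency figures are hypothetical and do not describe any particular machine.

```python
# Toy model of memory access cost. Both latency figures are hypothetical.
LOCAL_NS = 100    # assumed cost of reaching memory in the CPU's own node
REMOTE_NS = 300   # assumed cost of reaching memory in another node


def access_latency(cpu_node, memory_node, uniform=False):
    """Return the modeled latency (ns) for a CPU in cpu_node reading memory_node."""
    if uniform:
        # UMA: every CPU reaches every memory location in the same time.
        return LOCAL_NS
    # NUMA: memory in the local node is faster than memory in a remote node.
    return LOCAL_NS if cpu_node == memory_node else REMOTE_NS
```

In this model, a UMA machine returns the same value for every pair of CPU and memory locations, whereas a NUMA machine returns the larger value whenever the CPU and the memory reside in different nodes.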
FIG. 1 shows a computer system 100 which includes nodes 120 and 130, operating system 107, application 105, CPUs 122–123 and 132–133, I/O devices 150A–150B, and memories 121 and 131. The illustrated operating system 107 is a UNIX-based operating system. The operating system 107 is a program stored in memory on one or more of the nodes 120 and 130. The computer system 100 shown in FIG. 1 is a UNIX-based machine that permits each node to communicate with any other node.
In the computer system 100 shown in FIG. 1, the operating system 107 usually allocates memory for an application such that the application will perform well, with little or no knowledge of the application itself. This is straightforward on UMA machines, where the latency to all memory is the same from the CPUs 122–133 in the system 100. With UMA machines, the operating system 107 can indiscriminately allocate memory from almost anywhere and provide the same performance for the application 105. In contrast, allocation becomes more difficult on machines with asymmetric memory hierarchies, such as NUMA machines, where the latency is not the same to all memory from all CPUs 122–133 in the system. On such systems, the operating system may need to allocate the memory near the CPUs 122–133 that the application 105 runs on for optimal performance.
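The idea of allocating memory near the CPUs on which an application runs may be sketched as a local-first placement routine. The following is a minimal sketch under stated assumptions, not the allocator of any actual operating system: the function `allocate_pages`, the per-node free-page table and the fallback order are all hypothetical.

```python
def allocate_pages(cpu_node, pages_needed, free_pages):
    """Choose nodes for pages_needed pages, preferring the CPU's own node.

    free_pages maps a node id to its count of free pages (hypothetical
    bookkeeping) and is updated in place. Returns a dict mapping node id
    to the number of pages taken from that node.
    """
    if pages_needed > sum(free_pages.values()):
        raise MemoryError("not enough free pages in the system")
    # Visit the local node first, then the remaining nodes in id order.
    order = [cpu_node] + sorted(n for n in free_pages if n != cpu_node)
    placement = {}
    remaining = pages_needed
    for node in order:
        take = min(remaining, free_pages.get(node, 0))
        if take:
            placement[node] = take
            free_pages[node] -= take
            remaining -= take
        if remaining == 0:
            break
    return placement
```

For example, with four free pages on node 120 and eight on node 130, a request for ten pages from a CPU on node 130 would take all eight local pages and spill the remaining two onto node 120. On a UMA machine the choice of node would make no difference to performance, which is why such a routine is only of interest on NUMA machines.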
Without knowing more about the application itself, the operating system in FIG. 1 becomes an unintelligent conduit to memories 121 and 131 and can only do so much. On NUMA systems, the operating system can try to always allocate memory near the application, but this may not be the right choice at all times for all applications. For example, local allocation might be beneficial for only a portion of the application's address space, or the needs of the application may change over time. Thus, there are applications that could achieve better performance through better memory placement than the operating system provides by default. Given this, the problem is to find the best way to inform the operating system how memory should be allocated to obtain the best performance for the application. Furthermore, placement optimization requires restricting the locations of the processes.
In a prior art memory allocation system, such as system 100, an inflexible prescriptive memory allocation scheme is the only way an application can affect how the operating system allocates memory for the application. In such a memory allocation scheme, the application has to know the memory characteristics of the particular computer system that it is running on, and the semantics of the memory allocation policies implemented by the operating system, in order to optimally allocate the prescribed memory. This inflexible method of allocating memory restricts the ability of certain computer systems to run certain applications. Moreover, the computer system may run out of memory, the prescribed memory range may be unavailable, or the prescription may not apply to the computer that the application is currently running on, any of which renders the application's memory prescriptive capabilities ineffective. The prescriptive method of allocating memory to applications may also result in a degradation of the overall performance of the underlying computer system.
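The failure modes of a prescriptive scheme may be illustrated with a short sketch. It is a hypothetical model, not an interface of any existing operating system: the function `prescriptive_allocate` and its free-page table are invented here to show how a hard-coded node request breaks when the machine differs from the one the application was written for.

```python
def prescriptive_allocate(requested_node, pages_needed, free_pages):
    """Honor a hard-coded node request exactly, or fail.

    free_pages maps a node id to its count of free pages (hypothetical
    bookkeeping) and is updated in place on success.
    """
    if requested_node not in free_pages:
        # The prescription does not apply to this machine.
        raise LookupError(f"node {requested_node} does not exist on this machine")
    if free_pages[requested_node] < pages_needed:
        # The prescribed memory is unavailable.
        raise MemoryError(f"node {requested_node} cannot supply {pages_needed} pages")
    free_pages[requested_node] -= pages_needed
    return {requested_node: pages_needed}
```

An application that prescribes node 130 succeeds on a machine that has a node 130 with enough free pages, but the same request raises an error on a machine with a different node layout or with insufficient free memory on that node, even when memory elsewhere could have served the request.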
Because of these issues and the relative costs of the various solution approaches of the prior art, a simple solution is needed that is viable to implement with minimal expense and that provides minimal complexity and maximum usability to the end user. This solution should provide a platform-independent scheme of allocating memory based on how an application uses memory, rather than a rigid prescribed memory allocation policy.