1. Field of the Invention
This invention relates generally to memory management on multi-processor computer systems and, more specifically, to a system and method for efficient memory heap management in Non-Uniform Memory Access systems.
2. Description of the Related Art
Writing software for multi-processor systems is not an easy task. In an ideal scenario, as the total number of processors in a system increases, the throughput of an application would scale proportionally, so that doubling the number of processors would double the performance. However, this is rarely the case in practice. Thread synchronization and access to shared resources can cause portions of a program to execute serially and may produce bottlenecks. For example, when multiple processors use the same bus to access the memory, the bus can become saturated: as the number of processors in the system increases, the memory bandwidth available to each processor decreases. In fact, in many scenarios, increasing the number of processors in the system may cause performance degradation.
Some traditional systems are based on a Uniform Memory Access (UMA) shared memory architecture, such as the common bus-based symmetric multiprocessing (SMP) systems where multiple processors access the memory via a shared bus. The memory access time is the same for every processor, but the shared memory bus can become a major performance bottleneck. Processor manufacturers have traditionally attempted to mitigate the bottleneck by increasing processor cache sizes. A large cache increases the chance that the processor will find the data it needs in its local cache and will not have to access memory at all. Unfortunately, a large data cache may not be a general solution to the memory bottleneck problem, as some memory-intensive applications may use large areas of memory that do not fit in the available cache. In such cases the memory bottleneck difficulties remain. Further, the problem may worsen as the number of processors connected to the shared bus increases.
Another approach for reducing the shared memory bus bottleneck is through the use of a Non-Uniform Memory Access (NUMA) system architecture. In the NUMA architecture, a node may comprise a processor coupled to local memory. There may also be a mechanism allowing one processor to access memory connected to another processor. Typically, a processor may access its local memory (i.e., memory connected directly to the processor) faster than it may access remote memory (i.e., memory connected to another processor, on another node). An important challenge with NUMA architectures is controlling where the memory for data and code is allocated. However, carefully managing memory is an added burden for programmers. Implementing well-performing software solutions can be very challenging for a number of technical reasons, which is why many real-world software application developers choose to ignore the problem.
Operating systems (OSs) provide multiple application programming interfaces (APIs) for memory allocation and management. Unfortunately, these APIs are not always very efficient. In general, making an OS API call is expensive because of the context switch between user mode and the system kernel. Further, the APIs often have limitations such as a large minimum allocation size. For example, an API function may always allocate a whole page (4 KB) of memory even if the caller requested a much smaller size. This poses a serious problem for applications that frequently allocate and release small memory blocks.
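The cost of a page-granular minimum allocation size can be illustrated with a little arithmetic. The following is a minimal sketch in C, assuming the 4 KB page size used in the example above; the names PAGE_SIZE, round_up_to_pages and wasted_bytes are hypothetical and do not correspond to any particular OS API.

```c
#include <stddef.h>

/* Assumed page size, matching the 4 KB figure in the text; real systems may differ. */
#define PAGE_SIZE ((size_t)4096)

/* Round a request up to a whole number of pages, as a page-granular
 * allocation function would. Hypothetical helper, not an OS call. */
static size_t round_up_to_pages(size_t request)
{
    return (request + PAGE_SIZE - 1) / PAGE_SIZE * PAGE_SIZE;
}

/* Bytes reserved for the caller but never requested. */
static size_t wasted_bytes(size_t request)
{
    return round_up_to_pages(request) - request;
}
```

Under these assumptions, a 64-byte request would consume a full 4096-byte page, leaving 4032 bytes unused, which is why frequent small allocations made directly through such an API may be costly.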
To solve these problems, programmers typically use heap memory manager libraries. The standard C/C++ libraries for most popular C/C++ compilers include such heap manager implementations, but there are also many third-party options. A typical heap memory manager uses the OS API to allocate a large memory block at a time and divides each block into smaller parts to satisfy memory requests by the calling program. This reduces the cost of API call overhead. For example, an application may make 1000 calls to allocate 64 bytes of data, but the heap manager may make only a single OS API call to allocate a large memory block (e.g., 1 MB), and carve out 64 bytes of memory for each 64-byte request. When the initial pool of 1 MB of memory is used up, the heap manager may typically make another OS API call to allocate more memory. The heap memory manager can also allow applications to allocate memory blocks smaller than the minimum allocation size of the OS API, which may help reduce the waste of memory due to fragmentation.
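The carving behavior described above can be sketched as a simple pool that obtains one large block and hands out pieces of it. This is a minimal illustration only, not the heap manager of any particular C runtime: the type pool_t and the functions pool_init, pool_alloc and pool_destroy are hypothetical, malloc stands in for the OS allocation call, the 16-byte alignment is an assumption, and freeing individual blocks or growing the pool when it is exhausted is deliberately omitted.

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical pool descriptor: one large block obtained up front. */
typedef struct {
    unsigned char *base;     /* start of the large block */
    size_t capacity;         /* total bytes in the block */
    size_t used;             /* bytes already handed out */
    unsigned long os_calls;  /* number of large-block requests made */
} pool_t;

/* Obtain the large block once. malloc is a stand-in for an OS API call. */
static int pool_init(pool_t *p, size_t capacity)
{
    p->base = malloc(capacity);
    p->capacity = (p->base != NULL) ? capacity : 0;
    p->used = 0;
    p->os_calls = 1;
    return (p->base != NULL) ? 0 : -1;
}

/* Carve the next piece out of the block without another OS call.
 * Returns NULL when the pool is exhausted; a real heap manager would
 * typically request another large block at that point. */
static void *pool_alloc(pool_t *p, size_t size)
{
    size_t aligned = (size + 15) & ~(size_t)15;  /* assumed 16-byte alignment */
    if (aligned < size || aligned > p->capacity - p->used) {
        return NULL;  /* size overflowed, or not enough room left */
    }
    void *ptr = p->base + p->used;
    p->used += aligned;
    return ptr;
}

/* Return the whole block in a single call. */
static void pool_destroy(pool_t *p)
{
    free(p->base);
    p->base = NULL;
    p->capacity = 0;
    p->used = 0;
}
```

With a 1 MB pool, 1000 requests of 64 bytes each would be served from the single underlying block, consuming 64,000 bytes of it while the os_calls counter stays at one.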
Modern operating systems use virtual memory and give applications limited control over the mapping of virtual to physical memory. In such systems, when an application allocates a memory block (using an OS API call or a heap memory manager), it is assigned a virtual memory region. The OS maps that virtual memory region to a physical memory location, but the OS typically retains full control over when that happens and which physical memory range is used.
Modern operating systems such as Microsoft Windows and Linux use a “first touch” policy. This means that when an application or heap manager requests memory, the virtual address is initially not mapped to any physical memory. When a program thread first accesses the memory (read or write), the OS allocates a physical memory region and maps the virtual address to that physical range. The OS typically allocates physical memory from the NUMA node that is executing the thread which first accessed the virtual memory block. There are additional tools to help programmers better control memory allocation and thread execution on NUMA systems. For example, Microsoft Windows Vista™ provides an API that allows an application to allocate memory on a given node.
Unfortunately, both approaches have limitations that can adversely affect performance. Using these operating system APIs directly means that the programmer loses the benefits of the heap memory managers provided in the C Runtime libraries (CRT), which may result in a high cost for memory allocation and management and potentially high memory fragmentation. Conversely, when using a traditional heap memory manager, the programmer may not be able to control the location of memory allocations, resulting in degraded application performance due to a high volume of remote memory accesses.