Modern computer systems typically comprise at least one multiple-core central processing unit and, increasingly, at least one multiple-core graphics processing unit, with the latter being programmable to perform useful non-graphics tasks through heterogeneous computing frameworks such as CUDA and OpenCL. Due to the parallelism enabled by such systems, computer programs are increasingly designed to generate multiple program threads—ranging from a handful to thousands—in order to carry out sets of tasks which may be run relatively independently from one another and scheduled for concurrent execution. Examples of programs adopting multiple-threaded designs include web servers, databases, financial analytics applications, scientific and engineering analytics applications, and the like.
Specialized memory organization schemes can be useful in such systems because contention for access to the program heap can be costly. Dynamic memory allocation can be one of the most frequent operations in an application, with up to 30% of program execution time being spent in allocation and deallocation operations in certain benchmark applications. Frequent locking of the program heap during dynamic allocation operations also leads to poor scaling in multiple-threaded designs. Memory allocation modules focusing upon this problem generally use an organizational architecture pioneered by Hoard that provides a public, global memory heap for access by all threads as well as private, thread-local memory heaps for access by individual threads. [1] Thread-local memory heaps (each hereinafter a “local heap”) are created to meet much of the program memory demand without requiring the use of memory locks or transactional memory mechanisms to protect a heap against modification by other concurrently executing threads. The global memory heap (hereinafter a “global heap”) is used to hold any global variables or large data structures as well as to provide a cache of memory allocatable for use in local heaps. Performance of the allocator can still be important because contention for operations involving the global heap (fetch operations requesting chunks of memory for local heaps, and return operations releasing chunks of memory back to the global heap) will similarly delay the execution of allocator-invoking threads. As shown in FIG. 1, a thread must still invoke the allocator to transfer chunks of memory to and from the global heap, and the allocator must still use memory locks, transactional memory mechanisms, or the like in order to maintain coherency of the global heap while completing such transfer operations.
Accordingly, contention for the allocator (or, more strictly speaking, contention for responses to requests relating to management of the global heap) remains a problem in such architectures. TCMalloc (thread-caching malloc) is a well-known example of an allocator using a Hoard-like memory organization architecture. [2]
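By way of illustration only, the global heap and local heap arrangement described above may be sketched in C++ as follows. The sketch is not taken from Hoard or TCMalloc; the names `GlobalHeap`, `LocalHeap`, `count_global_locks`, and `kChunkSize`, the fixed chunk size, and the simple batching policy are all assumptions made for the example. The local heap is used without any lock, while every fetch or return operation on the global heap takes a lock, and the helper counts those lock acquisitions.

```cpp
#include <cassert>
#include <cstddef>
#include <mutex>
#include <vector>

// Hypothetical fixed chunk size in bytes (an assumption for this sketch).
constexpr std::size_t kChunkSize = 64;

// Global heap: shared by all threads, so every transfer takes the lock.
class GlobalHeap {
public:
    ~GlobalHeap() {
        for (void* chunk : free_) ::operator delete(chunk);
    }
    // Fetch operation: move `count` chunks into a local heap's cache.
    void fetch(std::vector<void*>& out, std::size_t count) {
        std::lock_guard<std::mutex> guard(lock_);
        ++lock_acquisitions_;
        for (std::size_t i = 0; i < count; ++i) {
            if (free_.empty()) free_.push_back(::operator new(kChunkSize));
            out.push_back(free_.back());
            free_.pop_back();
        }
    }
    // Return operation: release up to `count` chunks back to the global heap.
    void give_back(std::vector<void*>& in, std::size_t count) {
        std::lock_guard<std::mutex> guard(lock_);
        ++lock_acquisitions_;
        for (std::size_t i = 0; i < count && !in.empty(); ++i) {
            free_.push_back(in.back());
            in.pop_back();
        }
    }
    // Number of fetch/return operations that had to lock the global heap.
    std::size_t lock_acquisitions() {
        std::lock_guard<std::mutex> guard(lock_);
        return lock_acquisitions_;
    }
private:
    std::mutex lock_;
    std::vector<void*> free_;
    std::size_t lock_acquisitions_ = 0;
};

// Local heap: owned by exactly one thread, so its cache needs no lock.
class LocalHeap {
public:
    LocalHeap(GlobalHeap& global, std::size_t batch)
        : global_(global), batch_(batch) {
        assert(batch_ > 0);
    }
    ~LocalHeap() {
        if (!cache_.empty()) global_.give_back(cache_, cache_.size());
    }
    void* allocate() {
        if (cache_.empty()) global_.fetch(cache_, batch_);  // slow path
        void* chunk = cache_.back();                        // fast path
        cache_.pop_back();
        return chunk;
    }
    void deallocate(void* chunk) {
        cache_.push_back(chunk);
        // Assumed policy: return one batch once the cache exceeds two batches.
        if (cache_.size() > 2 * batch_) global_.give_back(cache_, batch_);
    }
private:
    GlobalHeap& global_;
    std::size_t batch_;
    std::vector<void*> cache_;
};

// Allocates and then frees `allocs` chunks through one local heap and
// reports how many times the global heap lock was taken along the way.
std::size_t count_global_locks(std::size_t batch, std::size_t allocs) {
    GlobalHeap global;
    {
        LocalHeap local(global, batch);
        std::vector<void*> live;
        for (std::size_t i = 0; i < allocs; ++i) live.push_back(local.allocate());
        for (void* chunk : live) local.deallocate(chunk);
    }  // the local heap returns its remaining chunks here
    return global.lock_acquisitions();
}
```

In a multiple-threaded program each thread would own its own `LocalHeap` (for example through a `thread_local` object), and only the fetch and return paths would contend for the global heap. Calling `count_global_locks(1, 64)` and `count_global_locks(16, 64)` shows how a larger transfer batch reduces the number of global heap lock acquisitions for the same workload.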
Memory allocation modules comprehensively addressing the problem of contention for the allocator are relatively unknown. In most existing allocators, performance tuning is largely left to experts who devise default parameters based upon broad assumptions concerning program behavior and performance. For example, Doug Lea engineered his classic dlmalloc allocator so that “[if] configured using default settings [it] should perform well across a wide range of real loads.” [3] dlmalloc allows those default settings to be modified via a mallopt call supporting programmer-specifiable parameters such as the size of an “arena” (the size of chunks of memory that are to be requested from the operating system for use by the program) and the size of an “arena cache” (the number of allocated-but-free chunks to be held for program reuse rather than immediate return to the operating system), but that capability is used infrequently and on an ad hoc basis. Due to the increasing importance of thread-level concurrency, various next-generation parallel memory allocators, some of which use sophisticated and highly tunable heuristics, are being developed. But these allocators tend to follow dlmalloc in pursuing uniform performance across wide ranges of loads. When the level of concurrency varies greatly, e.g., from a few threads to several hundred or more, there typically will not be a single set of parameters that consistently works well. Thus, there is a need for a program memory allocation module which may be readily controlled based upon easily understood parameters.
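By way of illustration only, the kind of ad hoc tuning through a mallopt call mentioned above may look like the following C++ sketch. It assumes the GNU C library (glibc) allocator, which descends from dlmalloc, rather than dlmalloc itself. The parameters `M_TOP_PAD` and `M_TRIM_THRESHOLD` are glibc parameters chosen here as loose analogues of the “arena” and “arena cache” sizes discussed above, and that mapping is an assumption. The helper names `tuning_supported` and `tune_allocator` are hypothetical.

```cpp
#include <cassert>
#include <cstdio>    // any libc header; also makes __GLIBC__ visible on glibc
#if defined(__GLIBC__)
#include <malloc.h>  // glibc: mallopt, M_TRIM_THRESHOLD, M_TOP_PAD
#endif

// True when this build targets glibc, whose mallopt interface is used below.
constexpr bool tuning_supported() {
#if defined(__GLIBC__)
    return true;
#else
    return false;
#endif
}

// Hypothetical helper: applies two tuning parameters to the process-wide
// allocator and reports whether both were accepted.
bool tune_allocator(int trim_threshold_bytes, int top_pad_bytes) {
    assert(trim_threshold_bytes >= 0 && top_pad_bytes >= 0);
#if defined(__GLIBC__)
    // Free memory kept at the top of the heap before it is returned to the
    // operating system (assumed analogue of the "arena cache").
    const bool trim_ok = mallopt(M_TRIM_THRESHOLD, trim_threshold_bytes) == 1;
    // Extra memory requested each time the heap is grown (assumed analogue
    // of the "arena" size).
    const bool pad_ok = mallopt(M_TOP_PAD, top_pad_bytes) == 1;
    return trim_ok && pad_ok;
#else
    (void)trim_threshold_bytes;
    (void)top_pad_bytes;
    return false;  // no mallopt tuning attempted on other C libraries
#endif
}
```

A program might call `tune_allocator(1 << 20, 1 << 16)` once at startup. The values are fixed for the whole process and chosen by hand, which reflects the limitation noted above: a single set of parameters is unlikely to suit both a handful of threads and several hundred.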