Modern multi-CPU architectures can achieve very good performance when the majority of memory accesses hit memory that is in cache, or at least not in the cache of another CPU. If more than one CPU accesses the same memory address, or an address on the same cache line (typically 64 bytes), this provokes cache misses and cache contention, which not only limit per-thread throughput but also reduce scalability.
As CPUs continue their trend of getting wider but not much faster, scalability becomes more important than “straight-line” performance. In recent years, the advice to “avoid operations such as mutexes,” while still containing an element of truth, is better restated as “avoid operations on mutexes that are contended.” As applications adapt to this changing CPU landscape, they also tend to increase the number of threads they use.
The number of CPUs available to a program can vary dramatically. The same program may be run in a constrained Virtual Machine environment where only one or two CPU cores are available to it, or on hardware where it is expected to scale across large servers with over 100 processing cores. Operating systems have also become more sophisticated, automatically maintaining affinity between threads and CPU cores and, when allocating memory, being aware of the Non-Uniform Memory Access (NUMA) architecture of the host and thus allocating memory that is local to the CPU performing the allocation.
Memory allocators must balance a number of competing trade-offs: they must perform quickly for individual calls, scale well across many threads, and be memory efficient, and they must behave correctly even when called from multiple threads concurrently. Further, it can be important to allocate memory in a way that avoids problems such as false sharing, where two threads concurrently use blocks of memory that reside on the same cache line and thus provoke cache contention. Applications may allocate, access, and free memory all on one thread, or may allocate on one thread and later access and free the memory from other threads. The latter is a common situation in message-passing applications.
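One common way for an allocator to avoid false sharing is to align each block to a cache-line boundary, so that two blocks handed to two different threads can never occupy the same line. The following is a minimal sketch, assuming a 64-byte cache line and C11's aligned_alloc; the name alloc_line_aligned is illustrative and not from any particular allocator:

```c
#include <stdint.h>
#include <stdlib.h>

/* Assumed cache-line size; 64 bytes is typical on current x86 and ARM parts. */
#define CACHE_LINE 64

/* Round the request up to a whole number of cache lines and align the block
 * itself, so no two returned blocks can ever share a cache line. */
static void *alloc_line_aligned(size_t size)
{
    size_t rounded = (size + CACHE_LINE - 1) & ~(size_t)(CACHE_LINE - 1);
    return aligned_alloc(CACHE_LINE, rounded);
}
```

The rounding step is required because C11 aligned_alloc expects the size to be a multiple of the alignment; the cost is up to one cache line of padding per block.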
U.S. Pat. No. 6,427,195 B1 describes a very widely used technique of maintaining per-thread “free-lists”. When allocating memory, if a block is available on the thread's local free-list, it is removed from the free-list and returned to the application. When freeing a block of memory, it is added to the thread's free-list. A free-list has a maximum size; when the free-list is full, a free instead falls back to a global allocation strategy, such as a global free-list, which may require taking locks that can be contended. Similarly, when allocating, if the local free-list is empty, a global allocation strategy is used (the “Multi-arena” allocator in U.S. Pat. No. 6,427,195 B1).
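The fast paths of such a per-thread free-list scheme can be sketched as follows. This is a simplified model, assuming C11 _Thread_local storage, a single block size, and ordinary malloc/free standing in for the global allocation strategy; the names fl_alloc, fl_free, and FL_MAX are hypothetical and not taken from the patent:

```c
#include <stddef.h>
#include <stdlib.h>

#define FL_MAX 32        /* maximum free-list length (assumed limit)   */
#define BLOCK_SIZE 128   /* single size class, for simplicity          */

struct node { struct node *next; };

/* Per-thread state: _Thread_local gives each thread its own list head. */
static _Thread_local struct node *free_list = NULL;
static _Thread_local size_t free_count = 0;

void *fl_alloc(void)
{
    if (free_list) {                  /* fast path: pop from local list */
        struct node *n = free_list;
        free_list = n->next;
        free_count--;
        return n;
    }
    return malloc(BLOCK_SIZE);        /* slow path: global allocator    */
}

void fl_free(void *p)
{
    if (free_count < FL_MAX) {        /* fast path: push onto local list */
        struct node *n = p;
        n->next = free_list;
        free_list = n;
        free_count++;
    } else {
        free(p);                      /* list full: return to global     */
    }
}
```

Note that neither fast path takes a lock or touches shared state; contention arises only on the global fallback paths.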
Per-thread pools that scale with the number of threads are an increasing trend. Thus, the amount of “cached” memory across all threads is increasing. This is memory held by the per-thread free-lists: it is not being used to hold application data, yet it is not available for the operating system to re-use. Previous solutions can easily lead to a relatively large amount of cached memory across all threads. They also do not handle the common use case of transferring objects from one thread to another very well, as memory is allocated on producer threads and then freed on consumer threads. This results in the producer emptying its local free-list and the consumer filling its free-list, and as a consequence there is excessive memory usage and decreased cache efficiency. When many threads act as both consumers and producers in an application, there is no guarantee that a block of memory will be reused by the same threads or hardware CPU cores. Such schemes also encourage allocated blocks of memory to “migrate” across CPU cores; the CPU core that becomes the primary user of a block of memory may not be the one that allocated it, and thus that CPU core may be using memory that is not local to it. There are practical concerns as well with schemes such as those disclosed in U.S. Pat. No. 6,427,195 B1. These schemes require initializing and releasing data structures at the beginning and end of every thread, which requires cooperation from the thread library and/or across all libraries within a process. This is difficult and cumbersome to do in a cross-platform way, and it reduces the performance of starting and terminating threads.
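To illustrate the per-thread teardown burden described above: on POSIX systems, draining a thread's free-list at thread exit typically relies on a destructor registered with pthread_key_create, which must be set up once per process and is only invoked for threads that actually touched the key. This is a sketch of that mechanism under those assumptions; the names fl_key, fl_init, and drain_free_list are illustrative:

```c
#include <pthread.h>
#include <stdlib.h>

struct node { struct node *next; };

/* The per-thread free-list head is stored under a pthread key so that
 * the destructor can drain it when the owning thread exits. */
static pthread_key_t fl_key;

/* Called by the thread library at thread exit, with the thread's
 * last-stored key value (the free-list head) as the argument. */
static void drain_free_list(void *head)
{
    struct node *n = head;
    while (n) {
        struct node *next = n->next;
        free(n);
        n = next;
    }
}

/* Must run once per process before any thread uses the allocator. */
static void fl_init(void)
{
    pthread_key_create(&fl_key, drain_free_list);
}
```

Every library that keeps per-thread caches needs an arrangement like this, which is the cross-library cooperation the passage above refers to; on platforms without pthreads an equivalent hook must be found separately.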