The appendix contains 4308 lines of source code in the C++ programming language.
The invention relates to memory allocation in computer systems and more particularly to dynamic memory allocation in single-thread and parallel (multi-process/multi-thread) environments.
Today, most modern programming languages, including C and C++, allow the user to request memory blocks from the system memory at run-time and release these blocks back to the system memory when the program no longer needs them. This is known as dynamic memory allocation. The part of the programming environment that serves the dynamic memory allocation requests is known as a heap manager or dynamic memory allocator. Typically such services are provided by the operating system, by the standard (compiler) run-time libraries, by third-party libraries, or by a combination of the above.
The C programming language provides memory management capability with a set of library functions known as "memory allocation" routines. The most basic memory allocation function is called "malloc", which allocates a requested number of bytes and returns a pointer that is the starting address of the memory allocated. A function known as "free" returns the memory previously allocated by "malloc" back to the system so that it can be allocated again later for use by other routines. Another frequently used function is "realloc", which changes the size and possibly the address of an allocated memory block, while preserving its contents.
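As an illustration of the three routines just described, the following minimal sketch allocates a small block, grows it, and releases it. The function name `grow_buffer_demo` and the buffer sizes are hypothetical and chosen only for this example.

```cpp
#include <cassert>
#include <cstdlib>
#include <cstring>

// Allocates a small buffer, grows it with realloc, and releases it.
// Returns true if the stored string survived the resize.
bool grow_buffer_demo() {
    char* p = static_cast<char*>(std::malloc(8));
    if (p == nullptr) return false;
    std::strcpy(p, "heap");  // 5 bytes including the terminator

    // realloc may move the block; the old contents are preserved.
    char* q = static_cast<char*>(std::realloc(p, 64));
    if (q == nullptr) {      // on failure the old block is still valid
        std::free(p);
        return false;
    }
    bool preserved = std::strcmp(q, "heap") == 0;
    std::free(q);            // return the block to the allocator
    return preserved;
}
```

Note that the pointer returned by "realloc" replaces the old one; the old pointer must not be used after a successful call.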
The requirements for dynamic memory allocators are these: high speed, low maximal response time, high memory utilization, scalability with CPU count, and independence from the thread count. A desirable feature is simplicity of usage, preferably in a form equivalent to the de facto standard C language memory allocation functions.
Many dynamic allocators for single-thread environments have been designed over the years. A representative survey of this work was done by Wilson, Johnstone, Neely and Boles in their paper "Dynamic Storage Allocation: A Survey and Critical Review", 1995. This survey does not cover the aspects of memory allocation specific to multithread environments. On the other hand, almost every multithread allocator uses some form of underlying monolithic (single-thread) memory allocator in its implementation.
Some of the memory allocators known in the prior art use deferred coalescing of free blocks and store free blocks in segregated lists of blocks of the same fixed size. They define a plurality of predetermined fixed block sizes. Each block list contains free blocks of the same size, which is one of the predetermined fixed sizes. In the prior art terminology these lists are also known as "bins", "quick lists", "block queues", or "fast lists". In some variants, blocks bigger than a predetermined maximal block size are managed by a separate method. An additional method is used for coalescing and splitting of blocks, when necessary.
The operation of allocators of the aforementioned type includes a step of determining the fixed size that corresponds to a certain block size. Depending on what kind of request is being served (allocation or disallocation), the fixed size to be found is either the smallest one that is not less than the requested size, or the greatest one that is not greater than the requested size.
There are several methods for determining such a size. Some methods store the fixed sizes in an array, e.g. in increasing order, and search the array, either sequentially or by binary search, for the size satisfying the search condition. For instance, the "Hoard" allocator version 2.0.5.1 by Emery Berger uses sequential search.
These search methods require several steps, which take time. This time adds to the time of the allocation request.
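The two search conditions and the two search strategies described above can be sketched as follows. This is a minimal illustration, not code from any of the cited allocators; the size classes in `kSizes` and all function names are hypothetical.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cstddef>

// Hypothetical fixed sizes in increasing order (illustrative values only).
constexpr std::array<std::size_t, 8> kSizes = {8, 16, 32, 64, 128, 256, 512, 1024};

// Sequential search: smallest fixed size >= n, or 0 if n exceeds every size.
std::size_t class_for_alloc_linear(std::size_t n) {
    for (std::size_t s : kSizes)
        if (s >= n) return s;
    return 0;
}

// Binary search for the same answer.
std::size_t class_for_alloc_binary(std::size_t n) {
    auto it = std::lower_bound(kSizes.begin(), kSizes.end(), n);
    return it == kSizes.end() ? 0 : *it;
}

// Greatest fixed size <= n (for a block being freed), or 0 if n is
// smaller than every fixed size.
std::size_t class_for_free(std::size_t n) {
    auto it = std::upper_bound(kSizes.begin(), kSizes.end(), n);
    return it == kSizes.begin() ? 0 : *(it - 1);
}
```

The sequential version takes up to one step per fixed size, while the binary version takes a number of steps logarithmic in the count of fixed sizes; both cost more than a single memory access.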
An obvious way to speed up this search is to use a table whose size equals the maximal fixed size, to fill each entry in the table with the fixed size corresponding to the index of that entry, and to obtain the fixed size corresponding to a block size by simply taking the value in the table at the position equal to the block size. This method is very fast and typically takes only several CPU instructions. Obviously, the same technique may be used for obtaining the index of a fixed size rather than the fixed size itself, if this is more appropriate.
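A minimal sketch of the table method follows, using the index variant with one byte per entry. The maximal fixed size `kMaxFixed`, the size classes in `kClassSizes`, and the function names are assumptions made for this example only.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

constexpr std::size_t kMaxFixed = 1024;  // hypothetical maximal fixed size
constexpr std::array<std::size_t, 8> kClassSizes = {8, 16, 32, 64, 128, 256, 512, 1024};

// One byte per possible request size: entry n holds the index of the
// smallest fixed size that can hold n bytes.
std::array<std::uint8_t, kMaxFixed + 1> build_table() {
    std::array<std::uint8_t, kMaxFixed + 1> t{};
    std::size_t cls = 0;
    for (std::size_t n = 0; n <= kMaxFixed; ++n) {
        if (n > kClassSizes[cls]) ++cls;  // n has outgrown the current class
        t[n] = static_cast<std::uint8_t>(cls);
    }
    return t;
}

const std::array<std::uint8_t, kMaxFixed + 1> kTable = build_table();

// A single indexed load; the caller must ensure n <= kMaxFixed.
inline std::size_t class_size_for(std::size_t n) {
    return kClassSizes[kTable[n]];
}
```

With these illustrative values the table occupies 1025 bytes; the lookup is fast only as long as the touched entries stay in the CPU cache.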
A drawback of this method is that the table that is needed might be quite big, relative to the size of the CPU cache. Typical values are 5 Kbytes for the table, corresponding to a maximal fixed block size of 5 Kbytes, and 32 Kbytes for data cache in a CPU.
The relatively big table size increases the CPU cache miss ratio, which in turn degrades the performance of memory allocation operations that use this technique.
Another method for determining such a fixed size index is presented in U.S. Pat. No. 5,623,654, 1997, Peterman. It uses a table of limited size (e.g. 1024) and stores the free blocks whose sizes have the same residue modulo the table size in the same entry of the table. This method still requires several steps to determine the fixed size corresponding to a block size.
The blocks residing in any particular free block list are unavailable for anything except allocating blocks of size equal to or smaller than the fixed size for this list. It would be desirable for the count of blocks in any given free list to be big enough that most allocation requests are satisfied by taking blocks from it, rather than from the splitting/coalescing part of the allocator. On the other hand, this count should be small, so that the memory overhead is low. It would also be desirable for the aforementioned count to change dynamically in accordance with the current program workload; i.e. after a sufficiently long period of no allocation/disallocation activity for a particular fixed size, the entire list should be purged to the coalescing level, making the memory available for allocation as blocks of other sizes, or to other programs.
U.S. Pat. No. 5,109,336, 1992, Guenther et al, discloses a method for continuously purging unused blocks from the free lists in order to improve the memory usage. The method disclosed, though, does not provide any correlation between the program workload and the counts of the blocks remaining in the free block lists.
In general, the use of segregated fixed-size block lists with deferred coalescing has not found wide support in the monolithic (single-thread) memory allocators of the prior art. Some of the reasons for this are the unsolved problems mentioned above.
In a parallel computing environment with shared memory, e.g. a multithreaded program, more than one thread of execution runs simultaneously. Although threads are typically scheduled to run concurrently on the available CPUs, from the end user's point of view they appear to run in parallel. In multi-processor systems this parallelism is real. Typically, each thread has a set of associated system resources (registers, stack, local thread data, etc.), accessible by this thread only. There are also resources, among which is the dynamically allocated memory, that are shared among the threads.
In parallel environments, methods are needed to ensure that the threads coordinate their activities, so that one thread does not accidentally change the data (or other type of object) that another thread is working on. This is known as access serialization. Serialization is achieved by providing means that limit the number of threads that access the same object concurrently.
For purposes of serialization, multithread programming environments traditionally use the notion of a "mutex" (an abbreviation for "mutual exclusion lock"). A mutex is an object that can be acquired by at most one thread at a time. The operation of acquiring the mutex causes the calling thread to wait until the mutex becomes available, and to proceed further only after it has successfully acquired the mutex.
When a certain shared object needs to be accessed exclusively by one thread at a time, the traditional approach is to use a mutex associated with the object. The first thread to access the object acquires the associated mutex, accesses the object, and when done, releases the mutex. If, during this time, a second thread tries to access the object, it waits until the mutex is released by the first thread, and proceeds with the access only after acquiring the mutex itself. This way, no two threads can be active on the shared object at the same time.
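The traditional approach described above can be sketched with the standard C++ threading library. The `SharedCounter` type and the `run_counter_demo` function are hypothetical names introduced only for this illustration.

```cpp
#include <cassert>
#include <mutex>
#include <thread>
#include <vector>

struct SharedCounter {
    std::mutex m;    // the mutex associated with the object
    long value = 0;  // the shared object itself

    void increment() {
        std::lock_guard<std::mutex> lock(m);  // waits until the mutex is free
        ++value;                              // exclusive access
    }                                         // mutex released here
};

// Starts `threads` threads that each increment the counter `per_thread`
// times, waits for all of them, and returns the final value.
long run_counter_demo(int threads, int per_thread) {
    SharedCounter c;
    std::vector<std::thread> pool;
    for (int t = 0; t < threads; ++t)
        pool.emplace_back([&c, per_thread] {
            for (int i = 0; i < per_thread; ++i) c.increment();
        });
    for (auto& th : pool) th.join();
    return c.value;
}
```

Every call to `increment` pays for one acquisition and one release of the mutex, even when no other thread is competing for it.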
Using mutexes is expensive (i.e. slow) for two reasons: a) the contention of multiple threads for the mutex, and b) the high cost of obtaining the mutex even if there is no contention for it.
One way to solve the problem of the high contention rate is to design programs in such a way that a particular object is accessed exclusively by a single "owning" thread. This solves both the contention rate and the mutex acquisition cost problems mentioned above, because there is no need for serialization at all.
There are situations, though, in which it is not possible to establish truly private ownership of a particular object by any given thread. In these cases, even if one ("owning") thread accesses the object "almost" exclusively, some other ("alien") thread may still need to access it, even if only occasionally. In such cases, no matter how infrequently the alien thread accesses the object, access to it must be serialized.
In the prior art, access to a shared object was serialized by a mutex, no matter how asymmetrically (i.e. with what different frequencies) different threads accessed that particular shared object.
There are cases in which the time needed to acquire or release a mutex constitutes a significant part of, or even exceeds, the time needed to perform the actual object access. In this way, the acquisition and release of the mutex become a performance bottleneck. In such situations, a serialization method requiring no mutexes would be of considerable benefit.
Methods for memory allocation and disallocation that perform well in a single-thread environment are typically not adequate when used directly in a multithread one. For instance, immediate coalescing of free blocks, which was found to be quite suitable for single-threaded environments, requires a single common mutex for serializing the access of all the threads to the memory allocator data structures. This causes considerable performance degradation for memory-allocation-intensive applications that are executed by several threads in parallel, due to the high contention rate on the common mutex.
Some methods for memory allocation focused specifically on multithread environments are disclosed in the academic literature and in several US patents. The following techniques are found in various memory allocators for multithread environments: a) using fixed-size blocks and deferred or no coalescing; b) using multiple pools of free memory, each preferably mapped to a different "owning" thread; c) transferring several blocks at a time between a thread's private pool and the common storage at the cost of a single mutex acquisition.
In U.S. Pat. No. 6,058,460, 2000, Nakhimovsky discloses an allocator that uses a predetermined number of pools (blocks of contiguous storage of predetermined size) for memory management. Mutexes are used to serialize the access to the pools in order to avoid conflicts in cases when either two threads are mapped to the same pool, or when a block allocated by one thread is disallocated by another.
The allocator described in this patent has several weak points. When two or more threads map to the same pool, there is contention for this pool, with all the negative consequences that follow. The probability of this happening grows as the thread count increases, because the number of pools is predetermined and presumably small. Decreasing this probability requires establishing more memory pools, which worsens memory utilization. But even if each thread is mapped to a separate pool, contention still exists when a thread releases blocks allocated by another thread. A scenario of very low memory utilization is also possible: allocating a single small block from a pool of size e.g. 64 Kbytes.
In "Memory Allocation for Long-Running Server Applications", ISMM '98, available on the Internet at www.acm.org/pubs/citations/proceedings/plan/286860/p176-larson, Larson and Krishnan describe an allocator featuring multiple heaps. A heap controls a contiguous memory area (stripe) of a predetermined size. Each heap is organized as a set of free block lists segregated by size. The access to the individual lists is serialized using mutexes; there are no other locks. The number of heaps is predetermined (e.g. 10), and conceptually smaller than the number of threads. Each running thread is mapped in a pseudo-random way to one of the heaps.
The allocator disclosed by Larson and Krishnan suffers performance degradation when two or more threads map to the same heap (which happens inevitably once the thread count exceeds the heap count), due to contention for the respective heaps and lists. Another drawback of this allocator is its low memory utilization, caused by the fact that heaps manage predetermined and relatively big areas (stripes) of memory.
Another memory allocator for multithread environments is the Hoard allocator by E. Berger, 1999. It uses a kind of fixed-size free block lists, organized as "superblocks", which are contiguous blocks of storage of predetermined size. For each fixed size, each thread has a private heap, that is, a set of superblocks. A thread allocates blocks from its own private heap. Each thread can release blocks allocated by another thread, serializing the access to the corresponding private heap by using the mutex associated with it. The amount of blocks in a heap is limited to twice the predetermined "superblock" size. If, as a result of a disallocation, the maximal allowed per-heap memory size is exceeded, the emptiest superblock from the heap is transferred to the global heap.
The Hoard allocator has these drawbacks. When a small block of e.g. 1 byte is allocated from a "superblock" of e.g. 8 Kbytes, the memory utilization is very low. A scenario yielding bad performance is also possible: contention for private heaps occurs when threads disallocate blocks allocated by other threads. The Hoard allocator uses a linear search to determine the fixed size class corresponding to the requested block size, which requires, upon each allocation, a number of steps up to the count of predetermined fixed sizes.
In "A Scalable and Efficient Storage Allocator on Shared-Memory Multiprocessors", COMPSPAC '98, Vee and Hsu describe an allocator for blocks of one fixed size. It uses private (per thread) pools. Each pool consists of an active list and a backup list. Each list can contain at most a predetermined number of blocks. The backup list in a pool is used to store blocks only if the active list is full. If both lists are full at the end of a disallocation, one of them is transferred, at the cost of a single operation and mutex acquisition, to a global (common to all threads) stack of full lists. If both lists in a pool are empty upon allocation, one of them is refilled from the global stack of full lists at the cost of a single mutex acquisition.
According to the algorithm described in the cited paper, each pool will contain, on average, a number of blocks equal to the aforementioned predetermined maximal number of blocks in a list. Unless this number is very small, this causes a relatively big memory overhead. On the other hand, if the predetermined maximal list length were small, private pool lists would be refilled from and purged to the global stack of full lists more frequently. This can cause a high contention rate for the global stack, which is serialized by a common mutex. Additionally, no block coalescing is foreseen, which makes the potential memory overhead even higher and makes the algorithm inapplicable as a universal dynamic memory allocator.
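The active/backup scheme summarized above can be sketched as follows. This is a simplified illustration based only on the description in this text, not the code of the cited paper; the names `Pool`, `GlobalStack`, `kBlockSize` and `kListMax`, the swap of the two lists when the active one runs empty, and the fallback to "malloc" when the global stack is empty are all assumptions.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <mutex>
#include <utility>
#include <vector>

constexpr std::size_t kBlockSize = 64;  // hypothetical single fixed block size
constexpr std::size_t kListMax = 4;     // hypothetical maximal list length

using BlockList = std::vector<void*>;

// Global stack of full lists, shared by all threads.
struct GlobalStack {
    std::mutex m;
    std::vector<BlockList> full_lists;
};

// Per-thread pool; only its owning thread touches active and backup.
struct Pool {
    BlockList active, backup;

    void* allocate(GlobalStack& g) {
        if (active.empty()) std::swap(active, backup);  // assumption: reuse backup
        if (active.empty()) {                           // both lists are empty
            std::lock_guard<std::mutex> lock(g.m);      // one acquisition per list
            if (!g.full_lists.empty()) {
                active = std::move(g.full_lists.back());
                g.full_lists.pop_back();
            }
        }
        if (active.empty()) return std::malloc(kBlockSize);  // fallback (assumption)
        void* p = active.back();
        active.pop_back();
        return p;
    }

    void release(void* p, GlobalStack& g) {
        // The backup list is used only when the active list is full.
        BlockList& dst = active.size() < kListMax ? active : backup;
        dst.push_back(p);
        if (active.size() == kListMax && backup.size() == kListMax) {
            std::lock_guard<std::mutex> lock(g.m);      // one acquisition per list
            g.full_lists.push_back(std::move(backup));
            backup.clear();  // leave the moved-from list explicitly empty
        }
    }
};
```

In this sketch a whole list of `kListMax` blocks changes hands per mutex acquisition, which is the source of the scheme's low locking cost and also of its memory overhead.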
None of the allocators using fixed-sized free block lists (or equivalent constructs), known in the prior art, makes any attempt to keep the length of these lists at some kind of optimum, depending on the behavior of the end-user application, i.e. the allocation and disallocation frequencies. In some situations this is a reason for low memory utilization, while in others it is a reason for bad performance.
The present invention is a method of efficiently managing working storage in multi-thread (parallel) environments by combining private (per thread) sets of fixed-size block lists with an external splitting and coalescing general-purpose allocator. A global set of free block lists, common to all threads, enables efficient transfer of free blocks from one thread to another.
Blocks of size greater than the maximal fixed size are managed directly by the external allocator.
The lengths of the fixed-size lists are changed dynamically on a per-thread and per-size-class basis in accordance with the allocation and disallocation workload.
A service thread performs regular updates of all the lists and collects the memory associated with terminated user threads. A mutex-free serialization method utilizing thread suspension is used in the process.
Several objects and advantages of the present invention are:
a) to reduce the amount of processor time required to allocate, disallocate and reallocate a block of memory in multithreaded and single-threaded environments;
b) to reduce the amount of total working memory required, particularly during periods of low workload;
c) to achieve performance scalability as the number of available CPUs increases;
d) to eliminate performance degradation, caused by thread serialization issues, as the number of user threads or CPUs increases;
e) to eliminate performance degradation and memory wastage in situations, in which one thread disallocates blocks that were allocated by another thread;
f) to reduce memory wastage as the numbers of CPUs and user threads increase;
g) to use a simple and standard end-user interface.