The present invention generally relates to computer operating systems. More particularly, the present invention relates to performance optimization of multi-threaded application programs by reducing the need for mutual exclusion locking and/or the need for defragmentation operations, i.e., coalescence operations.
Single-tasking operating systems are inefficient because computer programs or program subroutines are executed serially, i.e., no computer program or program subroutine can begin to execute until the previous one terminates. Inefficiencies inherent in such single-tasking operating systems led to the development of multitasking or multithreaded operating systems. In these latter operating systems, each computer program, or process, being executed may comprise one or more sub-processes (sometimes referred to as threads). A multi-tasking operating system allows more than one thread to run concurrently. Modern operating systems include user space memory allocators, e.g., the malloc family of function calls, to manage allocation/deallocation of memory.
While it is significantly more efficient than a single-tasking operating system, a multi-tasking operating system requires a significant number of new features in order to allow an orderly processing of multiple threads at the same time. One of the special requirements is safeguarding against corruption of memory as a result of contentious access to a memory location by more than one thread.
In particular, one known conventional solution for the above safeguarding is by the use of a mutual exclusion (MUTEX) lock. Typically a region of the memory is identified as a critical region that contains critical data structures which could become corrupted if they were manipulated concurrently by multiple threads. A MUTEX lock is given to an owner thread currently operating in the critical region, and prevents other threads from executing in the critical region while the owner thread is executing. When the owner thread is no longer operating in the critical region, the MUTEX lock is released to allow another thread to take ownership of the critical region. The use of a MUTEX lock thus maintains integrity of the data structures in the critical region.
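The locking discipline described above can be illustrated with a minimal sketch using POSIX threads. The structure `counter_list`, the functions `shared_add`, `worker`, and `run_workers`, and the limit of 16 threads are hypothetical names and values chosen for illustration only; they are not taken from any particular allocator.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Hypothetical shared structure standing in for the critical data structures. */
struct counter_list {
    pthread_mutex_t lock;   /* MUTEX guarding the field below */
    long            total;  /* shared state that must not be updated concurrently */
};

static struct counter_list shared = { PTHREAD_MUTEX_INITIALIZER, 0 };

/* Critical region: only the thread that owns the lock performs the update. */
static void shared_add(long amount)
{
    int rc = pthread_mutex_lock(&shared.lock);   /* take ownership */
    assert(rc == 0);
    shared.total += amount;                      /* manipulate protected data */
    rc = pthread_mutex_unlock(&shared.lock);     /* release for the next thread */
    assert(rc == 0);
}

/* Thread entry point: adds 1 to the shared total *arg times. */
static void *worker(void *arg)
{
    long n = *(long *)arg;
    for (long i = 0; i < n; i++)
        shared_add(1);
    return NULL;
}

/* Runs nthreads workers concurrently and returns the final total. */
static long run_workers(int nthreads, long per_thread)
{
    pthread_t tid[16];
    assert(nthreads > 0 && nthreads <= 16);
    shared.total = 0;
    for (int i = 0; i < nthreads; i++) {
        int rc = pthread_create(&tid[i], NULL, worker, &per_thread);
        assert(rc == 0);
    }
    for (int i = 0; i < nthreads; i++)
        pthread_join(tid[i], NULL);
    return shared.total;
}
```

Because every update passes through the lock, the total returned by `run_workers` equals the number of threads multiplied by the per-thread count. Each call to `shared_add` also pays for one acquisition and one release, whether or not any other thread is running.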
However, MUTEX locking exacts a significant performance cost. Firstly, all other threads are prevented from executing in the critical region while the MUTEX is locked. This means that any other thread that attempts to execute in the critical region must wait until the lock is released (e.g., in case of a binary MUTEX) before entering the region. The idling delay of the threads while waiting for the release of the MUTEX lock is sometimes referred to as the performance cost of "MUTEX contention".
Secondly, the time necessary to acquire and release the MUTEX lock may in and of itself be a significant performance cost. For example, even when only a single thread is running at the time, in order to access the critical region, the MUTEX lock must nevertheless be acquired and released, thus adding delay. This delay associated with acquisition and release of the MUTEX lock is sometimes referred to as the performance cost of "MUTEX locking".
In application program interfaces (API) using memory allocators, e.g., a malloc, the performance cost of both the MUTEX contention and MUTEX locking can be very large for applications that use the APIs intensively. In a simplistic locking scheme, e.g. locking a MUTEX around all the code of the API, effectively only one thread can use the API at a time. This can lead to unacceptably poor performance of a multi-threaded program. As can be appreciated, minimizing MUTEX contentions and MUTEX locking is crucial in optimizing performance of multi-threaded applications.
Prior known attempts to reduce MUTEX contentions in a memory allocator, e.g., a malloc family of function calls, include multiple heaps (or arenas) and multiple free lists. In the multi-arena solution, multiple memory allocation arenas, each having its own MUTEX lock, are maintained. In addition, each thread is assigned to one of the multiple memory allocation arenas. Because fewer threads are assigned to a single arena, fewer threads are likely to contend for the same MUTEX lock. The number of arenas may be tunable to allow the application program to control how many threads are contending for each MUTEX lock.
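The multi-arena arrangement can be sketched roughly as follows. This is an illustration under stated assumptions, not the layout of any actual malloc implementation: `NUM_ARENAS`, `struct arena`, `arena_for_thread`, `arena_alloc`, and `arena_free` are hypothetical, the thread-to-arena assignment is a simple modulo of a caller-supplied thread index, and the standard `malloc` stands in for carving memory out of the arena's own heap.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>
#include <stdlib.h>

#define NUM_ARENAS 4              /* hypothetical tunable arena count */

struct arena {
    pthread_mutex_t lock;         /* each arena has its own MUTEX */
    size_t          bytes_in_use; /* per-arena bookkeeping */
};

static struct arena   arenas[NUM_ARENAS];
static pthread_once_t arenas_once = PTHREAD_ONCE_INIT;

static void arenas_init(void)
{
    for (int i = 0; i < NUM_ARENAS; i++) {
        pthread_mutex_init(&arenas[i].lock, NULL);
        arenas[i].bytes_in_use = 0;
    }
}

/* Assigns a thread to an arena; thread_index is a hypothetical small id. */
static struct arena *arena_for_thread(unsigned thread_index)
{
    pthread_once(&arenas_once, arenas_init);
    return &arenas[thread_index % NUM_ARENAS];
}

/* Allocates under the arena's own lock, so only threads that share
 * this arena can contend with one another. */
static void *arena_alloc(struct arena *a, size_t size)
{
    pthread_mutex_lock(&a->lock);
    void *p = malloc(size);       /* stand-in for the arena's private heap */
    if (p != NULL)
        a->bytes_in_use += size;
    pthread_mutex_unlock(&a->lock);
    return p;
}

/* Returns a block to the arena it was allocated from. */
static void arena_free(struct arena *a, void *p, size_t size)
{
    if (p == NULL)
        return;
    pthread_mutex_lock(&a->lock);
    free(p);
    a->bytes_in_use -= size;
    pthread_mutex_unlock(&a->lock);
}
```

Threads with indices 0 and 1 use different locks and never contend with each other, while threads 0 and 4 share an arena and still can. Note that every allocation still acquires and releases a lock, so the cost of MUTEX locking remains even when contention is absent.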
While the multi-arena solution does reduce MUTEX contentions, it does not completely eliminate them, since some contentions still occur within each arena. MUTEX contentions could possibly be eliminated if each thread were given its own arena. However, since each arena grows separately, increasing the number of arenas can significantly increase memory consumption of a multithreaded application. Thus, a per-thread arena is not a practical solution in typical multi-threaded applications. Moreover, the per-thread-arena solution may not eliminate the performance cost of MUTEX locking.
Another problem associated with conventional memory allocators, e.g., a malloc, is the performance cost associated with coalescence of freed memory blocks. Over a period of time of accessing a memory, the available memory space can become fragmented, i.e., exist only in small blocks. When new data is stored in the scattered, fragmented small blocks of memory, it takes longer to access the newly stored data than if the data were stored in a single contiguous block of memory.
Conventionally, the memory fragmentation problem is addressed by coalescing freed blocks with free neighboring blocks (if there are any). The resulting coalesced block has a larger size than when it first became free, and thus it requires a rearrangement of the data structure (e.g., the free list employed to keep track of blocks of a given size) to reflect the existence of the new block with the new size and the removal of the smaller blocks that were coalesced into the new block.
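A simplified sketch of coalescence on release is given below. As an assumption made for brevity, the free list here is a small array kept sorted by address (offset) rather than organized by block size as in the arrangement described above. The names `free_block`, `free_list`, `free_list_release`, and the capacity `MAX_FREE` are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_FREE 32   /* hypothetical capacity of the free list */

struct free_block { size_t offset; size_t size; };

struct free_list {
    struct free_block blk[MAX_FREE];  /* kept sorted by offset */
    size_t            count;
};

/* Records a freed block and coalesces it with adjacent free neighbors.
 * Returns 0 on success, -1 if the list is full. */
static int free_list_release(struct free_list *fl, size_t offset, size_t size)
{
    size_t i = 0;

    /* Search for the insertion point: part of the cost of coalescence. */
    while (i < fl->count && fl->blk[i].offset < offset)
        i++;

    int merge_prev = i > 0 &&
        fl->blk[i - 1].offset + fl->blk[i - 1].size == offset;
    int merge_next = i < fl->count &&
        offset + size == fl->blk[i].offset;

    if (merge_prev && merge_next) {
        /* Three blocks become one; the following entry is removed. */
        fl->blk[i - 1].size += size + fl->blk[i].size;
        for (size_t j = i; j + 1 < fl->count; j++)
            fl->blk[j] = fl->blk[j + 1];
        fl->count--;
    } else if (merge_prev) {
        fl->blk[i - 1].size += size;          /* grow the preceding block */
    } else if (merge_next) {
        fl->blk[i].offset = offset;           /* grow the following block downward */
        fl->blk[i].size  += size;
    } else {
        if (fl->count == MAX_FREE)
            return -1;
        for (size_t j = fl->count; j > i; j--) /* shift to make room */
            fl->blk[j] = fl->blk[j - 1];
        fl->blk[i].offset = offset;
        fl->blk[i].size   = size;
        fl->count++;
    }
    return 0;
}
```

For example, releasing blocks at offsets 0 and 32 (16 bytes each) and then the 16-byte block at offset 16 leaves a single 48-byte free block at offset 0. Every release involves a search and possibly a shift of the list, and in a size-organized free list the merged block would additionally have to be moved to the position matching its new size.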
This rearranging is very time consuming since it involves searching the free list for the insertion point of the new block, and for the blocks to be removed. Coalescence is thus a fairly expensive operation in terms of performance. A number of conventional algorithms have been devised to attempt to reduce performance costs associated with coalescence operations. For example, conventional small block allocators (SBA) do not coalesce freed blocks in satisfying requests for small blocks of memory. Additionally, prior attempts, e.g., Cartesian trees, have been made to reduce the search times of the free list.
Unfortunately, however, even with these prior attempts, conventional memory allocators have not been able to eliminate the need for coalescence operations, and thus still suffer from a significant performance loss due to coalescence operations.
Thus, what is needed is an efficient system for and method of memory allocation, which further reduces the MUTEX contentions and/or the MUTEX locking, and thus the performance losses attendant thereto.
What is also needed is an efficient system and method for memory allocation, which further reduces the need for coalescence operations, and the performance losses attendant thereto.
In accordance with the principles of the present invention, a method of allocating a block of memory in response to a memory allocation request from a thread in a multi-threaded operating system, comprises providing a cache slot being private to the thread, the cache slot having cached therein the block of memory previously freed by the thread, determining whether the memory allocation request can be satisfied out of the cache slot, and if the memory allocation request can be satisfied out of the cache slot, satisfying the memory allocation request out of the cache slot.
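The method summarized above may be sketched, under several simplifying assumptions, as follows. The sketch assumes a single cache slot per thread holding at most one block, C11 thread-local storage (`_Thread_local`) as the mechanism that makes the slot private, a caller-supplied block size on release, and the standard `malloc`/`free` as the fallback allocator. The names `cache_slot`, `cached_alloc`, and `cached_free` are hypothetical and do not come from the specification.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical private cache slot: holds at most one block that was
 * previously freed by the owning thread. */
struct cache_slot {
    void  *block;   /* cached block, or NULL if the slot is empty */
    size_t size;    /* usable size of the cached block */
};

/* One slot per thread; no other thread can reach it, so no lock is taken. */
static _Thread_local struct cache_slot slot;

/* Satisfies the request out of the cache slot when possible,
 * otherwise falls back to the ordinary allocator. */
static void *cached_alloc(size_t size)
{
    if (slot.block != NULL && slot.size >= size) {
        void *p = slot.block;
        slot.block = NULL;
        slot.size  = 0;
        return p;
    }
    return malloc(size);
}

/* Caches the freed block in the thread's slot if the slot is empty;
 * otherwise returns the block to the ordinary allocator. */
static void cached_free(void *p, size_t size)
{
    if (p == NULL)
        return;
    if (slot.block == NULL) {
        slot.block = p;
        slot.size  = size;
        return;
    }
    free(p);
}
```

In this sketch a thread that frees a 64-byte block and then requests 32 bytes receives the same block back without taking a lock and without the block having been coalesced in the meantime. A fuller design would also have to return a still-cached block to the ordinary allocator when the thread exits; that step is omitted here.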
In addition, in accordance with the principles of the present invention, a computer readable storage medium having stored thereon a computer program for implementing a method of allocating a block of memory in response to a memory allocation request from a thread in a multi-threaded operating system, the computer program comprising a set of instructions for providing a cache slot being private to the thread, the cache slot having cached therein the block of memory previously freed by the thread, determining whether the memory allocation request can be satisfied out of the cache slot, and if the memory allocation request can be satisfied out of the cache slot, satisfying the memory allocation request out of the cache slot.