1. Field of the Invention
This invention relates to the field of memory management and more particularly to consumer and producer elimination.
2. Description of the Related Art
Elimination is a parallelization technique that allows producer and consumer processes or threads to transfer data without having to access a shared central data structure. The shared queue structure of a first-in-first-out (FIFO) queue implementation is an example of such a shared central data structure. A traditional FIFO queue supports an enqueue operation, which places a value at the end of a shared central queue, and a dequeue operation, which removes the first value (if any) from the front of the central queue. Concurrent FIFO queues are widely used in a variety of systems and applications. For example, queues are an essential building block of concurrent data structure libraries such as JSR-166, the Java® Concurrency Package described in “Concurrent Hash Map in JSR166 Concurrency Utilities” by D. Lea, which can be found online at gee.cs.oswego.edu/dl/concurrency-interest/index.html.
However, as the number of processes that enqueue and dequeue data on the FIFO queue increases, an individual process is more likely to be blocked by the locking that prevents simultaneous access to the main FIFO queue. Similar situations may occur when multiple processes attempt to use a single, shared central stack, or last-in-first-out (LIFO) queue. Elimination allows matching pairs of producer and consumer processes (or threads) to transfer data when a central data structure, such as a stack or queue, is blocked due to concurrent access.
In recent years, progress has been made towards practical lock-free FIFO queue implementations that avoid the numerous problems associated with traditional lock-based implementations. In particular, the lock-free FIFO queue implementation of Michael and Scott described in “Simple, fast, and practical nonblocking and blocking concurrent queue algorithms” in Proceedings of the 15th Annual ACM Symposium on Principles of Distributed Computing, pages 219-228 (1996) (hereinafter referred to as MS-queue) outperforms previous concurrent FIFO queues across a wide variety of loads. A key reason is that the MS-queue algorithm allows enqueue and dequeue operations to complete without synchronizing with each other when the queue is nonempty. In contrast, many previous algorithms have the disadvantage that enqueue operations interfere with concurrent dequeue operations. Nonetheless, the MS-queue still requires concurrent enqueue operations to synchronize with each other, and likewise requires concurrent dequeue operations to synchronize with each other. As a result, as the number of threads concurrently accessing the central queue increases, the head and the tail of the queue become bottlenecks, and performance generally suffers. Therefore, while the MS-queue algorithm provides good performance on small-to-medium machines, it does not scale well to larger machines.
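The structure described above can be illustrated with a minimal Java sketch in the style of the MS-queue. It is not the authors' published code: the class name MSQueueSketch, the Node type, and the convention of returning null from an empty queue are assumptions made for illustration. Enqueue operations contend only on the tail reference and dequeue operations only on the head reference, which is why each end becomes a bottleneck as the number of threads grows.

```java
import java.util.concurrent.atomic.AtomicReference;

// Illustrative sketch of a Michael-and-Scott style lock-free FIFO queue.
// All names here are hypothetical; Java's garbage collector is relied on
// to avoid the ABA problem that the original algorithm handles with counters.
class MSQueueSketch<T> {
    private static final class Node<T> {
        final T value;
        final AtomicReference<Node<T>> next = new AtomicReference<>(null);
        Node(T value) { this.value = value; }
    }

    private final AtomicReference<Node<T>> head;
    private final AtomicReference<Node<T>> tail;

    MSQueueSketch() {
        Node<T> dummy = new Node<>(null);   // sentinel node; head always points at it
        head = new AtomicReference<>(dummy);
        tail = new AtomicReference<>(dummy);
    }

    // Enqueuers synchronize with each other on the tail only.
    void enqueue(T value) {
        Node<T> node = new Node<>(value);
        while (true) {
            Node<T> last = tail.get();
            Node<T> next = last.next.get();
            if (last != tail.get()) continue;          // tail moved; re-read
            if (next == null) {
                if (last.next.compareAndSet(null, node)) {
                    tail.compareAndSet(last, node);    // swing tail; failure is harmless
                    return;
                }
            } else {
                tail.compareAndSet(last, next);        // help a lagging enqueuer
            }
        }
    }

    // Dequeuers synchronize with each other on the head only.
    // Returns null when the queue is empty (an assumption of this sketch).
    T dequeue() {
        while (true) {
            Node<T> first = head.get();
            Node<T> last = tail.get();
            Node<T> next = first.next.get();
            if (first != head.get()) continue;         // head moved; re-read
            if (next == null) return null;             // queue is empty
            if (first == last) {
                tail.compareAndSet(last, next);        // tail is lagging; help it
                continue;
            }
            T value = next.value;
            if (head.compareAndSet(first, next)) return value;
        }
    }
}
```

Under this sketch, a failed compareAndSet on head or tail simply causes a retry, so heavy contention at either end turns into repeated retries rather than blocking.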
Various elimination techniques are available. For example, Shavit and Touitou introduce an elimination technique in Theory of Computing Systems, 30:645-670 (1997). Their elimination technique is used to implement a scalable stack and is based on the observation that if a push operation on a stack is immediately followed by a pop operation, then there is no net effect on the contents of the stack. Therefore, if a push operation and a pop operation can somehow “pair up,” the pop operation can return the element being added by the push operation, and both operations can return without making any modification to the stack. In other words, we “pretend” that the two operations instantaneously pushed the value onto the stack and then popped it. Exploiting this observation requires a mechanism by which push and pop operations can pair up without synchronizing on centralized data.
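The observation underlying elimination can be checked sequentially with a short Java sketch. The class and method names below are hypothetical and serve only to illustrate the point: a push immediately followed by a pop hands the pushed value to the pop and leaves the stack contents exactly as they were.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Sequential illustration only; it does not show how concurrent operations pair up.
class PushPopNoNetEffect {
    // Performs a push immediately followed by a pop and returns the popped value.
    static <T> T pushThenPop(Deque<T> stack, T value) {
        stack.push(value);
        return stack.pop();
    }

    // Returns true if the pop received the pushed value and the stack
    // holds the same elements, in the same order, as before the pair ran.
    static boolean leavesStackUnchanged(List<Integer> initial, int value) {
        Deque<Integer> stack = new ArrayDeque<>(initial);
        List<Integer> before = new ArrayList<>(stack);
        int popped = pushThenPop(stack, value);
        return popped == value && before.equals(new ArrayList<>(stack));
    }
}
```

Because the pair has no net effect, the two operations could equally have exchanged the value directly and never touched the stack, which is what an elimination scheme arranges.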
Shavit and Touitou implement a stack using a tree-like structure in which operations attempt to pair up and eliminate each other. The implementation is lock-free and scalable, but is not linearizable. Linearizability is discussed in “Linearizability: A Correctness Condition for Concurrent Objects” by M. Herlihy and J. Wing in ACM Transactions on Programming Languages and Systems, 12(3):462-492 (1990).
Shavit and Zemach introduced combining funnels in “Combining funnels: a dynamic approach to software combining,” Journal of Parallel and Distributed Computing, 60(11):1355-1387 (2000), and used them to provide scalable stack implementations. Combining funnels employ both combining and elimination to achieve good scalability. They improve on elimination trees by being linearizable, but unfortunately they are blocking.
Both the elimination tree approach and the combining funnels approach are directed at scalability under high load, but their performance is generally substantially worse than other stack implementations under low loads. This may be a significant disadvantage, as it is often difficult to predict load ahead of time and load may vary over the lifetime of a particular data structure.
Additionally, Hendler, Shavit, and Yerushalmi introduced a scalable stack implementation in “A Scalable Lock-Free Stack Algorithm,” Proceedings of the 16th Annual ACM Symposium on Parallelism in Algorithms and Architectures, pages 206-215, ACM Press (2004). This scalable stack implementation performs well at low loads and also scales under increasing load. Their implementation builds on a conventional (non-scalable) lock-free stack implementation. Such implementations typically use an optimistic style in which an operation succeeds if it does not encounter interference, and retries when it does. Since repeated retries on such stacks can lead to very poor performance, they are typically used with some form of backoff technique, in which operations wait for some time before retrying. However, as the load increases, the amount of backoff required to reduce contention also increases. Thus, while backoff can improve performance, it does not achieve good scalability.
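A conventional optimistic lock-free stack with backoff, of the kind described above, might be sketched in Java as follows. This is an illustration under stated assumptions rather than the implementation used by Hendler et al.: the class name BackoffStackSketch, the randomized exponential backoff policy, and the MIN_NANOS and MAX_NANOS bounds are all hypothetical choices, and pop returns null on an empty stack.

```java
import java.util.concurrent.ThreadLocalRandom;
import java.util.concurrent.atomic.AtomicReference;
import java.util.concurrent.locks.LockSupport;

// Illustrative optimistic lock-free stack with randomized exponential backoff.
class BackoffStackSketch<T> {
    private static final class Node<T> {
        final T value;
        final Node<T> next;
        Node(T value, Node<T> next) { this.value = value; this.next = next; }
    }

    // Hypothetical backoff bounds; real values would be tuned per machine.
    private static final long MIN_NANOS = 1_000L;
    private static final long MAX_NANOS = 1_000_000L;

    private final AtomicReference<Node<T>> top = new AtomicReference<>(null);

    void push(T value) {
        long limit = MIN_NANOS;
        while (true) {
            Node<T> old = top.get();
            // Optimistic attempt: succeeds only if no other thread changed top.
            if (top.compareAndSet(old, new Node<>(value, old))) return;
            limit = backoff(limit);       // interference detected; wait, then retry
        }
    }

    // Returns null when the stack is empty (an assumption of this sketch).
    T pop() {
        long limit = MIN_NANOS;
        while (true) {
            Node<T> old = top.get();
            if (old == null) return null;
            if (top.compareAndSet(old, old.next)) return old.value;
            limit = backoff(limit);
        }
    }

    // Waits a random time up to the current limit and returns a doubled limit,
    // capped at MAX_NANOS.
    private static long backoff(long limit) {
        LockSupport.parkNanos(ThreadLocalRandom.current().nextLong(limit) + 1);
        return Math.min(limit * 2, MAX_NANOS);
    }
}
```

Every operation still funnels through the single top reference, so longer waits reduce failed attempts but do nothing to increase the number of operations that can complete in parallel.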
The stack implementation introduced by Hendler et al. includes a stack (the central stack) and a “collision array.” Operations first attempt to access the conventional stack, but in the case of interference, rather than simply waiting for some time and then trying again, they attempt to find another operation to pair up with for elimination. To achieve good scalability, this pairing up must be achieved without synchronizing on any centralized data. Therefore, a push operation chooses a location at random in the collision array and attempts to “meet” a pop operation at that location. Under higher load, the probability of meeting an operation with which to eliminate is higher, and this is key to achieving good scalability. However, the implementation introduced by Hendler et al. does not satisfy the properties of a FIFO queue.
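The central stack and collision array arrangement can be sketched in Java as shown below. This is a simplified illustration, not the algorithm as published by Hendler et al.: it uses java.util.concurrent.Exchanger objects as a stand-in for the collision array locations, so it may block briefly while waiting for a partner and is not strictly lock-free. The class name, the slots and waitMicros parameters, and the requirement that pushed values be non-null are assumptions of the sketch. A push offers its value and a pop offers null; a push that receives null has met a pop, and a pop that receives a non-null value has met a push.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Objects;
import java.util.concurrent.Exchanger;
import java.util.concurrent.ThreadLocalRandom;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.concurrent.atomic.AtomicReference;

// Illustrative elimination stack: central lock-free stack plus a collision array.
class EliminationStackSketch<T> {
    private static final class Node<T> {
        final T value;
        final Node<T> next;
        Node(T value, Node<T> next) { this.value = value; this.next = next; }
    }

    private final AtomicReference<Node<T>> top = new AtomicReference<>(null);
    private final List<Exchanger<T>> collisions = new ArrayList<>();
    private final long waitMicros;   // hypothetical: how long to wait for a partner

    EliminationStackSketch(int slots, long waitMicros) {
        for (int i = 0; i < slots; i++) collisions.add(new Exchanger<>());
        this.waitMicros = waitMicros;
    }

    void push(T value) {
        Objects.requireNonNull(value);   // null is reserved to mark a pop
        while (true) {
            Node<T> old = top.get();
            if (top.compareAndSet(old, new Node<>(value, old))) return;
            // Interference on the central stack: try to meet a pop instead.
            // Receiving null means the partner was a pop that took our value.
            if (visit(value, true) == null) return;
        }
    }

    // Returns null when the central stack is observed empty.
    T pop() {
        while (true) {
            Node<T> old = top.get();
            if (old == null) return null;
            if (top.compareAndSet(old, old.next)) return old.value;
            // Interference: a non-null result means we met a push.
            T other = visit(null, false);
            if (other != null) return other;
        }
    }

    // Offers 'mine' at a random collision slot. On timeout or interruption,
    // returns a value that the caller interprets as "no elimination happened":
    // 'mine' itself for a push (non-null) and null for a pop.
    private T visit(T mine, boolean isPush) {
        Exchanger<T> slot = collisions.get(ThreadLocalRandom.current().nextInt(collisions.size()));
        try {
            return slot.exchange(mine, waitMicros, TimeUnit.MICROSECONDS);
        } catch (TimeoutException e) {
            return isPush ? mine : null;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return isPush ? mine : null;
        }
    }
}
```

For example, new EliminationStackSketch<Integer>(8, 50) would create a stack with eight collision slots and a 50-microsecond wait; both numbers are arbitrary. Two pushes or two pops that meet at a slot simply fail to eliminate and retry on the central stack. As the text notes, pairing a push with a pop is valid for a stack but does not by itself preserve FIFO order.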