The present invention relates generally to computer systems and, more particularly, to a method and system for implementing concurrent array-based data structures such as queues, stacks and double-ended queues.
A concurrent data structure refers to a data structure used concurrently by multiple application threads. Concurrent accesses to the concurrent data structure have to be synchronized to avoid corrupting the data structure or its contents.
The concurrent data structures discussed in this disclosure are stacks, queues and deques. A deque is a double-ended queue: it is similar to an ordinary queue, except that it allows insertion and deletion at both the front and the back.
In an array-based concurrent data structure, each element or object in the data structure is an element in the array. An element in the array might store the data object itself or might store a pointer to the data object. The maximum number of objects in the data structure is given by the number of elements in the array. At any given instant, each array element stores either nothing or a single data object of the application. To have a single terminology across the various data structures, a thread is said to put an object into the data structure, and such a thread is said to be a putter. Likewise, a thread is said to take an object from the data structure, and such a thread is said to be a taker. After an object is taken from the data structure, the corresponding array element stores no object and thus is free and available for some future thread wishing to put an object. Thus, as threads put objects into and take objects from the data structure, each element of the array is used and re-used for different objects. In other words, successive objects pass through each element of the array.
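The put and take terminology above can be illustrated with a minimal single-threaded sketch. It is not taken from this disclosure: the class name `ArrayQueue`, its members, and the choice of `int` objects are hypothetical, and the sketch deliberately omits all synchronization, so it shows only how each array element holds either nothing or one object and is re-used over time.

```cpp
#include <cassert>
#include <cstddef>
#include <optional>
#include <vector>

// Hypothetical illustration only: a fixed-capacity circular queue of ints.
// NOT thread-safe; it shows slot occupancy and re-use, nothing more.
class ArrayQueue {
public:
    explicit ArrayQueue(std::size_t capacity) : slots_(capacity) {
        assert(capacity > 0);
    }

    // Put: store the object in the next slot. Fails if that slot still
    // holds an object, which means the queue is full.
    bool put(int value) {
        std::optional<int>& slot = slots_[put_count_ % slots_.size()];
        if (slot.has_value()) return false;
        slot = value;
        ++put_count_;
        return true;
    }

    // Take: remove the oldest object. The slot then stores nothing and is
    // free for a future put. Returns nullopt if the queue is empty.
    std::optional<int> take() {
        std::optional<int>& slot = slots_[take_count_ % slots_.size()];
        if (!slot.has_value()) return std::nullopt;
        std::optional<int> result = slot;
        slot.reset();
        ++take_count_;
        return result;
    }

private:
    std::vector<std::optional<int>> slots_;  // each element: nothing or one object
    std::size_t put_count_ = 0;              // total successful puts so far
    std::size_t take_count_ = 0;             // total successful takes so far
};
```

With a capacity of two, a third put fails until a take frees the first element, after which that same element is re-used for the new object.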
In applications on multiprocessor systems, a common performance bottleneck is a concurrent array-based data structure such as a concurrent queue, deque or stack. Thus, it is desirable to provide a method and system to improve the performance of concurrent array-based data structures (e.g., to give the data structure a faster access time or to increase its throughput).
According to the publication by M. Michael and M. Scott, "Nonblocking algorithms and preemption-safe locking on multiprogrammed shared-memory multiprocessors," Journal of Parallel and Distributed Computing, 51(1):1-26, 1998:
In general, efficient data-structure-specific nonblocking algorithms outperform both ordinary and preemption-safe lock-based alternatives . . . . An implementation of a data structure is nonblocking (also known as lock-free) if it guarantees that at least one process of those trying to update the data structure concurrently will succeed in completing its operation within a bounded amount of time, assuming that at least one process is active, regardless of the state of other processes. Nonblocking algorithms do not require any communication with the kernel and by definition they cannot use mutual exclusion . . . . No practical nonblocking implementations for array-based stacks or circular queues have been proposed. Using general methodologies would result in inefficient algorithms. For these data structures lock-based algorithms have been the only option.
Thus, a practical fast nonblocking implementation of array-based concurrent stacks, queues and deques would be novel and desirable.
For array-based concurrent stacks, queues and deques, practical prior art implementations are blocking. That is, a putter or taker locks the entire data structure to block other putters or takers. This results in low performance since it limits the number of concurrent operations to one. While other concurrent data structures such as priority queue heaps use nonblocking implementations by locking individual elements of the data structure, no locking of individual elements is known to have been done for practical concurrent stacks, queues and deques.
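The blocking approach described above can be sketched as follows. This is an assumed illustration rather than any particular prior art implementation: the class name `CoarseLockedStack`, the fixed capacity of four, and the use of `std::mutex` are all choices made for the example. The point is that a single lock guards the whole array, so at most one put or take proceeds at a time.

```cpp
#include <array>
#include <cstddef>
#include <mutex>
#include <optional>

// Hypothetical illustration of a blocking array-based stack: one mutex
// protects the entire structure, serializing all putters and takers.
class CoarseLockedStack {
public:
    // Put: push the object, or fail if every array element is occupied.
    bool put(int value) {
        std::lock_guard<std::mutex> guard(lock_);  // blocks all other threads
        if (size_ == slots_.size()) return false;
        slots_[size_++] = value;
        return true;
    }

    // Take: pop the most recently put object, or nullopt if empty.
    std::optional<int> take() {
        std::lock_guard<std::mutex> guard(lock_);  // blocks all other threads
        if (size_ == 0) return std::nullopt;
        return slots_[--size_];
    }

private:
    static constexpr std::size_t kCapacity = 4;  // assumed size for the sketch
    std::mutex lock_;                            // single lock for the whole array
    std::array<int, kCapacity> slots_{};
    std::size_t size_ = 0;
};
```

Because every operation acquires the same mutex, adding threads does not add throughput; this is the limitation the passage above attributes to locking the entire data structure.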
For array-based stacks, queues and deques, practical high performance by locking individual elements of the data structure would be novel and desirable.
In the prior art, synchronized access to shared data is often done using a ticket lock. A ticket lock is a form of inter-thread synchronization. The principles of a ticket lock can be analogized to a scenario in which a person at a service counter initially receives a unique ticket number from a dispenser and then waits until that number is served. For array-based stacks, queues and deques, practical high performance by using a ticket lock per element of the data structure would be novel and desirable.
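The service-counter analogy can be made concrete with a minimal sketch of a conventional ticket lock. It is an assumed illustration built on C++ `std::atomic`, not the per-element scheme of this disclosure; the names `TicketLock`, `next_ticket_`, `now_serving_` and the helper `count_with_ticket_lock` are hypothetical.

```cpp
#include <atomic>
#include <thread>
#include <vector>

// Hypothetical sketch of a conventional ticket lock.
class TicketLock {
public:
    void lock() {
        // Take a unique ticket number from the "dispenser".
        const unsigned my_ticket =
            next_ticket_.fetch_add(1, std::memory_order_relaxed);
        // Wait until that number is being served.
        while (now_serving_.load(std::memory_order_acquire) != my_ticket) {
            std::this_thread::yield();
        }
    }

    void unlock() {
        // Only the current holder advances the counter: serve the next ticket.
        now_serving_.fetch_add(1, std::memory_order_release);
    }

private:
    std::atomic<unsigned> next_ticket_{0};  // the dispenser
    std::atomic<unsigned> now_serving_{0};  // the "now serving" display
};

// Hypothetical demo helper: several threads increment a plain counter,
// each increment performed while holding the ticket lock.
long count_with_ticket_lock(int num_threads, int increments_per_thread) {
    TicketLock lock;
    long counter = 0;
    std::vector<std::thread> workers;
    for (int t = 0; t < num_threads; ++t) {
        workers.emplace_back([&lock, &counter, increments_per_thread] {
            for (int i = 0; i < increments_per_thread; ++i) {
                lock.lock();
                ++counter;  // protected by the lock
                lock.unlock();
            }
        });
    }
    for (std::thread& w : workers) w.join();
    return counter;
}
```

Tickets are served in the order they were issued, so waiting threads acquire the lock in first-come, first-served order.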
Concurrent data structures are implemented using synchronization primitives. Examples include various forms of fetch-and-operate. Such a fetch-and-operate primitive atomically reads, modifies and writes a memory location. Known fetch-and-operate primitives include test-and-set, fetch-and-store (also known as swap), fetch-and-add, fetch-and-increment, store-add and compare-and-swap.
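As an illustration of what these primitives return, the following sketch maps several of them onto standard C++ atomic operations: test-and-set onto `std::atomic_flag::test_and_set`, fetch-and-store onto `exchange`, fetch-and-add onto `fetch_add`, and compare-and-swap onto `compare_exchange_strong`. The struct `Observed` and the function `run_primitives` are hypothetical names introduced for the example, and the mapping is an assumption about how a software reader might exercise the primitives, not a description of any particular hardware.

```cpp
#include <atomic>

// Hypothetical record of the values each primitive returned.
struct Observed {
    bool first_test_and_set;   // flag value seen by the first test-and-set
    bool second_test_and_set;  // flag value seen by the second test-and-set
    int fetch_and_store_old;   // old value returned by fetch-and-store (swap)
    int fetch_and_add_old;     // old value returned by fetch-and-add
    bool cas_matching;         // compare-and-swap with a correct expected value
    bool cas_stale;            // compare-and-swap with a stale expected value
    int final_value;           // contents of the memory location at the end
};

Observed run_primitives() {
    Observed o{};

    std::atomic_flag flag = ATOMIC_FLAG_INIT;
    o.first_test_and_set = flag.test_and_set();   // returns false, sets flag
    o.second_test_and_set = flag.test_and_set();  // returns true, flag stays set

    std::atomic<int> word{5};
    o.fetch_and_store_old = word.exchange(10);  // returns 5, word becomes 10
    o.fetch_and_add_old = word.fetch_add(3);    // returns 10, word becomes 13

    int expected = 13;
    // Succeeds because word == expected; word becomes 20.
    o.cas_matching = word.compare_exchange_strong(expected, 20);

    expected = 13;
    // Fails because word is now 20, not 13; word is left unchanged.
    o.cas_stale = word.compare_exchange_strong(expected, 99);

    o.final_value = word.load();
    return o;
}
```

In each case the read, the modification and the write happen as one atomic step, so no other thread can observe or alter the location in between. Fetch-and-increment corresponds to `fetch_add(1)` in this mapping.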
If multiple threads concurrently execute fetch-and-increment on the same memory location, the values returned are consecutive integers. These values can then be used as indices into an array with the assurance that each array element is assigned to exactly one thread. Fetch-and-increment has been used to implement an array-based queue. One memory location is used to generate producer indices into the array. Another memory location is used to generate consumer indices into the array. A shortcoming of that approach is that fetch-and-increment on its own allows a consumer to be assigned to an element to which no producer has yet been assigned. Accordingly, an improved synchronization primitive which prevents a consumer from being assigned to an element to which no producer has yet been assigned is desirable.
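The shortcoming can be seen in a minimal sketch of the two-counter scheme. The struct `IndexDispenser` and its member names are hypothetical, and wrapping the counter modulo the capacity is an assumption made so that the indices stay within a fixed-size array; the sketch shows only index assignment, not the storage of objects.

```cpp
#include <atomic>
#include <cstddef>

// Hypothetical sketch: two fetch-and-increment counters hand out array
// indices, one counter for producers and one for consumers.
struct IndexDispenser {
    std::atomic<std::size_t> producer_counter{0};
    std::atomic<std::size_t> consumer_counter{0};
    std::size_t capacity;

    explicit IndexDispenser(std::size_t n) : capacity(n) {}

    // Each call returns the next consecutive producer index (wrapped).
    std::size_t claim_producer_slot() {
        return producer_counter.fetch_add(1) % capacity;
    }

    // Each call returns the next consecutive consumer index (wrapped).
    // Nothing here checks whether a producer has claimed that element.
    std::size_t claim_consumer_slot() {
        return consumer_counter.fetch_add(1) % capacity;
    }

    // True when more consumer indices than producer indices have been
    // handed out, i.e. some consumer holds an element with no producer.
    bool consumer_ahead_of_producer() const {
        return consumer_counter.load() > producer_counter.load();
    }
};
```

If a consumer calls `claim_consumer_slot()` on a fresh dispenser, it is assigned element 0 even though no producer has been assigned to that element; the consumer must then wait on the element by some other means, which is the gap an improved primitive would close.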
Fetch-and-increment may be relatively easily implemented in computer hardware by having a processor core issue a normal load to a special memory address. The memory subsystem recognizes the special address and performs the fetch-and-increment. When many threads concurrently issue fetch-and-increment to the same memory location, such a hardware implementation in the memory subsystem can satisfy a fetch-and-increment operation every few processor clock cycles. Accordingly, a fast and relatively easy hardware implementation is desirable for an improved synchronization primitive which prevents a consumer from being assigned to an element to which no producer has yet been assigned.