Computer systems use various data structures to store and manipulate data during program execution. One such data structure, shown in FIG. 1, is a first-in-first-out (FIFO) queue. A FIFO queue (100) is a data structure in which cells are removed in the same order in which the cells were added. The removal of an existing cell (e.g., 102) takes place at one end, typically referred to as the “head” of the queue (100), and the addition of a new cell (e.g., 104) takes place at the other end, typically referred to as the “tail” of the queue (100). The operation that adds a new cell (e.g., 104) to the queue (100) is called an “enqueue” operation, and the operation that removes a cell (e.g., 102) from the queue (100) is called a “dequeue” operation. A FIFO queue (100) may be represented in the memory of a computer system as a singly-linked list. In such a representation, each cell (e.g., 102) of the queue (100) includes a value location (e.g., 110) and a pointer (e.g., 112) to the next cell (e.g., 114) in the queue (100). A tail pointer (108) points to the youngest cell in the singly-linked list, and a head pointer (106) points to the oldest cell.
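The singly-linked representation described above may be sketched in C as follows. This is a minimal, single-threaded illustration only; the names `struct cell`, `struct fifo`, `fifo_init`, `fifo_enqueue`, and `fifo_dequeue` are hypothetical and do not appear in the figures.

```c
#include <stdlib.h>

/* One cell of the list: a value location and a pointer to the next cell. */
struct cell {
    int value;
    struct cell *next;
};

/* The queue holds a head pointer (oldest cell) and a tail pointer (youngest cell). */
struct fifo {
    struct cell *head;
    struct cell *tail;
};

void fifo_init(struct fifo *q) {
    q->head = NULL;
    q->tail = NULL;
}

/* Enqueue: add a new cell at the tail. Returns 0 on success, -1 if allocation fails. */
int fifo_enqueue(struct fifo *q, int value) {
    struct cell *c = malloc(sizeof *c);
    if (c == NULL) {
        return -1;
    }
    c->value = value;
    c->next = NULL;
    if (q->tail != NULL) {
        q->tail->next = c;   /* link behind the current youngest cell */
    } else {
        q->head = c;         /* queue was empty: new cell is also the oldest */
    }
    q->tail = c;
    return 0;
}

/* Dequeue: remove the cell at the head. Returns 0 and stores the value in *out,
 * or returns -1 if the queue is empty. */
int fifo_dequeue(struct fifo *q, int *out) {
    struct cell *c = q->head;
    if (c == NULL) {
        return -1;
    }
    *out = c->value;
    q->head = c->next;
    if (q->head == NULL) {
        q->tail = NULL;      /* removed the last cell */
    }
    free(c);
    return 0;
}
```

This sketch is not safe for concurrent use; it shows only the data layout and the order in which cells enter and leave the queue.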
Concurrent FIFO queues are a commonly used type of concurrent data structure. A concurrent FIFO queue is a data structure sharable by concurrently executing threads that supports the usual enqueue and dequeue operations with linearizable FIFO semantics. Linearizability guarantees that queue operations appear atomic and can be combined with other operations in a modular way. In other words, linearizability provides the illusion that each operation on a concurrent FIFO queue applied by concurrent threads takes effect instantaneously at some point between invocation and response. The threads appear to be interleaved at the granularity of complete operations, and the order of non-overlapping operations is preserved.
Generally, implementations of concurrent FIFO queues are of two types: blocking (i.e., lock-based) and non-blocking (i.e., lock-free). In general, lock-based FIFO queue implementations offer limited robustness as processes are forced to wait to access a FIFO queue until a current process completes its access to the FIFO queue.
An implementation of a data structure is non-blocking (i.e., lock-free) if the implementation guarantees that at least one thread of those trying to update the data structure concurrently will succeed in completing its operation on the data structure within a bounded amount of time, assuming that at least one thread is active, regardless of the state of other threads. Non-blocking implementations generally rely on hardware support for an atomic primitive (i.e., a synchronization operation) such as a compare-and-swap instruction or the instruction pair load-linked and store-conditional.
A compare-and-swap instruction may operate on a single memory word or on two memory words in a single atomic operation. A single-word compare-and-swap operation (CAS) typically accepts three values, or quantities: a memory address A, a comparison value C, and a new value N. The CAS operation fetches and examines the contents V of memory at address A. If the contents V are equal to the comparison value C, then the new value N is stored into the memory location at address A, replacing the contents V. A Boolean return value indicates whether the replacement occurred. Whether or not V matches C, V is returned or saved in a register for later inspection (possibly replacing either C or N, depending on the implementation). Such an operation may be notated as “CAS(A, C, N).”
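As one illustration, the C11 function `atomic_compare_exchange_strong` from `<stdatomic.h>` behaves like the CAS(A, C, N) operation described above, with the observed contents V written back over C when the comparison fails. The wrapper `cas` and the helper `fetch_add_with_cas` below are hypothetical names introduced for this sketch, which also shows the retry loop that non-blocking algorithms typically build around CAS.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* CAS(A, C, N): if the contents of *a equal *c, store n and return true.
 * Otherwise leave *a unchanged, copy the observed contents V into *c,
 * and return false. */
bool cas(atomic_int *a, int *c, int n) {
    return atomic_compare_exchange_strong(a, c, n);
}

/* Add delta to *a without a lock by retrying CAS until it succeeds.
 * Returns the value that was replaced. */
int fetch_add_with_cas(atomic_int *a, int delta) {
    int seen = atomic_load(a);
    while (!cas(a, &seen, seen + delta)) {
        /* cas() refreshed `seen` with the current contents; try again. */
    }
    return seen;
}
```

A failed CAS in the loop means that some other thread changed the location first, which is why at least one of the competing threads always makes progress.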
A double-word compare-and-swap operation (DCAS) typically accepts six values: two memory addresses A1 and A2, two comparison values C1 and C2, and two new values N1 and N2. The DCAS operation fetches and examines the contents V1 of memory at address A1 and V2 of memory at address A2. If the contents V1 are equal to the comparison value C1 and the contents V2 are equal to the comparison value C2, then the new value N1 is stored into the memory location at address A1, replacing V1, and the new value N2 is stored into the memory location at address A2, replacing V2. A Boolean return value indicates whether the replacement occurred. Whether or not V1 and V2 match C1 and C2, respectively, V1 and V2 are returned or saved in a register for later inspection. Such an operation may be notated as “DCAS(A1, A2, C1, C2, N1, N2).”
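Standard C provides no DCAS over two independent addresses, so the following is only a sequential model of the semantics described above. It is written as ordinary C and is therefore NOT atomic; a real DCAS would perform the same steps as one indivisible operation. The function name `dcas_model` and the choice to write V1 and V2 back over C1 and C2 are assumptions made for illustration.

```c
#include <stdbool.h>

/* Sequential model of DCAS(A1, A2, C1, C2, N1, N2).
 * NOT atomic: it only documents what a hardware DCAS would do in one step. */
bool dcas_model(long *a1, long *a2, long *c1, long *c2, long n1, long n2) {
    long v1 = *a1;                      /* fetch contents V1 */
    long v2 = *a2;                      /* fetch contents V2 */
    bool match = (v1 == *c1) && (v2 == *c2);
    if (match) {
        *a1 = n1;                       /* both locations are replaced ... */
        *a2 = n2;                       /* ... or neither is */
    }
    *c1 = v1;                           /* report V1 and V2 to the caller */
    *c2 = v2;
    return match;
}
```

Note that a mismatch on either location leaves both locations unchanged.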
Load-linked and store-conditional operations must be used together to read, modify, and write a shared location. A load-linked operation returns the value stored at the shared location. A store-conditional operation checks if any other processor has since written to that shared location. If not, the location is updated and the operation returns success; otherwise, it returns failure.
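C has no direct load-linked or store-conditional operations, so the following is a sequential model of the behavior described above rather than a use of the hardware instructions. The write counter `writes` is a hypothetical stand-in for the tracking a processor performs, and the names `struct shared_loc`, `load_linked`, `store_conditional`, and `plain_store` are assumptions made for this sketch. The model is NOT atomic.

```c
#include <stdbool.h>

/* A shared location plus a count of writes, standing in for the
 * hardware's record of whether the location has been written. */
struct shared_loc {
    int value;
    unsigned long writes;
};

/* Load-linked: return the stored value and remember the write count. */
int load_linked(const struct shared_loc *loc, unsigned long *link) {
    *link = loc->writes;
    return loc->value;
}

/* Store-conditional: succeed only if no write has occurred since the
 * matching load_linked. */
bool store_conditional(struct shared_loc *loc, unsigned long link, int new_value) {
    if (loc->writes != link) {
        return false;        /* another write intervened: report failure */
    }
    loc->value = new_value;
    loc->writes++;
    return true;
}

/* An ordinary write by some other processor. */
void plain_store(struct shared_loc *loc, int v) {
    loc->value = v;
    loc->writes++;
}
```

In this model a store-conditional fails after any intervening write, even one that restores the original value, because the check is on writes rather than on contents.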
Concurrent data structure implementations in non-garbage-collected programming languages (e.g., the C programming language) that use CAS operations are susceptible to what is known as the ABA problem. If a thread reads a value A from a shared location, computes a new value, and then attempts a CAS operation, the CAS operation may succeed when it should not if, between the read and the CAS operation, other threads change the value of the shared location from A to B and back to A again (i.e., an ABA event). A typical solution to the ABA problem is to include a tag with the target memory location such that both are manipulated atomically and the tag is incremented with each update of the target location. The ABA problem does not occur in concurrent data structures implemented in garbage-collected languages (e.g., the Java™ programming language).
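The tagging approach may be sketched as follows, assuming a 32-bit value and a 32-bit tag packed into one 64-bit word so that a single CAS covers both. The layout and the names `tagged_word`, `pack`, `value_of`, `tag_of`, and `tagged_update` are hypothetical choices for this illustration.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Value in the low 32 bits, tag in the high 32 bits (assumed layout). */
typedef _Atomic(uint64_t) tagged_word;

uint64_t pack(uint32_t value, uint32_t tag) {
    return ((uint64_t)tag << 32) | value;
}

uint32_t value_of(uint64_t w) { return (uint32_t)w; }
uint32_t tag_of(uint64_t w)   { return (uint32_t)(w >> 32); }

/* Replace the value only if neither the value nor the tag has changed
 * since `seen` was read. Every successful update increments the tag
 * (the tag wraps around after 2^32 updates). */
bool tagged_update(tagged_word *loc, uint64_t seen, uint32_t new_value) {
    uint64_t desired = pack(new_value, tag_of(seen) + 1u);
    return atomic_compare_exchange_strong(loc, &seen, desired);
}
```

If the value goes from A to B and back to A between a thread's read and its CAS, the tag has advanced by two, so the thread's stale copy no longer matches and its CAS fails as intended.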
One approach to implementing a concurrent FIFO queue is based on the algorithm of Michael and Scott (hereinafter the “MS-queue”). See Michael, M. M., and Scott, M. L., “Simple, Fast, and Practical Non-Blocking and Blocking Concurrent Queue Algorithms,” Proceedings of the 15th Annual ACM Symposium on Principles of Distributed Computing (1996) pp. 267-275. A key feature of this algorithm is that it permits uninterrupted parallel access to the head and tail of the FIFO queue.
FIG. 2 shows a flow of an MS-queue implementation. An MS-queue (200) is based on concurrent manipulation of a singly-linked list. The dequeue operation requires a single successful CAS operation (CAS Head (214)) on the head pointer (206) in order to complete the removal of a node (202) at the head of the MS-queue (200). The enqueue operation requires two successful CAS operations (CAS Next (218), CAS Tail (216)), one on the next pointer (212) of the node (210) previously at the end of the MS-queue (200) and one on the tail pointer (208), in order to complete the addition of a new node (204) at the tail of the MS-queue (200). Requiring two successful CAS operations to complete an enqueue operation potentially increases contention for the MS-queue (200) and creates more opportunities for failed CAS operations, which may reduce overall performance. A CAS operation typically takes an order of magnitude longer to execute than a simple load or store operation because a CAS operation typically requires exclusive ownership of the cache line and flushing of the instruction pipeline of the processor.
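The CAS operations named above may be sketched with C11 atomics as follows. This is a simplified illustration in the style of the MS-queue, not the published algorithm verbatim. It assumes a dummy node at the head, omits the tags used against the ABA problem, and never frees dequeued nodes, so that node addresses are not reused and memory is deliberately leaked. All identifiers (`struct node`, `struct msqueue`, `msq_init`, `msq_enqueue`, `msq_dequeue`) are hypothetical.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdlib.h>

struct node {
    int value;
    _Atomic(struct node *) next;
};

struct msqueue {
    _Atomic(struct node *) head;   /* dummy node; oldest value is in head->next */
    _Atomic(struct node *) tail;   /* last node, or the one just before it */
};

int msq_init(struct msqueue *q) {
    struct node *dummy = malloc(sizeof *dummy);
    if (dummy == NULL) return -1;
    dummy->value = 0;
    atomic_init(&dummy->next, NULL);
    atomic_init(&q->head, dummy);
    atomic_init(&q->tail, dummy);
    return 0;
}

int msq_enqueue(struct msqueue *q, int value) {
    struct node *n = malloc(sizeof *n);
    if (n == NULL) return -1;
    n->value = value;
    atomic_init(&n->next, NULL);
    for (;;) {
        struct node *tail = atomic_load(&q->tail);
        struct node *next = atomic_load(&tail->next);
        if (tail != atomic_load(&q->tail)) continue;   /* tail moved: re-read */
        if (next != NULL) {
            /* Tail is lagging behind the last node: help advance it. */
            atomic_compare_exchange_strong(&q->tail, &tail, next);
            continue;
        }
        struct node *expected = NULL;
        /* CAS Next: link the new node after the current last node. */
        if (atomic_compare_exchange_strong(&tail->next, &expected, n)) {
            /* CAS Tail: swing the tail pointer to the new node. If this
             * fails, another thread has already advanced the tail. */
            atomic_compare_exchange_strong(&q->tail, &tail, n);
            return 0;
        }
    }
}

int msq_dequeue(struct msqueue *q, int *out) {
    for (;;) {
        struct node *head = atomic_load(&q->head);
        struct node *tail = atomic_load(&q->tail);
        struct node *next = atomic_load(&head->next);
        if (head != atomic_load(&q->head)) continue;   /* head moved: re-read */
        if (next == NULL) return -1;                   /* queue is empty */
        if (head == tail) {
            /* Tail is lagging: help advance it before removing a node. */
            atomic_compare_exchange_strong(&q->tail, &tail, next);
            continue;
        }
        int value = next->value;
        /* CAS Head: the node after the dummy becomes the new dummy. */
        if (atomic_compare_exchange_strong(&q->head, &head, next)) {
            *out = value;
            return 0;   /* old dummy is intentionally not freed (see lead-in) */
        }
    }
}
```

In this sketch an uncontended enqueue performs the two CAS operations in sequence, whereas a dequeue performs only the CAS on the head pointer, which mirrors the cost difference discussed above.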