In computer science, a data structure is a way of storing data in memory so that it can be accessed and used efficiently. There are different kinds of data structures, each suited to different kinds of applications and/or specialized tasks. In fact, a carefully chosen data structure will allow more efficient algorithms to be used.
Queues are data structures to which data elements are added and from which they are removed, with the property that the elements are removed in the order in which they were added (known as “first-in, first-out” or FIFO). The basic operations are enqueue, which adds an element at the “tail” (the last item of a list), and dequeue, which removes an element from the “head” (the first item of a list). In software design, queues often function to decouple one or more producers of data elements from one or more consumers of the data elements. The producers and consumers are frequently different threads of execution within the same process. The single queue tying together a set of producers and consumers is maintained in memory shared by all (easily achieved by threads executing in a single process address space) and is acted upon directly by each of the different threads representing the producers and consumers.
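The FIFO property described above can be illustrated with a minimal single-threaded sketch. It assumes the C++ standard library container `std::queue`; the function name `fifo_round_trip` is hypothetical and chosen only for this illustration.

```cpp
#include <cassert>
#include <queue>
#include <vector>

// Hypothetical helper: enqueue every input value, then dequeue them all
// and return the order in which they came out.
std::vector<int> fifo_round_trip(const std::vector<int>& input) {
    std::queue<int> q;
    for (int v : input) {
        q.push(v);                 // add at the tail (enqueue)
    }
    std::vector<int> out;
    while (!q.empty()) {
        out.push_back(q.front());  // read the element at the head
        q.pop();                   // remove it from the head (dequeue)
    }
    assert(out.size() == input.size());  // nothing lost, nothing invented
    return out;
}
```

Calling `fifo_round_trip` with the values 1, 2, 3 returns them in the same order, which is the first-in, first-out behavior.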
Threads of execution (“threads”) are a way for a program to split itself into a plurality of (two or more) simultaneously running sequential tasks. Multiple threads can be executed in parallel on many computer systems. Such “multithreading” occurs by time slicing wherein a single processor switches between different threads or by multiprocessing wherein threads are executed on separate processors. Many operating systems directly support both time-sliced and multiprocessor threading with a process scheduler. The operating system kernel allows programmers to manipulate threads via the system call interface. Some implementations are called kernel threads or “lightweight” processes.
In the context of software execution, a thread is a sequence of machine instructions executed as specified by written software. It has local state in the form of registers and stack memory. Stack memory is needed to maintain values when there is more state to be kept than there are registers to keep it. In addition, because of synchronous subroutine calls (“function calls”), the stack keeps the current function state while it waits for a function it invoked to return.
Significantly, instructions are executed sequentially within a single thread. Changes to the state of memory occur one instruction at a time. Problem solving via software has been an exercise of determining how to go from some initial state to some final state via a sequence of small state changes. This simplifies reasoning about program execution since one need only consider the current state and how the next instruction modifies it.
Processes always have at least one thread. Before the advent of the notion of having multiple threads in a single process, there was no need to refer to the sequence of instruction execution separately from a process; there was a one-to-one relationship between the two. Threads also differ from processes in the way that they share resources. Specifically, threads are distinguished from traditional multi-tasking operating system processes in that such processes are typically independent, carry substantial state information, have separate address spaces, and interact only through system-provided inter-process communication mechanisms. Multiple threads, in contrast, typically share the state information of a single process, and share memory and other resources directly.
“State” can mean different things depending on the context and the level of abstraction. At the threshold, there is “state” of memory, which is simply the actual content at some particular moment in execution. Most frequently, however, programmers are interested in a small subset of memory whose content is relevant for a particular situation. For example, the “precondition” for executing a function is a requirement on the state at the point of function invocation in terms of the value of meaningful objects defined in memory at the time. “State” may also be defined at a higher level of abstraction in the sense of a finite state machine as discussed in further detail below.
Processes “own” resources allocated by the operating system including, for example, memory, threads, file handles, sockets, device handles, and windows. Significantly, processes do not share address spaces or file resources except through explicit methods such as shared memory segments. If multiple threads can exist within a process, then they share the same memory and file resources. Threads are preemptively multi-tasked if the operating system's process scheduler is preemptive. However, threads do not own any resources with the exception of a stack, thread-specific data, and a copy of the registers including the program counter.
An operating system creates a process for the purpose of running an application program. Every process has at least one thread with most operating systems allowing processes to have multiple threads. Multiple threads allow a process to perform multiple functions concurrently. Since the threads generated by a program share the same address space, one thread can access and modify data that is used by another thread. This can be problematic. On the one hand, such shared access promotes easy communication between and among threads. On the other hand, programming errors can result in one thread inadvertently overwriting data being used by another thread.
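As a minimal sketch of threads sharing one address space, the following assumes C++11 `std::thread`; the function name `fill_from_threads` is hypothetical. Each thread writes to a different element of a vector that all threads can see, so no two threads touch the same memory location and no locking is needed in this particular case.

```cpp
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical helper: n threads each write one slot of a vector that lives
// in memory shared by every thread of the process.
std::vector<int> fill_from_threads(int n) {
    assert(n >= 0);
    std::vector<int> shared(static_cast<std::size_t>(n), 0);
    std::vector<std::thread> workers;
    for (int i = 0; i < n; ++i) {
        // Each thread touches only its own element, so there is no conflict.
        workers.emplace_back([&shared, i] { shared[i] = i * i; });
    }
    for (auto& t : workers) {
        t.join();  // after join, this thread can safely read what was written
    }
    return shared;
}
```

Had two threads written the same element, the result would depend on timing, which is the inadvertent overwriting the paragraph above warns about.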
As indicated above, threads act upon a queue by invoking the two defined operations, add and remove, which add an element to the “tail” and remove an element from the “head” of the queue, respectively. The implementations of the operations expect the queue to be in a particular state when invoked and leave the queue in a particular state when the operation is completed. During execution, the operations read the state of the queue from memory, modify the value representing the state, and then write the new value back to the memory representing the queue.
With multiple threads operating on a single queue at the same time, there can be multiple operations executing simultaneously. This will generally lead to errors in modifying the queue state because one thread will read the state and begin modifying it while another thread changes the queue state to something incompatible with what the first thread is going to write back. This occurs because it is generally assumed in the implementation of operations that nothing else is modifying the queue while the operation is executing, i.e. operations execute atomically.
In concurrent programming (programs that use language constructs for concurrency, including multi-threading), atomicity of an operation is equivalent to linearizability with the additional property that none of the operation's effects are visible until after it completes.
An atomic operation has no intermediate steps visible to other threads. In operating systems, an “atomic” operation is one that is not (or cannot be) interrupted once it has started. Thus, basic instructions such as add or store are usually guaranteed by the hardware to be atomic. Some platforms also provide a pair of operations (load-linked/store-conditional) that only have an effect if they occur atomically. Such a property is used to implement “locks” in multithreaded programming as discussed below. Accordingly, atomicity is used to prevent read-write and write-write conflicts.
To avoid the above queue modification errors, threads “take turns” executing operations on a shared queue, i.e., access from multiple threads is serialized. This discipline of access is enforced in the operations by using mutual exclusion locks (“mutexes”) that block all threads but one from executing an operation. When one thread is done executing an operation, the next thread waiting is then allowed to execute.
Mutual exclusion locks are used in concurrent programming to avoid the concurrent use of shared resources. A “critical section” is a span of code that, because it manipulates shared resources such as memory, can be executed by only one thread at a time. That span is defined as that between the lock and the unlock of a mutex (other synchronization primitives, such as semaphores, can also be used). Critical sections are necessary because a thread can be switched out at any time, thereby offering another thread the opportunity to change shared data of the first thread. As readily seen, such switching may lead to inconsistent data.
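The serialized access described above can be sketched as a queue whose operations are each wrapped in a critical section. This is an illustrative sketch only, assuming C++17 (`std::mutex`, `std::optional`); the names `LockedQueue` and `count_after_concurrent_adds` are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <mutex>
#include <optional>
#include <queue>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical mutex-protected queue: each operation is one critical section.
template <typename T>
class LockedQueue {
public:
    void add(T value) {
        std::lock_guard<std::mutex> guard(mutex_);  // lock: critical section begins
        items_.push(std::move(value));
    }                                               // guard destroyed: unlock

    // Returns the head element, or an empty optional if the queue is empty.
    std::optional<T> remove() {
        std::lock_guard<std::mutex> guard(mutex_);
        if (items_.empty()) return std::nullopt;
        T value = std::move(items_.front());
        items_.pop();
        return value;
    }

private:
    std::mutex mutex_;
    std::queue<T> items_;
};

// Hypothetical driver: several producer threads add concurrently; the
// function then drains the queue and reports how many elements it held.
std::size_t count_after_concurrent_adds(int producers, int per_producer) {
    assert(producers >= 0 && per_producer >= 0);
    LockedQueue<int> q;
    std::vector<std::thread> threads;
    for (int p = 0; p < producers; ++p) {
        threads.emplace_back([&q, per_producer] {
            for (int i = 0; i < per_producer; ++i) q.add(i);
        });
    }
    for (auto& t : threads) t.join();
    std::size_t n = 0;
    while (q.remove()) ++n;  // an engaged optional converts to true
    return n;
}
```

Because only one thread at a time can hold the mutex, no element is lost, but the producers also never make progress on the queue in parallel.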
In summary, with the exception of the additional overhead incurred from mutex implementation, serial access to queues is not much of a problem in single processor systems from a resource allocation standpoint. However, when the process is executing on a multiprocessor system, such serialization of operation execution reduces the gain in throughput that would have otherwise occurred by having the threads executing on the multiple processors simultaneously.
To achieve simultaneous execution, queue operations must not require serialization. This can be achieved by requiring a consistent state between each atomic machine instruction, rather than only between complete operations. This requirement can be somewhat relaxed to requiring a consistent state between each atomic modification to the shared queue state within operation execution.
Algorithms that permit multiple simultaneous executions on a shared object, such as a queue, are known as “lock-free” (generally because they avoid the use of mutex locks, as discussed above). In computer science, an algorithm is understood as a set of defined instructions for accomplishing a task which, given an initial state, will terminate (produce an answer after running for a finite number of steps) in a corresponding recognizable end-state. Algorithms are, of course, essential to how computers process information since, in essence, an algorithm simply tells the computer what steps to perform (and in what order) to carry out a specified task. Thus, an algorithm is considered to be any sequence of operations which can be performed by a Turing-complete system (i.e. a programming language or any other logical system that has computational power equivalent to a universal Turing machine).
In contrast to algorithms that protect access to shared data with locks, “lock-free” algorithms are specially designed to allow multiple threads to read and write shared data concurrently without corrupting it. An algorithm is said to be “wait-free” if every thread will continue to make progress in the face of arbitrary delay (or even failure) of other threads. Thus, a “wait-free” algorithm can complete any operation in a finite number of its own steps, regardless of the actions, timing, interleaving, or speed of other threads. By contrast, a “lock-free” algorithm requires only that some thread always make progress. “Lock-free” thus refers to the fact that a thread cannot lock up, i.e. every step it takes brings progress to the system. This means that no synchronization primitives such as mutexes or semaphores can be involved, as a lock-holding thread can prevent global progress if it is switched out. As readily seen, a lock-free algorithm is not necessarily wait-free.
By necessity, lock-free manipulation of shared object state still requires a read-modify-write sequence. That is, read the object state in shared memory into local memory (registers ultimately), modify the values according to changes being made, and then write the values back to the shared object memory. To avoid the potential inconsistency caused by multiple threads making changes at the same time, the changes are written back only if the shared state has not changed since it was read by the thread attempting to make the change. This check of whether the state has changed, however, necessarily requires a read instruction, compare instruction, branch instruction, and write instruction. Performed as separate instructions, these steps raise the same problem referenced above, i.e., another thread can write the shared memory after it was read and before it was written.
In response to this problem, hardware designers have included special instructions known as conditional synchronization primitives that atomically perform the read-compare-branch-write as a single hardware instruction. There are two common types of these instructions: compare-and-swap (CAS) and load-linked/store-conditional (LL/SC). The CAS instruction atomically compares the contents of a memory location to a given value and, if they are the same, modifies the contents of that memory location to a given new value. More specifically, the CAS instruction takes three arguments: a memory address, the expected value, and the new value; sets the memory address to the new value if the memory has the expected value; and returns a Boolean value, depending on whether the value at the address was changed. The CAS instruction is used to implement higher level synchronization mechanisms, such as semaphores and mutexes, in addition to being used to implement lock-free algorithms.
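The three-argument CAS described above can be sketched with the C++11 standard library, which exposes the operation as `std::atomic<T>::compare_exchange_strong`. The wrapper name `compare_and_swap` and the walk-through function are hypothetical; whether the call compiles to a single hardware instruction depends on the platform.

```cpp
#include <atomic>
#include <cassert>

// Hypothetical wrapper with the shape described in the text: an address,
// an expected value, and a new value; returns whether the swap happened.
bool compare_and_swap(std::atomic<int>& location, int expected, int new_value) {
    // On failure the library writes the observed value into `expected`;
    // it is passed by value here, so the caller's copy is left untouched.
    return location.compare_exchange_strong(expected, new_value);
}

// Hypothetical walk-through: one CAS that succeeds, then one that fails.
int cas_walkthrough() {
    std::atomic<int> word{5};
    bool first = compare_and_swap(word, 5, 7);   // word holds 5, so it becomes 7
    bool second = compare_and_swap(word, 5, 9);  // word now holds 7: no change
    assert(first && !second);
    return word.load();
}
```

The second call fails because the memory no longer holds the expected value, so the location keeps the value written by the first call.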
The load-linked/store-conditional (LL/SC) primitive is a pair of instructions. Load-linked loads a value into a register from memory. A subsequent store-conditional instruction assigns to that memory a new value if the memory has not changed since the load-linked and returns a Boolean value depending on whether the assignment took place.
Conditional synchronization primitives are limited to acting on a single word of memory. Therefore, lock-free algorithms must be designed such that critical transitions, i.e. from one consistent state to another, can be effected by the modification of shared state contained within that single word of memory.
Given how conditional synchronization primitives operate, the general approach in lock-free algorithms is to: (1) read shared state into local memory (typically registers); (2) modify values in local memory to effect the desired operation; and (3) attempt to write back the changed values to the shared memory using the CAS instruction. If the CAS instruction fails, i.e. some other thread modified the shared state between the read and the CAS, the operation loops back to try again, starting with reading in the updated values of the shared state.
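The read, modify, and CAS-with-retry steps above can be sketched on a single shared word. The example is a hypothetical “increment up to a limit” operation, not any of the queue algorithms cited in this document; it assumes C++11 `std::atomic`, and the names `bounded_increment` and `run_bounded_increment` are illustrative.

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

// Hypothetical lock-free operation: add 1 to `shared` unless it has already
// reached `limit`. Returns true if this call performed an increment.
bool bounded_increment(std::atomic<int>& shared, int limit) {
    int observed = shared.load();            // (1) read shared state
    for (;;) {
        if (observed >= limit) return false;
        int updated = observed + 1;          // (2) modify the local copy
        // (3) attempt to write back. If another thread changed `shared`
        // in the meantime (or the weak CAS failed spuriously), `observed`
        // is refreshed with the current value and the loop tries again.
        if (shared.compare_exchange_weak(observed, updated)) return true;
    }
}

// Hypothetical driver: several threads race to increment one counter.
int run_bounded_increment(int threads, int attempts, int limit) {
    std::atomic<int> shared{0};
    std::vector<std::thread> pool;
    for (int t = 0; t < threads; ++t) {
        pool.emplace_back([&shared, attempts, limit] {
            for (int i = 0; i < attempts; ++i) bounded_increment(shared, limit);
        });
    }
    for (auto& th : pool) th.join();
    assert(shared.load() <= limit);          // the limit is never overshot
    return shared.load();
}
```

A plain atomic add could not enforce the limit, because the decision to increment depends on the value that was read; the CAS ensures the decision and the write refer to the same observed state.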
There are general approaches to transforming a standard sequential object (i.e. data structure and associated algorithms) implementation to a lock-free design. See, for example, Maurice P. Herlihy, “A Methodology for Implementing Highly Concurrent Data Objects”, ACM Transactions on Programming Languages and Systems, 15(5):745-770, November 1993. These approaches, however, have generally exhibited poor performance, even as compared to designs using standard locking mechanisms. Accordingly, current design objectives now focus on finding and implementing lock-free algorithms specific to particular objects. See, for example, Maged M. Michael, “High Performance Dynamic Lock-Free Hash Tables and List-Based Sets”, Proc. 14th Annual ACM Symp. Parallel Algorithms and Architectures, pp. 73-82, August 2002; Maged M. Michael, “CAS-Based Lock-Free Algorithm for Shared Deques”, Proc. Ninth Euro-Par Conf. Parallel Processing, pp. 651-660, August 2003; Maged M. Michael and Michael L. Scott, “Simple, Fast and Practical Non-blocking and Blocking Concurrent Queue Algorithms,” Proc. 15th Annual ACM Symp. Principles of Distributed Computing, pp. 267-275, May 1996; and William N. Scherer III and Michael L. Scott, “Nonblocking Concurrent Data Structures with Condition Synchronization”, 18th Annual Conf. on Distributed Computing, October 2004. As readily seen, this is a fairly ad-hoc process.
Other issues also arise in the design of lock-free algorithms that must be addressed. One is known in the art as the ABA problem, which arises when a CAS instruction cannot distinguish between a memory location that has never been changed and one that has been changed but then changed back to the expected value before the CAS instruction is executed. In the latter case, assumptions the thread associated with the expected value when it read it may no longer hold, even though the CAS succeeds. A common solution to the ABA problem is to append a counter to the value in memory being updated. See, IBM System/370 Extended Architecture, Principles of Operation, 1983, Publication No. SA22-7085. The counter is incremented on each update, so even if the same value is assigned to the location, the update counter will be different.
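The counter-based remedy can be sketched by packing a value and an update counter into one word that a single CAS can cover. The layout below (32-bit value in the low half, 32-bit counter in the high half of a 64-bit word) is an assumption made for illustration, as are the names `TaggedValue` and `stale_update_is_rejected`; it presumes a platform where `std::atomic<std::uint64_t>` is available.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Hypothetical tagged word: value in the low 32 bits, update counter in the
// high 32 bits, both covered by one compare-and-swap.
class TaggedValue {
public:
    explicit TaggedValue(std::uint32_t initial) : word_(pack(initial, 0)) {}

    std::uint64_t snapshot() const { return word_.load(); }
    static std::uint32_t value_of(std::uint64_t w) { return static_cast<std::uint32_t>(w); }
    static std::uint32_t count_of(std::uint64_t w) { return static_cast<std::uint32_t>(w >> 32); }

    // Succeeds only if neither the value nor the counter changed since `seen`
    // was read. The counter wraps after 2^32 updates; this sketch ignores that.
    bool try_update(std::uint64_t seen, std::uint32_t new_value) {
        std::uint64_t desired = pack(new_value, count_of(seen) + 1);
        return word_.compare_exchange_strong(seen, desired);
    }

private:
    static std::uint64_t pack(std::uint32_t value, std::uint32_t count) {
        return (static_cast<std::uint64_t>(count) << 32) | value;
    }
    std::atomic<std::uint64_t> word_;
};

// Hypothetical single-threaded replay of an A -> B -> A history.
bool stale_update_is_rejected() {
    TaggedValue cell(10);
    std::uint64_t stale = cell.snapshot();             // slow thread reads A
    bool to_b = cell.try_update(cell.snapshot(), 20);  // A -> B
    bool to_a = cell.try_update(cell.snapshot(), 10);  // B -> A again
    assert(to_b && to_a);
    return !cell.try_update(stale, 30);                // counter differs: fails
}
```

The value is back to 10 when the stale update is attempted, but the counter has advanced to 2, so the CAS fails where a CAS on the bare value would have succeeded.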
Another problem associated with lock-free designs is memory reclamation. Given that multiple threads can be executing operations simultaneously, even though one thread has determined that a shared object is no longer needed, it is sometimes difficult to be certain that no other thread is attempting to access that shared object. Returning the object to the memory allocator could result in runtime errors if other threads are attempting to access the object. Solutions have been identified, but at the cost of higher complexity. See, Maged M. Michael, “Hazard Pointers: Safe Memory Reclamation of Lock-Free Objects”, IEEE Transactions on Parallel and Distributed Systems, Vol. 15, No. 6, pp. 491-504, June 2004.
One other problem associated with lock-free implementations is the impact of compiler optimization when writing in higher level languages, e.g. C++. For reference, C++ (pronounced “C plus plus”) is a general purpose computer programming language. It is a statically typed multi-paradigm language supporting procedural programming, data abstraction, object oriented programming, and generic programming, and is currently one of the most popular commercial programming languages. The optimization issue is that compilers assume serial execution. Therefore, they attempt to limit memory access by caching values in registers rather than loading from memory at each access when there is no evidence in the code of the memory location being changed. This is a reasonable assumption for strictly sequential access. However, when it is possible for multiple threads to be changing shared memory simultaneously, it is important that each read of shared memory loads the value from memory. To avoid this optimization, C++ provides the volatile type modifier, which disables this particular optimization and forces a load from memory at each access. The object in shared memory could be an instance of a type that is accessed via member functions. In such a case, member functions can be declared volatile, thus ensuring that the member function implementations will load from memory at each access.
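As a note that goes beyond the original text: since the C++11 standard, the language provides `std::atomic`, which is the standard way to make sure each read of a shared variable observes what other threads have written. In standard C++, `volatile` by itself guarantees neither atomicity nor ordering between threads. The sketch below assumes C++11 and uses the hypothetical function name `wait_for_published_value`; the short polling loop is for illustration only.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Hypothetical example: one thread publishes a value, another waits for it.
int wait_for_published_value() {
    std::atomic<bool> ready{false};
    int payload = 0;  // ordinary variable, published through `ready`

    std::thread producer([&] {
        payload = 42;
        // Release store: the write to `payload` becomes visible to any
        // thread that later observes ready == true with an acquire load.
        ready.store(true, std::memory_order_release);
    });

    // Each iteration performs a real atomic load; the compiler may not
    // replace it with a value cached in a register.
    while (!ready.load(std::memory_order_acquire)) {
        std::this_thread::yield();
    }
    producer.join();
    assert(ready.load());
    return payload;
}
```

Declaring `ready` as a plain `bool` would allow the compiler to hoist the load out of the loop, which is the optimization the paragraph above describes.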
One of the earliest efforts in lock-free algorithms was in the early 1980s when the CAS instruction was included in the above referenced IBM System/370 processor architecture. The starting point for this work was a lock-free queue that was simple, truly non-blocking, and depended on only the CAS instruction commonly available on most hardware. This queue provided excellent performance on even a single processor, along with linear speedup (i.e. scalability) on shared memory multiprocessors. However, the queue had no means of blocking a thread that makes a request on an empty queue so that it waits until a data element is added. This is an important property in most real-world applications and is generally implemented using condition variables along with mutexes.
More recently, efforts have been made to overcome the lack of condition synchronization in the above queue design. See, for example, the lock-free dual queue (S+S queue) disclosed in the above referenced paper William N. Scherer III and Michael L. Scott, “Nonblocking Concurrent Data Structures with Condition Synchronization”, 18th Annual Conf. on Distributed Computing, October 2004. As disclosed, this queue supports having threads wait when performing a remove operation on an empty queue. The waiting thread will then continue once a data element is added to the queue. An important aspect of this design is that it is a “dual” queue. That is, it will queue requesting threads when there are no data elements so that when data elements are added, the requests are filled in the order in which they were made. Thus, the first-in-first-out protocol is maintained for both the requesting threads and the data elements.
The S+S queue, however, had two major defects that made it unusable for most practical systems. First, the ability to have threads wait for a data element to be added was implemented by a busy wait, i.e. a loop that continuously checks to see if a value has changed. Busy waits are acceptable only when the delays are minimal (such as adaptive mutexes that use spin locks). If, however, the wait goes into the range of tens of milliseconds, it not only uses up resources that could be more productively used otherwise, but it also warps scheduling algorithms. Second, the S+S queue precluded any time limit from being defined on how long a thread should wait. This is important in practical systems so that a thread can fail waiting for a data element after some specified amount of time, thus, for example, allowing for recovery from late or missing events elsewhere in the system. One other issue is that the S+S design utilized such a complex algorithm that it was difficult to create an implementation that didn't have race conditions, especially in the area of memory reuse.
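For contrast with the busy wait, the two properties the S+S queue lacks (sleeping instead of spinning, and giving up after a time limit) can be sketched with the conventional mutex and condition variable approach. This is a lock-based illustration only, not a lock-free dual queue, and it does not keep waiting threads in FIFO order. It assumes C++17; the names `TimedQueue`, `remove_for`, and `remove_with_late_producer` are hypothetical.

```cpp
#include <cassert>
#include <chrono>
#include <condition_variable>
#include <mutex>
#include <optional>
#include <queue>
#include <thread>
#include <utility>

// Hypothetical lock-based queue whose remove operation blocks with a timeout.
template <typename T>
class TimedQueue {
public:
    void add(T value) {
        {
            std::lock_guard<std::mutex> guard(mutex_);
            items_.push(std::move(value));
        }
        not_empty_.notify_one();  // wake one sleeping consumer, if any
    }

    // The calling thread sleeps (it does not spin) until an element is
    // available or `timeout` elapses; an empty optional means it timed out.
    std::optional<T> remove_for(std::chrono::milliseconds timeout) {
        std::unique_lock<std::mutex> lock(mutex_);
        bool have_item = not_empty_.wait_for(
            lock, timeout, [this] { return !items_.empty(); });
        if (!have_item) return std::nullopt;
        assert(!items_.empty());
        T value = std::move(items_.front());
        items_.pop();
        return value;
    }

private:
    std::mutex mutex_;
    std::condition_variable not_empty_;
    std::queue<T> items_;
};

// Hypothetical driver: the consumer starts waiting before the element exists.
int remove_with_late_producer() {
    TimedQueue<int> q;
    std::thread producer([&q] {
        std::this_thread::sleep_for(std::chrono::milliseconds(20));
        q.add(7);
    });
    std::optional<int> got = q.remove_for(std::chrono::seconds(5));
    producer.join();
    return got.value_or(-1);
}
```

The predicate passed to `wait_for` is re-checked after every wake-up, so spurious wake-ups do not cause the consumer to return without an element.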
In summary, current software uses mutex locks that enforce the unique access of queue critical sections. Such a mutex-based approach greatly diminishes parallel efficiency on multi-processor systems. And the current best published lock-free dual queue suffers from at least two major shortcomings, namely: (1) a busy-wait implementation that consumes excessive CPU cycles; and (2) the inability to terminate the wait for a data element after a specified time-out interval.
Consequently, a need exists for an improved lock-free dual queue that overcomes the above problems.