In computer software, a “ring buffer”, otherwise known as a “circular buffer”, is a contiguous array of data cells which can contain arbitrary data. Data is inserted by “writers”, which place data into successive cells, and read by “readers”, which examine the cells in order. The key defining characteristic that makes a buffer a “ring buffer” is that, on reaching the last element of the array, the writer and reader each independently loop back to the beginning of the array. Thus, a ring buffer can be thought of as an endless loop with the reader trailing behind the writer. FIG. 1 is a block diagram illustrating a single-writer, single-reader ring buffer 100. The single-writer, single-reader ring buffer 100 includes a contiguous array of memory cells 110 together with two indices, pointers, or counters 140, 150 used in a circular or ring-like fashion. Data values are placed sequentially in the cells 110 until the end of the array is reached, whereupon the placement “circles” back to the beginning of the array. The two indices 140, 150 typically follow a well-known algorithm for single-reader, single-writer queues. Ring buffers are sometimes described as first-in, first-out (“FIFO”) queues (i.e., queues in which elements are removed in the same order in which they are added), but the more common meaning of “queue” is a list-based data structure which can expand to an arbitrary size. A ring buffer, on the other hand, is limited in size to a fixed number of data cells 110.
Ring buffers are commonly used in computers and data processing systems for passing information from one program, process, or thread to another. For example, a writer 120 may put references to messages into a ring buffer 100 as they are received. A reader 130 may then read these references and so access the messages for further processing. As long as there is one writer 120 and one reader 130, the implementation of a lock-free ring buffer 100 is well known. The writer 120 puts data into the ring buffer 100 while making sure that it does not overtake the reader 130. The reader 130 accesses the data while ensuring that it does not get ahead of the writer 120. Likewise, solutions exist for non-locking access to list-based queues. Unfortunately, these do not apply to ring buffers, for which no effective lock-free solution currently exists, as will be discussed below.
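The well-known single-writer, single-reader algorithm referred to above can be sketched as follows. This is an illustrative sketch only; the class and method names (`SpscRing`, `try_put`, `try_get`) and the use of C++ atomics are assumptions, not part of the original description. The writer advances only the write index and the reader advances only the read index, so neither can overtake the other:

```cpp
#include <atomic>
#include <cstddef>

// Minimal sketch of a single-writer, single-reader lock-free ring buffer.
// One cell is deliberately kept free so that a full buffer can be
// distinguished from an empty one without a shared element count.
template <typename T, std::size_t N>
class SpscRing {
    T cells_[N];                        // contiguous array of data cells
    std::atomic<std::size_t> write_{0}; // advanced only by the writer
    std::atomic<std::size_t> read_{0};  // advanced only by the reader

public:
    // Writer side: makes sure it does not overtake the reader.
    bool try_put(const T& value) {
        std::size_t w = write_.load(std::memory_order_relaxed);
        std::size_t next = (w + 1) % N;
        if (next == read_.load(std::memory_order_acquire))
            return false;               // buffer full
        cells_[w] = value;
        write_.store(next, std::memory_order_release);
        return true;
    }

    // Reader side: makes sure it does not get ahead of the writer.
    bool try_get(T& out) {
        std::size_t r = read_.load(std::memory_order_relaxed);
        if (r == write_.load(std::memory_order_acquire))
            return false;               // buffer empty
        out = cells_[r];
        read_.store((r + 1) % N, std::memory_order_release);
        return true;
    }
};
```

Because each index has exactly one updating thread, no locking or compare-and-swap is needed; the acquire/release loads and stores suffice to publish each cell to the other side.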
Problems are encountered when there is more than one writer 120 and/or more than one reader 130 in a multi-threaded, concurrent, shared memory environment. This situation is shown in FIG. 2. FIG. 2 is a block diagram illustrating a multi-reader, multi-writer ring buffer 200. In the case of multiple writers 220, 221, care must be taken to ensure that two writers 220, 221 accessing the ring buffer simultaneously do not write into the same slot 210. If they do, one of the references will be lost. In the case of multiple readers 230, 231, care must be taken to ensure that two readers 230, 231 accessing the ring buffer simultaneously do not read the same slot 240. If they do, the reference will be read twice instead of once, resulting in duplication.
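The lost-reference hazard arises because the naive write path is a multi-step sequence: load the shared write index, store into the corresponding cell, then advance the index. The sketch below (all names are illustrative assumptions) marks the window in which two writers can interleave; it behaves correctly only when called from a single thread:

```cpp
#include <atomic>
#include <cstddef>

// Sketch of the *unsafe* multi-writer write path that FIG. 2 warns
// against. Two writers that both load the shared write index before
// either advances it will both store into the same slot, losing one
// reference; the index then advances past a slot that was never filled.
constexpr std::size_t kSize = 8;
int cells[kSize];
std::atomic<std::size_t> write_index{0};

void unsafe_put(int value) {
    // Step 1: a second writer may load the same value of 'w' here ...
    std::size_t w = write_index.load();
    // Step 2: ... so both writers store into cells[w]; one value is lost.
    cells[w % kSize] = value;
    // Step 3: the index is advanced twice, skipping a never-written slot.
    write_index.store(w + 1);
}
```

Even though each individual load and store is atomic, the three steps together are not, which is exactly the critical-section problem described below.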
The problem encountered with multiple readers 230, 231 and multiple writers 220, 221 is greatest in environments with a large degree of parallelism (e.g., such as in today's multi-core processors), where a large amount of modularization exists (e.g., such as when processing a protocol stack one level at a time), and in systems requiring very low latency (e.g., such as real-time data communications and operating systems applications).
For reference, a data structure implementation is said to be “lock-free” if it guarantees that after a finite number of steps of any thread operating on the data structure, some thread (not necessarily the same one) operating on the data structure completes. A “thread”, short for a “thread of execution”, is a set of instructions being interpreted (i.e., executed) by a central processing unit (“CPU”) or CPU core. A thread usually has some small amount of private (i.e., thread-local) memory, and otherwise shares most memory with other threads. A “multi-threaded shared memory model” is a common model for recent multi-core CPUs in which each CPU executes one or more threads and many of the threads share a single memory address space. Note that it is quite common for more than one thread to execute the same set of instructions at different positions in the instructions and with different private (i.e., thread-local) memory. An “index” into a ring buffer is a number ranging in value from zero to the size of the ring buffer minus one. A compare-and-swap (“CAS”) operation (e.g., an atomic (i.e., indivisible) CAS) is a computer instruction typically implemented on recent general purpose processors. A load-linked/store-conditional (“LL/SC”) pair is a pair of computer instructions, available on some general purpose processors, which can be used to replace the CAS instruction. A “critical section” is a section of instructions for a given thread that must be executed (from the viewpoint of any other threads) as if all the instructions happened without intervening actions from other threads.
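The CAS operation defined above can be illustrated with C++'s `std::atomic` interface, which exposes it as `compare_exchange_weak`/`compare_exchange_strong`. The helper name `try_claim_slot` and the retry-loop pattern are assumptions for illustration; the point is that CAS atomically replaces a value only if it still equals the expected value, so exactly one of several racing threads wins each index:

```cpp
#include <atomic>
#include <cstddef>

// Sketch: claiming a unique ring-buffer index with CAS. Each racing
// thread repeatedly attempts to advance the shared index by one; the
// CAS succeeds for exactly one thread per index value.
std::atomic<std::size_t> shared_index{0};

std::size_t try_claim_slot(std::size_t ring_size) {
    std::size_t expected = shared_index.load();
    // On failure, compare_exchange_weak reloads 'expected' with the
    // current value, so the loop retries with fresh information.
    while (!shared_index.compare_exchange_weak(expected, expected + 1)) {
        // spin until our increment wins
    }
    return expected % ring_size;  // an index in [0, ring_size - 1]
}
```

On processors offering LL/SC rather than CAS, the same loop is typically expressed as a load-linked followed by a store-conditional that fails if the location was modified in between.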
Several lock-free algorithms have been proposed in the literature. For example, Lamport (Leslie Lamport, “Concurrent Reading and Writing”, Communications of the ACM, Vol. 20, No. 11, November 1977, which is incorporated herein by reference) took a very early look at concurrent reading and writing and identified some of the problems. Herlihy and Wing (Maurice P. Herlihy and Jeannette M. Wing, “Linearizability: A Correctness Condition for Concurrent Objects”, ACM Transactions on Programming Languages and Systems, Vol. 12, No. 3, July 1990, which is incorporated herein by reference) defined a correctness condition for concurrent data structures that has been used by almost every subsequent publication in the area. Herlihy (Maurice P. Herlihy, “Wait-Free Synchronization”, ACM Transactions on Programming Languages and Systems, Vol. 13, No. 1, January 1991, which is incorporated herein by reference) proved that the then-popular synchronization instructions were inadequate, and went on to show that the CAS instruction was “universal” in that it could be used to simulate any desired data structure, although very inefficiently. Michael and Scott (Maged M. Michael and Michael L. Scott, “Simple, Fast, and Practical Non-Blocking and Blocking Concurrent Queue Algorithms”, PODC'96, Philadelphia, Pa., USA, which is incorporated herein by reference) gave the first reasonable implementation of lock-free list-based queues, but there remained problems with the associated lock-free memory management as seen in Herlihy et al. (Maurice Herlihy, Victor Luchangco, Paul Martin, and Mark Moir, “Nonblocking Memory Management Support for Dynamic-Sized Data Structures”, ACM Transactions on Computer Systems, Vol. 23, No. 2, May 2005, which is incorporated herein by reference), and even there the proposed solution requires more time and space than desirable.
The demonstrated difficulty of obtaining correct algorithms has led to investigation and use of alternatives to CAS, or simulation of the alternatives by CAS, as by Doherty, Herlihy, Luchangco and Moir (Simon Doherty, Maurice P. Herlihy, Victor Luchangco and Mark Moir, “Bringing Practical Lock-Free Synchronization to 64-Bit Applications”, PODC'04, Jul. 25-28, 2004, St. John's, Newfoundland, Canada, which is incorporated herein by reference). The difficulty is also discussed by Doherty et al. (Simon Doherty, David L. Detlefs, Lindsay Groves, Christine H. Flood, Victor Luchangco, Paul A. Martin, Mark Moir, Nir Shavit and Guy L. Steele Jr., “DCAS is not a Silver Bullet for Nonblocking Algorithm Design”, SPAA'04, Jun. 27-30, 2004, Barcelona, Spain, which is incorporated herein by reference), where the development history for a double-ended list-based queue algorithm is presented, detailing the discovery of errors in the algorithm even after publication, and going on to claim that more powerful instructions than CAS are not going to make algorithm development any easier. These difficulties remain unresolved.
Given the above, it is apparent that current practice with respect to ring buffers has centered around lock-free implementations involving one writer and one reader. However, these solutions do not scale to cover the problems that arise when multiple writers or multiple readers are involved.
Two current ways of achieving lock-free access to a ring buffer in a multi-reader, multi-writer environment are as follows. The first is to provide a ring buffer for every writer/reader pair (i.e., to turn the problem back into a single-reader/single-writer environment). This is disadvantageous, however, as it involves the use of many ring buffers (i.e., in the worst case N²) and an associated large increase in the amount of scheduling needed in order to decide which thread to execute. The second is to define the piece of software which does the actual accessing of the ring buffer as a “critical section” and to use a “mutex” to protect the critical section. A mutex (e.g., a semaphore) is a mechanism for ensuring “mutual exclusion” as a means of implementing critical sections. While execution is taking place within this critical section, all other threads attempting to access the ring buffer will be blocked. This is disadvantageous, however, as it works at the expense of blocking concurrent access to the ring buffer and therefore increases latency.
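The second workaround, a mutex-protected critical section, can be sketched as follows. The names (`LockedRing`, `put`, `get`) and the use of a standard C++ mutex are illustrative assumptions; the essential point is that every access serializes on one lock, so concurrent threads block rather than race:

```cpp
#include <cstddef>
#include <mutex>

// Sketch of a multi-reader, multi-writer ring buffer protected by a
// mutex. Correct, but every put/get is a critical section: all other
// threads touching the buffer are blocked while it runs.
template <typename T, std::size_t N>
class LockedRing {
    T cells_[N];
    std::size_t write_ = 0, read_ = 0, count_ = 0;
    std::mutex mutex_;  // implements mutual exclusion

public:
    bool put(const T& value) {
        std::lock_guard<std::mutex> guard(mutex_);  // critical section begins
        if (count_ == N) return false;              // full
        cells_[write_] = value;
        write_ = (write_ + 1) % N;
        ++count_;
        return true;
    }   // critical section ends when 'guard' is destroyed

    bool get(T& out) {
        std::lock_guard<std::mutex> guard(mutex_);
        if (count_ == 0) return false;              // empty
        out = cells_[read_];
        read_ = (read_ + 1) % N;
        --count_;
        return true;
    }
};
```

Note that, unlike the lock-free single-reader/single-writer scheme, the lock permits an explicit element count, so all N cells can be used; the cost is the blocking and added latency described above.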
A need therefore exists for an improved multi-reader, multi-writer lock-free ring buffer. Accordingly, a solution that addresses, at least in part, the above and other shortcomings is desired.