In numerous applications, queues are used for communication or data exchange between components of a system. Queues generally allow components that generate data to be decoupled from components that process or consume data, so that these components can operate in parallel within the system. Typical fields of application for queues are client-server systems, operating systems, network switching stations, SCADA (Supervisory Control and Data Acquisition) systems, telecommunications systems, image processing applications and consumer electronics systems. For example, in SCADA systems, the data elements coming from sensors are collected, grouped and subsequently conveyed via queues to the processes responsible for evaluation, monitoring and visualization.
As a rule, known queues have FIFO (First-In, First-Out) semantics. This means that the data elements enqueued first are also the first to be dequeued. Two basic operations are available for accessing queues, namely a push operation and a tryPop operation. The push operation enqueues a data element, whereas the tryPop operation dequeues a data element provided that the respective queue is not empty.
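The FIFO semantics and the two basic operations can be illustrated by a minimal single-threaded sketch. The class name FifoQueue, the use of std::deque as storage and the exact signatures, in particular tryPop returning an empty std::optional when the queue is empty, are assumptions made for illustration only and are not prescribed by the description above.

```cpp
#include <cassert>
#include <deque>
#include <optional>
#include <utility>

// Hypothetical, single-threaded queue with FIFO semantics (illustration only).
template <typename T>
class FifoQueue {
public:
    // push: enqueue a data element at the tail of the queue.
    void push(T value) { items_.push_back(std::move(value)); }

    // tryPop: dequeue the oldest data element, or return an empty optional
    // if the queue holds no element.
    std::optional<T> tryPop() {
        if (items_.empty()) return std::nullopt;
        T front = std::move(items_.front());
        items_.pop_front();
        return front;
    }

private:
    std::deque<T> items_;
};

// Usage: elements leave the queue in the order in which they entered it.
inline void fifoQueueExample() {
    FifoQueue<int> q;
    q.push(1);
    q.push(2);
    assert(q.tryPop().value() == 1);
    assert(q.tryPop().value() == 2);
    assert(!q.tryPop().has_value());
}
```

This sketch contains no synchronization and is therefore only suitable for access by a single thread.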
A thread designates a thread of execution, i.e. a sequence of execution in the processing of a program. A thread forms a part of a so-called process. A process is allocated an address space and further operating system resources of the respective system. A process can contain a number of threads or only a single thread. Within a process, various threads can share processors or processor cores, a data memory and other operating-system-dependent resources such as files and network links. The efficiency advantage of threads is that, in contrast to a change of process, a change of thread does not require a complete change of the process context, since the threads use a common part of the process context. The sequence of execution and the change between threads are controlled by a scheduler. The scheduler forms an arbitration logic which controls the temporal execution of a number of processes or threads in an operating system.
If a number of threads access a queue, the consistency of the data must be safeguarded. This requires a synchronization mechanism which guarantees that changes to the queue are carried out without interruption. However, in parallel computer systems which contain a number of processors or processor cores, this leads to a bottleneck since the accesses are ultimately performed sequentially. As a consequence, the scalability of the applications is greatly restricted.
Furthermore, the memory accesses accompanying push or tryPop operations can lead to a bottleneck, especially in a non-uniform memory access (NUMA) system. Such systems have a common address space but physically separate memories. Since, as a rule, accesses to remote memory areas take much longer than accesses to local memory, high data locality is decisive for high system performance, especially in NUMA systems.
There are numerous approaches for the implementation of queues as described, for example, in M. Herlihy and N. Shavit “The Art of Multiprocessor Programming”, Morgan Kaufmann, 2008. In the simplest case, to ensure the consistency of the data, the queue is locked globally with each access, for example by means of a so-called mutex (Mutual Exclusion). This ensures that at any time only one thread can enqueue or dequeue a data element. A disadvantage of this known approach (exclusive approach) is the lack of scalability since all accesses are serialized.
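The exclusive approach can be sketched as follows. The class name MutexQueue and the signatures are hypothetical; the sketch merely assumes that a single std::mutex guards every access to an underlying std::deque.

```cpp
#include <cassert>
#include <deque>
#include <mutex>
#include <optional>
#include <utility>

// Hypothetical queue in which one global mutex serializes all accesses.
template <typename T>
class MutexQueue {
public:
    void push(T value) {
        std::lock_guard<std::mutex> guard(lock_);  // excludes all other threads
        items_.push_back(std::move(value));
    }

    std::optional<T> tryPop() {
        std::lock_guard<std::mutex> guard(lock_);  // same lock as in push
        if (items_.empty()) return std::nullopt;
        T front = std::move(items_.front());
        items_.pop_front();
        return front;
    }

private:
    std::mutex lock_;      // the single, global lock
    std::deque<T> items_;  // only accessed while lock_ is held
};

// Usage: the interface is unchanged, only the locking is added.
inline void mutexQueueExample() {
    MutexQueue<int> q;
    q.push(7);
    assert(q.tryPop().value() == 7);
    assert(!q.tryPop().has_value());
}
```

Since push and tryPop acquire the same lock, a source thread and a sink thread can never work on the queue at the same time, which is the reason for the lack of scalability mentioned above.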
A further known approach, called the blocking approach, uses separate locks for enqueuing and dequeuing data elements. This means that one data element can be enqueued while another data element is dequeued at the same time. However, it is not possible for a number of source threads to each enqueue a data element at the same time, or for a number of sink threads to each dequeue a data element at the same time. For this reason, the scalability is greatly restricted with this known approach as well.
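One common way of realizing separate locks for enqueuing and dequeuing is a linked list with a dummy node, in the style of the well-known two-lock queue algorithm. The following is a minimal sketch under that assumption; the class name TwoLockQueue, the member names and the use of an atomic next pointer (needed in C++ because a source thread and a sink thread may touch the same node when the queue is empty) are illustrative choices.

```cpp
#include <atomic>
#include <cassert>
#include <mutex>
#include <optional>
#include <utility>

// Hypothetical queue with one lock for the head and one lock for the tail.
template <typename T>
class TwoLockQueue {
    struct Node {
        std::optional<T> value;
        std::atomic<Node*> next{nullptr};
    };

public:
    TwoLockQueue() : head_(new Node()), tail_(head_) {}  // starts with a dummy node
    TwoLockQueue(const TwoLockQueue&) = delete;
    TwoLockQueue& operator=(const TwoLockQueue&) = delete;

    ~TwoLockQueue() {
        while (head_ != nullptr) {
            Node* next = head_->next.load();
            delete head_;
            head_ = next;
        }
    }

    void push(T value) {
        Node* node = new Node();
        node->value = std::move(value);
        std::lock_guard<std::mutex> guard(tailLock_);  // only source threads compete here
        tail_->next.store(node, std::memory_order_release);
        tail_ = node;
    }

    std::optional<T> tryPop() {
        Node* oldDummy = nullptr;
        std::optional<T> result;
        {
            std::lock_guard<std::mutex> guard(headLock_);  // only sink threads compete here
            Node* first = head_->next.load(std::memory_order_acquire);
            if (first == nullptr) return std::nullopt;  // queue is empty
            result = std::move(first->value);
            first->value.reset();
            oldDummy = head_;
            head_ = first;  // the dequeued node becomes the new dummy
        }
        delete oldDummy;
        return result;
    }

private:
    Node* head_;  // guarded by headLock_
    Node* tail_;  // guarded by tailLock_
    std::mutex headLock_;
    std::mutex tailLock_;
};

// Usage: same interface as before, FIFO order is preserved.
inline void twoLockQueueExample() {
    TwoLockQueue<int> q;
    q.push(1);
    q.push(2);
    assert(q.tryPop().value() == 1);
    assert(q.tryPop().value() == 2);
    assert(!q.tryPop().has_value());
}
```

In this sketch one push and one tryPop can proceed concurrently, but two push operations still contend for tailLock_ and two tryPop operations for headLock_, which corresponds to the restriction described above.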
Other known approaches attempt to manage without such locks in order to minimize the synchronization effort. These approaches operate exclusively with so-called atomic operations, i.e. uninterruptible operations, and thus belong to the class of non-blocking algorithms. In the method specified in M. M. Michael and M. L. Scott “Simple, fast and practical non-blocking and blocking concurrent queue algorithms”, Symposium on Principles of Distributed Computing, pages 267 to 275, 1996 (non-blocking approach), the data elements are managed in the form of a list, the list operations being performed atomically. In this way, a number of source and sink threads can access a queue virtually simultaneously. However, this known approach, which uses atomic operations instead of locks, can also lead to a bottleneck due to cache coherence, particularly if the threads involved access memory areas which are close to one another.
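A list-based queue that relies only on atomic compare-and-swap operations can be sketched as follows. This is a simplified illustration in the spirit of the non-blocking approach, not a reproduction of the published algorithm: as a loudly stated assumption, dequeued nodes are not freed before the queue itself is destroyed, which avoids the memory reclamation and ABA problems at the cost of growing memory consumption. The class name LockFreeQueue and all member names are hypothetical.

```cpp
#include <atomic>
#include <cassert>
#include <optional>
#include <utility>

// Hypothetical non-blocking queue; nodes are freed only in the destructor (assumption).
template <typename T>
class LockFreeQueue {
    struct Node {
        std::optional<T> value;
        std::atomic<Node*> next{nullptr};
        Node() = default;
        explicit Node(T v) : value(std::move(v)) {}
    };

public:
    LockFreeQueue() : origin_(new Node()), head_(origin_), tail_(origin_) {}
    LockFreeQueue(const LockFreeQueue&) = delete;
    LockFreeQueue& operator=(const LockFreeQueue&) = delete;

    ~LockFreeQueue() {  // walks the whole chain, including already dequeued nodes
        for (Node* n = origin_; n != nullptr;) {
            Node* next = n->next.load();
            delete n;
            n = next;
        }
    }

    void push(T value) {
        Node* node = new Node(std::move(value));
        for (;;) {
            Node* tail = tail_.load();
            Node* next = tail->next.load();
            if (next != nullptr) {  // tail lags behind: help to advance it
                tail_.compare_exchange_weak(tail, next);
                continue;
            }
            Node* expected = nullptr;
            if (tail->next.compare_exchange_weak(expected, node)) {  // link the new node
                tail_.compare_exchange_strong(tail, node);  // may fail if another thread helped
                return;
            }
        }
    }

    std::optional<T> tryPop() {
        for (;;) {
            Node* head = head_.load();
            Node* tail = tail_.load();
            Node* next = head->next.load();
            if (next == nullptr) return std::nullopt;  // queue is empty
            if (head == tail) {  // tail lags behind: help to advance it
                tail_.compare_exchange_weak(tail, next);
                continue;
            }
            if (head_.compare_exchange_weak(head, next)) {
                // Only the thread that won the swap reads this value.
                std::optional<T> result = std::move(next->value);
                next->value.reset();
                return result;
            }
        }
    }

private:
    Node* const origin_;       // first dummy node, kept for cleanup
    std::atomic<Node*> head_;  // current dummy node
    std::atomic<Node*> tail_;  // last node, or the one before it
};

// Usage: same interface as the lock-based variants.
inline void lockFreeQueueExample() {
    LockFreeQueue<int> q;
    q.push(1);
    q.push(2);
    assert(q.tryPop().value() == 1);
    assert(q.tryPop().value() == 2);
    assert(!q.tryPop().has_value());
}
```

Even in this sketch all threads operate on the same two atomic pointers, head_ and tail_, so the corresponding cache lines are contended by every participating processor core.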
Various known variants and refinements of the fundamental approaches described above have in common that all of them operate on a global data structure, for example a list or an array. This inevitably leads to a bottleneck since safeguarding the data consistency always requires a minimum amount of synchronization. In addition, the problem of accesses to remote memory areas, addressed above, occurs particularly in NUMA systems.