1. Field of the Disclosure
This disclosure relates generally to hierarchical locks, and more particularly to systems and methods for implementing NUMA-aware hierarchical locks.
2. Description of the Related Art
In a multiprocessor environment with threads and preemptive scheduling, threads can participate in a mutual exclusion protocol through the use of lock or “mutex” constructs. A mutual exclusion lock can either be in a locked state or an unlocked state, and only one thread can hold or own the lock at any given time. The thread that owns the lock is permitted to enter a critical section of code protected by the lock or otherwise access a shared resource protected by the lock. If a second thread attempts to obtain ownership of a lock while the lock is held by a first thread, the second thread will not be permitted to proceed into the critical section of code (or access the shared resource) until the first thread releases the lock and the second thread successfully claims ownership of the lock.
Queue locks, such as CLH locks and MCS-style queue locks, have historically been the algorithms of choice for locking in many high performance systems. These locks have been shown to reduce overall invalidation traffic in some high performance systems by forming queues of threads, each spinning on a separate memory location as they await their turn to access a critical section of code or shared resource protected by a shared lock.
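By way of illustration only, the queue-forming behavior described above may be sketched as a minimal MCS-style lock in C++. The names McsNode, McsLock, and mcs_demo are hypothetical and are introduced here solely for this example; the sketch uses default (sequentially consistent) atomics for simplicity and is not intended to represent any particular production implementation.

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

// Hypothetical per-thread queue node: each waiter spins only on its own flag.
struct McsNode {
    std::atomic<McsNode*> next{nullptr};
    std::atomic<bool> waiting{false};
};

// Minimal MCS-style queue lock (illustrative sketch, seq_cst atomics).
class McsLock {
public:
    void lock(McsNode& me) {
        me.next.store(nullptr);
        me.waiting.store(true);
        // Atomically append this node to the tail of the queue.
        McsNode* prev = tail_.exchange(&me);
        if (prev != nullptr) {
            prev->next.store(&me);        // link behind the predecessor
            while (me.waiting.load()) {   // spin on our own location only
                std::this_thread::yield();
            }
        }
    }

    void unlock(McsNode& me) {
        McsNode* succ = me.next.load();
        if (succ == nullptr) {
            McsNode* expected = &me;
            // No visible successor: try to mark the queue empty.
            if (tail_.compare_exchange_strong(expected, nullptr)) {
                return;
            }
            // A successor swapped itself in but has not linked yet; wait.
            while ((succ = me.next.load()) == nullptr) {
                std::this_thread::yield();
            }
        }
        succ->waiting.store(false);       // hand the lock to the successor
    }

private:
    std::atomic<McsNode*> tail_{nullptr};
};

// Demo helper (hypothetical): each worker adds `iters` to a plain counter.
long mcs_demo(int threads, int iters) {
    McsLock lock;
    long counter = 0;  // protected by `lock`
    std::vector<std::thread> pool;
    for (int t = 0; t < threads; ++t) {
        pool.emplace_back([&lock, &counter, iters] {
            McsNode node;  // this thread's queue node
            for (int i = 0; i < iters; ++i) {
                lock.lock(node);
                ++counter;
                lock.unlock(node);
            }
        });
    }
    for (auto& th : pool) th.join();
    return counter;
}
```

In this sketch, the only location written by every contending thread is the queue tail; while waiting, each thread reads only the flag in its own node, which is the property credited above with reducing invalidation traffic.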
Current trends in multicore architecture design imply that in coming years, there will be an accelerated shift away from simple bus-based designs towards distributed non-uniform memory-access (NUMA) and cache-coherent NUMA (CC-NUMA) architectures. Under NUMA, the memory access time for any given access depends on the location of the accessed memory relative to the processor. Such architectures typically consist of collections of computing cores with fast local memory (as found on a single multicore chip), communicating with each other via a slower (inter-chip) communication medium. In such systems, the processor can typically access its own local memory, such as its own cache memory, faster than non-local memory. In some systems, the non-local memory may include one or more banks of memory shared between processors and/or memory that is local to another processor. Access by a core to its local memory, and in particular to a shared local cache, can be several times faster than access to a remote memory (e.g., one located on another chip). Note that in various descriptions herein, the term “NUMA” may be used fairly broadly. For example, it may be used to refer to non-uniform communication architecture (NUCA) machines that exhibit NUMA properties, as well as other types of NUMA and/or CC-NUMA machines.
On large cache-coherent systems with Non-Uniform Memory Access (CC-NUMA, sometimes shortened to just NUMA), if lock ownership migrates frequently between threads executing on different nodes, the executing program can suffer from excessive coherence traffic, and, in turn, poor scalability and performance. Furthermore, this behavior can degrade the performance of other unrelated programs executing in the system.
Recent papers show that performance gains can be obtained on NUMA architectures by developing hierarchical locks, i.e., general-purpose mutual-exclusion locks that encourage threads with high mutual memory locality to acquire the lock consecutively, thus reducing the overall level of cache misses when executing instructions in a critical section of code protected by the lock. For example, one paper describes a hierarchical back-off lock (referred to herein as an HBO lock). The HBO lock is a test-and-test-and-set lock augmented with a back-off scheme to reduce contention on the lock variable. The hierarchical back-off mechanism of the HBO lock allows the back-off delay to be tuned dynamically, so that when a thread notices that another thread from its own local cluster owns the lock, it can reduce its delay and increase its chances of acquiring the lock consecutively. However, because the locks are test-and-test-and-set locks, they incur invalidation traffic on every modification of the shared global lock variable, which is especially costly on NUMA machines. Moreover, the dynamic adjustment of back-off delay time in the lock introduces significant fairness issues. For example, it becomes likely that two or more threads from the same cluster will repeatedly acquire a lock while threads from other clusters starve.
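By way of illustration only, the general idea of a test-and-test-and-set lock with cluster-sensitive back-off may be sketched in C++ as follows. This is a simplified sketch and not the published HBO algorithm: the class HboStyleLock, the caller-supplied cluster identifier, the back-off caps kLocalCap and kRemoteCap, and the helper hbo_demo are all assumptions made for this example.

```cpp
#include <algorithm>
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

// Illustrative test-and-test-and-set lock with cluster-sensitive back-off.
// The lock word holds the owner's cluster id, or kFree when unlocked.
class HboStyleLock {
public:
    void lock(int my_cluster) {
        unsigned delay = 1;
        for (;;) {
            int seen = owner_.load();  // "test": read-only check first
            if (seen == kFree) {
                int expected = kFree;
                // "test-and-set": attempt to claim the lock word.
                if (owner_.compare_exchange_strong(expected, my_cluster)) {
                    return;
                }
                seen = expected;       // another thread won the race
            }
            // Back off for less time when the owner is in our own cluster,
            // which raises the odds of a same-cluster hand-off.
            const unsigned cap = (seen == my_cluster) ? kLocalCap : kRemoteCap;
            for (unsigned i = 0; i < delay; ++i) {
                std::this_thread::yield();
            }
            delay = std::min(delay * 2, cap);
        }
    }

    void unlock() { owner_.store(kFree); }

private:
    static constexpr int kFree = -1;
    static constexpr unsigned kLocalCap = 4;    // assumed tuning value
    static constexpr unsigned kRemoteCap = 64;  // assumed tuning value
    std::atomic<int> owner_{kFree};
};

// Demo helper (hypothetical): threads are assigned to two pretend clusters.
long hbo_demo(int threads, int iters) {
    HboStyleLock lock;
    long counter = 0;  // protected by `lock`
    std::vector<std::thread> pool;
    for (int t = 0; t < threads; ++t) {
        pool.emplace_back([&lock, &counter, iters, t] {
            const int cluster = t % 2;
            for (int i = 0; i < iters; ++i) {
                lock.lock(cluster);
                ++counter;
                lock.unlock();
            }
        });
    }
    for (auto& th : pool) th.join();
    return counter;
}
```

The sketch makes both drawbacks noted above visible: every acquisition and release writes the single shared lock word, and nothing bounds how long a thread in a remote cluster may keep losing to threads that are backing off for shorter periods.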
Another paper describes a hierarchical version of the CLH queue-locking algorithm (referred to herein as an HCLH lock). The HCLH algorithm collects requests on each chip into a local CLH style queue, and then allows the thread at the head of the queue to integrate each chip's queue into a single global queue. This avoids the overhead of spinning on a shared location and prevents starvation issues. However, the algorithm forms the local queues of waiting threads by having each thread perform an atomic register-to-memory-swap (SWAP) operation on the shared head of the local queue. These SWAPs to a shared location cause a bottleneck and introduce significant overhead. For example, the thread merging the local queue into the global queue must either wait for a long period of time or merge an unacceptably short local queue into the global queue. Furthermore, the HCLH mechanism includes complex condition checks along its critical execution path in order to determine if a thread must perform the operations of merging local CLH queues with the global queue.
More recently, it has been shown that the synchronization overhead of HCLH locks can be overcome by collecting local queues using a flat-combining technique, and then splicing the local queues into the global queue. The resulting NUMA-aware locks (sometimes referred to as FC-MCS locks) can outperform HCLH type locks by a factor of two and can outperform HBO type locks by a factor of four or more, but they use significantly more memory than those other locks.
Reader-writer locks are an important category of locks that help programmers overcome the scalability issues that are common with traditional mutual exclusion locks for workloads that include a significant percentage of read-only critical sections of code. At any given time, a reader-writer lock allows one or more reader threads to own a lock in a read-only mode or just one writer thread to own the lock in a write mode. With reader-writer locks, this permission persists until it is explicitly surrendered using an unlock operation. Past research has shown that even though these locks can scale well for workloads with very high reader volumes (e.g., on the order of 99-100% reader threads), the performance quickly drops off with even a modest number of writer threads (e.g., 5-10%) competing for the lock. This drop-off can be expected to be even worse on cache-coherent NUMA architectures, where the writer threads can introduce significant inter-connect traffic and latencies to access remotely situated lock metadata and data that is accessed in a related critical section of code. A reader-writer lock might provide better performance than a traditional mutex, as the reader-writer lock can admit multi-reader (reader-reader) parallelism. However, any actual benefit would be contingent on the workload of the executing application, the availability of true parallelism, and the specific implementation of the reader-writer lock.
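By way of illustration only, the read-only and write ownership modes described above may be sketched using the C++17 standard library type std::shared_mutex. The helper rw_demo and the RwResult structure are hypothetical names introduced for this example; the sketch demonstrates the locking modes only and makes no claim about performance on any particular machine.

```cpp
#include <atomic>
#include <cassert>
#include <mutex>
#include <shared_mutex>
#include <thread>
#include <vector>

// Hypothetical result type for the demo below.
struct RwResult {
    long final_value;  // value of the shared data after all writes
    long torn_reads;   // snapshots in which a reader saw a != b
};

// One writer updates two fields together in write mode; several readers
// check, in read-only mode, that they never observe a partial update.
RwResult rw_demo(int readers, int writes) {
    std::shared_mutex rw;
    long a = 0, b = 0;          // invariant while the lock is held: a == b
    std::atomic<long> torn{0};

    std::vector<std::thread> pool;
    pool.emplace_back([&] {
        for (int i = 0; i < writes; ++i) {
            std::unique_lock<std::shared_mutex> guard(rw);  // write mode
            ++a;
            ++b;
        }   // ownership is surrendered when `guard` is destroyed
    });
    for (int r = 0; r < readers; ++r) {
        pool.emplace_back([&] {
            for (int i = 0; i < writes; ++i) {
                std::shared_lock<std::shared_mutex> guard(rw);  // read mode
                if (a != b) {
                    ++torn;
                }
            }
        });
    }
    for (auto& th : pool) th.join();
    return RwResult{a, torn.load()};
}
```

Multiple readers may hold the shared_lock at the same time, whereas the unique_lock excludes both readers and other writers; whether this yields any speedup over a plain mutex depends, as noted above, on the workload and on the underlying implementation.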