1. Field of the Disclosure
This disclosure relates generally to hierarchical locks, and more particularly to systems and methods for using flat combining to build hierarchical queue-based locks.
2. Description of the Related Art
Queue locks, such as CLH-style and MCS-style locks, have historically been the algorithms of choice for locking in many high-performance systems. These locks are known to reduce overall invalidation traffic in high-performance systems by forming queues of threads, each spinning on a separate memory location as it awaits its turn to access a critical section or shared resource protected by a shared lock. Current trends in multicore architecture design imply that in coming years, there will be an accelerated shift away from simple bus-based designs towards distributed non-uniform memory-access (NUMA) and cache-coherent NUMA (CC-NUMA) architectures. Under NUMA, the memory access time for any given access depends on the location of the accessed memory relative to the processor. Such architectures typically consist of collections of computing cores with fast local memory (as found on a single multicore chip), communicating with each other via a slower (inter-chip) communication medium. In such systems, a processor can typically access its own local memory, such as its own cache memory, faster than non-local memory. In some systems, the non-local memory may include one or more banks of memory shared between processors and/or memory that is local to another processor. Access by a core to its local memory, and in particular to a shared local cache, can be several times faster than access to a remote memory (e.g., one located on another chip).
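By way of illustration only, the queue-lock behavior described above may be sketched as follows. This is a minimal CLH-style sketch and not a definitive implementation. The names `AtomicRef`, `QNode`, `CLHLock`, and `clh_demo` are hypothetical. Because Python exposes no hardware SWAP primitive, a small internal mutex stands in for the atomic instruction, and `time.sleep(0)` stands in for busy-waiting.

```python
import threading
import time


class AtomicRef:
    """Assumed stand-in for a hardware atomic word (Python has no SWAP)."""

    def __init__(self, value):
        self._value = value
        self._guard = threading.Lock()  # only emulates atomicity of swap()

    def swap(self, new):
        with self._guard:
            old, self._value = self._value, new
            return old


class QNode:
    """One queue entry; a successor spins on this node's 'locked' flag."""

    def __init__(self, locked):
        self.locked = locked


class CLHLock:
    def __init__(self):
        # The queue starts with a dummy node that is already released.
        self._tail = AtomicRef(QNode(locked=False))
        self._mine = threading.local()

    def acquire(self):
        node = QNode(locked=True)
        pred = self._tail.swap(node)  # enqueue with a single SWAP
        while pred.locked:            # spin on the predecessor's node only
            time.sleep(0)             # yield; a real lock would busy-wait
        self._mine.node = node

    def release(self):
        # Clearing the flag hands the lock to the successor, if any.
        self._mine.node.locked = False


def clh_demo(num_threads=4, iterations=200):
    """Hypothetical driver: returns the final value of a shared counter."""
    lock = CLHLock()
    total = [0]

    def worker():
        for _ in range(iterations):
            lock.acquire()
            total[0] += 1  # critical section
            lock.release()

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return total[0]
```

In this sketch each waiting thread reads a different `QNode`, which corresponds to the "separate memory location" property noted above. The only location written by every thread is the queue tail.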
Recent papers show that performance gains can be obtained on NUMA architectures by developing hierarchical locks, i.e., general-purpose mutual-exclusion locks that encourage threads with high mutual memory locality to acquire the lock consecutively, thus reducing the overall level of cache misses when executing instructions in a critical section protected by the lock. For example, one paper describes a hierarchical back-off lock (referred to herein as an HBO lock). The HBO lock is a test-and-test-and-set lock augmented with a back-off scheme to reduce contention on the lock variable. The hierarchical back-off mechanism of the HBO lock allows the back-off delay to be tuned dynamically, so that when a thread notices that another thread from its own local cluster owns the lock, it can reduce its delay and increase its chances of acquiring the lock consecutively. However, because the locks are test-and-test-and-set locks, they incur invalidation traffic on every modification of the shared global lock variable, which is especially costly on NUMA machines. Moreover, the dynamic adjustment of back-off delay time in the lock introduces significant fairness issues. For example, it becomes likely that two or more threads from the same cluster will repeatedly acquire a lock while threads from other clusters starve.
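For illustration, the general shape of a hierarchical back-off lock of the kind described above may be sketched as follows. This is a simplified sketch under stated assumptions and does not reproduce the cited paper's algorithm. The two fixed delays `LOCAL_BACKOFF` and `REMOTE_BACKOFF` are hypothetical values that replace the dynamically tuned back-off. The names `HBOLock`, `backoff_delay`, and `hbo_demo` are likewise hypothetical, and an internal mutex emulates the atomic compare-and-swap that Python lacks.

```python
import threading
import time

FREE = -1                # lock word value when no cluster holds the lock
LOCAL_BACKOFF = 0.00001  # hypothetical delay when the owner is in my cluster
REMOTE_BACKOFF = 0.0001  # hypothetical delay when the owner is remote


def backoff_delay(my_cluster, owner_cluster):
    """Shorter wait when a thread of the caller's own cluster holds the lock."""
    return LOCAL_BACKOFF if owner_cluster == my_cluster else REMOTE_BACKOFF


class HBOLock:
    def __init__(self):
        self._owner = FREE              # cluster id of the holder, or FREE
        self._guard = threading.Lock()  # only emulates an atomic CAS

    def _cas(self, expected, new):
        with self._guard:
            if self._owner == expected:
                self._owner = new
                return True
            return False

    def acquire(self, cluster):
        while True:
            owner = self._owner             # "test": plain read
            if owner == FREE:
                if self._cas(FREE, cluster):  # "test-and-set"
                    return
                continue
            time.sleep(backoff_delay(cluster, owner))

    def release(self):
        with self._guard:
            self._owner = FREE


def hbo_demo(num_threads=4, iterations=100):
    """Hypothetical driver: threads alternate between clusters 0 and 1."""
    lock = HBOLock()
    total = [0]

    def worker(cluster):
        for _ in range(iterations):
            lock.acquire(cluster)
            total[0] += 1  # critical section
            lock.release()

    threads = [threading.Thread(target=worker, args=(i % 2,))
               for i in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return total[0]
```

Every acquisition and release in this sketch writes the single shared `_owner` word, which corresponds to the invalidation traffic noted above. The shorter local delay favors threads of the owning cluster, which is the source of the fairness concern.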
Another paper describes a hierarchical version of the CLH queue-locking algorithm (referred to herein as an HCLH lock). The HCLH algorithm collects requests on each chip into a local CLH-style queue, and then allows the thread at the head of the queue to integrate each chip's queue into a single global queue. This avoids the overhead of spinning on a shared location and prevents starvation issues. However, the algorithm forms the local queues of waiting threads by having each thread perform a register-to-memory-swap (SWAP) operation on the shared head of the local queue. These SWAPs to a shared location cause a bottleneck and introduce significant overhead. For example, the thread merging the local queue into the global queue must either wait for a long period of time or merge an unacceptably short local queue into the global queue. Furthermore, the HCLH mechanism includes complex condition checks along its critical execution path in order to determine if a thread must perform the operations of merging local CLH queues with the global queue.
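The local-queue and merge steps described above may be pictured with the following single-threaded sketch. It is a simplification offered for illustration only. It omits the per-node state flags and condition checks of the actual HCLH algorithm, and it models SWAP as an ordinary assignment because no concurrency is involved. The names `Node`, `Tail`, `enqueue_local`, `splice`, `queue_order`, and `hclh_demo` are hypothetical.

```python
class Node:
    """Queue entry that links backward to its predecessor, CLH style."""

    def __init__(self, thread_id):
        self.thread_id = thread_id
        self.pred = None


class Tail:
    """Shared queue tail; swap() models an atomic SWAP (single-threaded here)."""

    def __init__(self):
        self.node = None

    def swap(self, new):
        old, self.node = self.node, new
        return old


def enqueue_local(local_tail, node):
    """Each thread SWAPs itself onto its cluster's shared local tail.

    Returns True when the node is first in its batch (the cluster master).
    """
    node.pred = local_tail.swap(node)
    return node.pred is None


def splice(local_tail, global_tail, master):
    """The master appends its cluster's whole batch to the global queue."""
    batch_tail = local_tail.swap(None)          # detach the local batch
    master.pred = global_tail.swap(batch_tail)  # one SWAP on the global tail


def queue_order(global_tail):
    """Lock acquisition order, oldest first, found by walking pred links."""
    ids = []
    node = global_tail.node
    while node is not None:
        ids.append(node.thread_id)
        node = node.pred
    return ids[::-1]


def hclh_demo():
    local0, local1, global_tail = Tail(), Tail(), Tail()
    a, b = Node("A"), Node("B")               # threads on cluster 0
    c, d, e = Node("C"), Node("D"), Node("E")  # threads on cluster 1
    assert enqueue_local(local0, a)       # A becomes master of cluster 0
    assert not enqueue_local(local0, b)
    assert enqueue_local(local1, c)       # C becomes master of cluster 1
    enqueue_local(local1, d)
    enqueue_local(local1, e)
    splice(local1, global_tail, c)        # cluster 1 merges first
    splice(local0, global_tail, a)
    return queue_order(global_tail)
```

In this sketch threads of the same cluster remain adjacent in the global queue. The length of each batch depends on when the master calls `splice`, which reflects the trade-off noted above between waiting longer and merging a short local queue.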