Computer systems can suffer severe performance degradation as a result of high lock contention. Locks are fundamental software synchronization primitives for controlling concurrent access to shared data. They are used to ensure that only one process at a time can access a shared resource, such as a region of memory. The act of accessing the shared resource is also referred to as entering the critical region. Most shared-memory architectures, such as SMP (Symmetric Multi-Processor) and NUMA (Non-Uniform Memory Access) architectures, also provide hardware support for making mutually exclusive accesses to shared cache data. This hardware support is known as a cache coherence mechanism. The performance degradation occurs under high lock contention because only one CPU can acquire the lock and do useful work while all other CPUs must wait for the lock to be released.
Spin locks are a very simple and, in the right context, efficient method of synchronizing access to shared resources. A “spin lock” is so named because if a spin lock is not available, the caller busy-waits (or “spins”) until the lock becomes available. This is called “software spin-waiting.” The algorithm for spin-waiting is very simple. Each process must check a shared lock variable. When a shared resource is required, a process calls lock( ) to test and set the variable, and when the resource is ready to be released, the process calls unlock( ) to clear the variable. The lock( ) function causes a waiting process to loop until the resource is available. The availability of the resource is defined by the value of the shared lock variable. Conventionally, if the value of this variable is zero, the resource is available; if the value is non-zero, the resource is locked and in use by another process.
Systems employing locks typically require that a given process perform an atomic operation to obtain access to a shared data structure. In other words, another process cannot access the lock between the test and set portions of the atomic operation. The test-and-set operation is performed to determine whether a lock variable associated with the shared data structure is cleared and to atomically set the lock variable. That is, the test allows the process to determine whether the shared data structure is free of a lock by another process, and the set operation allows the process to acquire the lock if the lock variable is cleared.
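The lock( ) and unlock( ) routines and the atomic test-and-set operation described above can be illustrated with a minimal sketch in C11. The names tas_lock_t, tas_init, test_and_set, tas_lock, and tas_unlock are hypothetical, and the use of an atomic exchange from <stdatomic.h> as the test-and-set primitive is an assumption made for illustration, not a description of any particular system.

```c
#include <assert.h>
#include <stdatomic.h>

/* Hypothetical lock type: 0 means available, non-zero means in use. */
typedef struct { atomic_int locked; } tas_lock_t;

static void tas_init(tas_lock_t *l) { atomic_init(&l->locked, 0); }

/* Atomically set the lock variable to 1 and return its previous value.
   No other process can observe the variable between the test and the set. */
static int test_and_set(tas_lock_t *l) {
    return atomic_exchange_explicit(&l->locked, 1, memory_order_acquire);
}

/* lock( ): spin until the previous value was 0, meaning we acquired it. */
static void tas_lock(tas_lock_t *l) {
    while (test_and_set(l) != 0) {
        /* busy-wait ("spin") */
    }
}

/* unlock( ): clear the variable so another process can acquire the lock. */
static void tas_unlock(tas_lock_t *l) {
    assert(atomic_load_explicit(&l->locked, memory_order_relaxed) != 0);
    atomic_store_explicit(&l->locked, 0, memory_order_release);
}
```

A caller would bracket the critical region with tas_lock(&l) and tas_unlock(&l). Note that every failed attempt in this sketch performs an atomic write to the shared variable, which is one reason contention is costly.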
The key to lock design is to achieve both fairness at high contention and performance at low contention. Fairness in the context of lock design is defined as equalizing the ability of all contenders vying for a lock to acquire that lock, regardless of the contenders' position and/or relationship to the existing lock holder. Existing locking mechanisms have achieved fairness or high performance, but not both. Specifically, four of the most widely used lock designs are spin locks, queue locks, adaptive locks, and fairlocks.
Spin locks are widely used due to their high performance at low contention. [See T. E. Anderson, The Performance Implications of Spin Lock Alternatives for Shared-Memory Multiprocessors. IEEE Transactions on Parallel and Distributed Systems, 1(1):6-16, January 1990] In a spin lock implementation, if the test of the lock variable indicates that another process has acquired the lock to the shared memory region, the requesters for the lock enter a loop in which the lock variable is continuously read until it is cleared, at which time the waiting processes reinitiate the atomic test-and-set operation. In NUMA architectures, spin locks can cause lock starvation because of the unfairness created by the large difference in memory access latency across nodes. A CPU can access a spin lock in its local cache much faster than a CPU can access the same spin lock from another node. Therefore, a CPU on the node where the spin lock is located has a much higher chance of acquiring the lock than a CPU that resides on another node.
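The read-then-retry loop described above, in which waiters read the lock variable until it is cleared and only then reissue the atomic test-and-set, can be sketched as follows. This is an illustrative C11 sketch under stated assumptions: the names ttas_lock_t, ttas_try_lock, ttas_lock, and ttas_unlock are hypothetical, and an atomic exchange stands in for the test-and-set operation.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical lock type: 0 means available, non-zero means in use. */
typedef struct { atomic_int locked; } ttas_lock_t;

/* One atomic test-and-set attempt; returns true if the lock was acquired. */
static bool ttas_try_lock(ttas_lock_t *l) {
    return atomic_exchange_explicit(&l->locked, 1, memory_order_acquire) == 0;
}

static void ttas_lock(ttas_lock_t *l) {
    while (!ttas_try_lock(l)) {
        /* Lock is held: spin on plain reads, which can be served from the
           local cache, until the variable is observed to be cleared. */
        while (atomic_load_explicit(&l->locked, memory_order_relaxed) != 0) {
            /* busy-wait */
        }
        /* Variable was cleared: loop back and retry the test-and-set. */
    }
}

static void ttas_unlock(ttas_lock_t *l) {
    assert(atomic_load_explicit(&l->locked, memory_order_relaxed) != 0);
    atomic_store_explicit(&l->locked, 0, memory_order_release);
}
```

Nothing in this loop orders the waiters. Whichever CPU observes the cleared variable first and wins the retry acquires the lock, which is consistent with the node-local advantage described above.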
Queue locks [see J. M. Mellor-Crummey and M. L. Scott, Algorithms for Scalable Synchronization on Shared-Memory Multiprocessors. ACM Transactions on Computer Systems, 9(1):21-65, February 1991] and adaptive locks [see B.-H. Lim and A. Agarwal, Reactive Synchronization Algorithms for Multiprocessors. ASPLOS 1994] avoid unfairness under high contention by introducing complex lock data structures and algorithms to keep track of the usage and ownership of the lock. However, these data structures and algorithms also introduce additional locking latency overhead, thereby sacrificing locking performance at low contention.
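To make the extra bookkeeping of a queue lock concrete, the following is a minimal C11 sketch of a list-based queue lock in the style of the MCS lock from the Mellor-Crummey and Scott paper cited above. It is an illustration under stated assumptions rather than the cited authors' code: the names mcs_node_t, mcs_lock_t, mcs_acquire, and mcs_release are hypothetical, and each contender is assumed to supply its own queue node.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Per-contender queue node (hypothetical layout). */
typedef struct mcs_node {
    _Atomic(struct mcs_node *) next;  /* successor in the queue, if any */
    atomic_int waiting;               /* 1 while this contender must spin */
} mcs_node_t;

/* The lock itself is a pointer to the tail of the waiter queue. */
typedef struct { _Atomic(mcs_node_t *) tail; } mcs_lock_t;

static void mcs_acquire(mcs_lock_t *l, mcs_node_t *me) {
    atomic_store_explicit(&me->next, (mcs_node_t *)NULL, memory_order_relaxed);
    atomic_store_explicit(&me->waiting, 1, memory_order_relaxed);
    /* Append this node to the queue in one atomic step. */
    mcs_node_t *prev =
        atomic_exchange_explicit(&l->tail, me, memory_order_acq_rel);
    if (prev == NULL)
        return;                       /* queue was empty: lock acquired */
    atomic_store_explicit(&prev->next, me, memory_order_release);
    /* Spin on our own node until the predecessor hands the lock over. */
    while (atomic_load_explicit(&me->waiting, memory_order_acquire)) {
        /* busy-wait */
    }
}

static void mcs_release(mcs_lock_t *l, mcs_node_t *me) {
    assert(atomic_load_explicit(&l->tail, memory_order_relaxed) != NULL);
    mcs_node_t *succ = atomic_load_explicit(&me->next, memory_order_acquire);
    if (succ == NULL) {
        /* No visible successor: try to reset the queue to empty. */
        mcs_node_t *expected = me;
        if (atomic_compare_exchange_strong_explicit(
                &l->tail, &expected, (mcs_node_t *)NULL,
                memory_order_acq_rel, memory_order_acquire))
            return;
        /* A contender is mid-enqueue: wait for it to link itself in. */
        while ((succ = atomic_load_explicit(&me->next,
                                            memory_order_acquire)) == NULL) {
            /* busy-wait */
        }
    }
    /* Hand the lock to the next contender in arrival order. */
    atomic_store_explicit(&succ->waiting, 0, memory_order_release);
}
```

Compared with a plain spin lock, even an uncontended acquire and release pair in this sketch initializes a node and performs an atomic exchange plus a compare-and-swap, which illustrates the low-contention overhead noted above.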
Fairlocks provide fairness at high contention by using a bit mask to keep track of contending CPUs and enforce fairness by asking the lock-releasing CPU to explicitly yield the lock to other contending CPUs. [see S. Swaminathan, J. Stultz, J. F. Vogel, P. McKenney, Fairlocks—A High Performance Fair Locking Scheme. 14th International Conference on Parallel and Distributed Computing and Systems, November 2002] Due to the simpler data structure and algorithm used, fairlocks have better locking performance than queue locks and adaptive locks at low contention. But their locking performance at low contention is still worse than that of spin locks.
There is a need for an improved lock mechanism that overcomes the shortcomings of the prior art.