1. Field of the Invention
The present invention relates generally to locking/unlocking mechanisms for controlling concurrent access to objects in a digital computer system and, more particularly, to a locking and unlocking mechanism with minimal cost in both time and space.
2. Background Description
Controlling concurrent access to data structures is a fundamental problem in both uniprocessor and multiprocessor systems. In multiprocessor systems access may be truly concurrent; in uniprocessor systems interrupts and time slicing may occur in the midst of an operation that must be atomic to maintain correctness.
Concurrent access must be controlled for any shared resource that might be accessed by more than one concurrent process. For instance, records in a bank account database must be locked so that a customer at an ATM and a teller at a workstation do not simultaneously modify the same account record. Similarly, a printer connected to a personal computer must be locked so that a word processor and a spreadsheet do not begin printing at the same time; instead, one application must wait until the other finishes.
One of the most popular methods for controlling concurrent access to objects is to associate a lock with each object. The term object refers to a data structure that is a unit of atomicity; other literature may also use the terms record or block. A lock is assigned to a thread of control, a process, a processor, or whatever unit of concurrency is being employed. We will use the term thread for the unit of concurrency.
While one thread owns the lock on an object, no other thread may perform any operations upon that object. This is the principle of mutual exclusion.
If a thread attempts to lock an object and discovers that the object is already locked, it may not perform operations on that object. The thread may either (1) give up and perform other operations, perhaps attempting to lock the object again later; (2) place itself on a queue of threads waiting to be granted access to the object; or (3) continuously retry the locking operation until it succeeds (known as spin-locking).
The issues surrounding concurrency control and locking are discussed in detail in the article A Survey of Synchronization Methods for Parallel Computers, by Anne Dinning, IEEE Computer volume 22, number 7, Jul. 1989, and in the books Operating System Concepts by Abraham Silberschatz and James L. Peterson, Addison-Wesley 1988, and Concurrency Control and Recovery in Database Systems by Philip A. Bernstein, Vassos Hadzilacos, and Nathan Goodman, Addison-Wesley 1987.
Whatever type of locking is employed, it must be implemented using operations that are atomic, that is, uninterruptible and indivisible. Such operations are typically provided as special machine instructions, such as the CMPXCHG instruction of the Intel Pentium processors, and the Load and Reserve and Store Conditional instructions of the PowerPC processors.
The present invention is described below using an abstract atomic operation called CompareAndSwap, which can be implemented using the CMPXCHG instruction, Load and Reserve/Store Conditional instructions, or whatever atomic primitive is available on the computer hardware. CompareAndSwap takes three parameters: address, oldValue, and newValue. It examines the value stored in memory at address, and if that value is equal to oldValue, it changes it to newValue and returns true; otherwise it leaves the value at address unchanged and returns false.
The CompareAndSwap operation is atomic: any other operation on the value stored at address must either complete before the CompareAndSwap begins or must wait until the CompareAndSwap completes.
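As an illustration only, the CompareAndSwap semantics described above can be sketched in C using the C11 `<stdatomic.h>` facilities. The function name `compare_and_swap` and the choice of `uintptr_t` as the word type are assumptions made for this sketch; the compiler maps the call onto whatever atomic primitive the target hardware provides.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: if *address equals old_value, store new_value and
   return true; otherwise leave *address unchanged and return false. */
bool compare_and_swap(_Atomic uintptr_t *address,
                      uintptr_t old_value, uintptr_t new_value)
{
    /* The C11 call overwrites its "expected" argument on failure, so a
       local copy is used to keep the three-parameter interface. */
    uintptr_t expected = old_value;
    return atomic_compare_exchange_strong(address, &expected, new_value);
}
```

A caller that wants to change a word from 5 to 7 would invoke `compare_and_swap(&word, 5, 7)` and act on the boolean result.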
One way to implement efficient locks is to use spin locking. Each lockable object contains a one-word lock field that is zero when the object is unlocked and otherwise holds the identifier of the owning thread. When a thread needs to lock an object, it enters a loop that repeatedly tests whether the object is unlocked (lock=0) and, if so, attempts to claim the lock by atomically setting the lock field to its own thread identifier (thread).
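A minimal sketch of such a spin lock, assuming C11 atomics, is given below. The type `object_t` and the functions `spin_try_lock`, `spin_lock`, and `spin_unlock` are hypothetical names chosen for illustration, and thread identifiers are assumed to be nonzero.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

typedef struct {
    _Atomic uintptr_t lock;   /* 0 = unlocked, otherwise the owner's id */
} object_t;

/* One attempt: test whether the lock is free, then try to claim it. */
bool spin_try_lock(object_t *obj, uintptr_t thread)
{
    uintptr_t unlocked = 0;
    return atomic_load(&obj->lock) == 0 &&
           atomic_compare_exchange_strong(&obj->lock, &unlocked, thread);
}

/* Retry until the claim succeeds. */
void spin_lock(object_t *obj, uintptr_t thread)
{
    while (!spin_try_lock(obj, thread)) {
        /* spin */
    }
}

/* Only the owner is expected to call this. */
void spin_unlock(object_t *obj)
{
    atomic_store(&obj->lock, 0);
}
```

The only space overhead is the single `lock` word in each object, and an uncontended acquisition costs one load and one atomic compare-and-swap.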
Spin locking has a number of major advantages: it is simple to implement; it requires only one word of space overhead in the object; and if locks are released quickly it is very efficient.
However, spin locking also suffers from some major disadvantages, particularly on a uniprocessor. If locks are not released quickly, or if contention for shared objects is high, then a large amount of computation will be wasted in "spinning". On a uniprocessor, the spin-lock loop is usually modified so that the processor is yielded every time the lock acquisition fails, in order that the thread does not waste an entire time slice in spinning while other threads are waiting to run.
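The yielding variant can be sketched as follows, assuming a POSIX system where `sched_yield()` is available from `<sched.h>`. The names `object_t`, `lock_with_yield`, and `release_lock` are hypothetical, and the count of yields is returned only so that the cost of contention can be observed.

```c
#include <sched.h>        /* sched_yield (POSIX, assumed available) */
#include <stdatomic.h>
#include <stdint.h>

typedef struct {
    _Atomic uintptr_t lock;   /* 0 = unlocked, otherwise the owner's id */
} object_t;

/* Acquire the lock, giving up the processor after every failed attempt.
   Returns how many times the thread yielded before it succeeded. */
unsigned long lock_with_yield(object_t *obj, uintptr_t thread)
{
    unsigned long yields = 0;
    for (;;) {
        uintptr_t unlocked = 0;
        if (atomic_compare_exchange_strong(&obj->lock, &unlocked, thread))
            return yields;
        sched_yield();        /* let the owner (or another thread) run */
        yields++;
    }
}

void release_lock(object_t *obj)
{
    atomic_store(&obj->lock, 0);
}
```

Each yield is a trip through the scheduler, which is why this approach becomes expensive when many threads are waiting on the same object.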
With spin-locking, the queues for the objects being locked are essentially encoded in the thread scheduler. When there is not much locking, this works very well. When locking is frequent and/or contention is high, then on a uniprocessor a great deal of time is wasted in scheduling threads which immediately yield again because they still cannot acquire the desired lock. On a multiprocessor, spin-locking generates considerable excess traffic to main memory, and this also degrades performance. A good summary and investigation of the multiprocessor performance issues is The Performance of Spin Lock Alternatives for Shared-Memory Multiprocessors, by T. E. Anderson, IEEE Transactions on Parallel and Distributed Systems, volume 1, number 1, January 1990.
Finally, with spin-locking, the order in which locks are granted is non-deterministic and potentially unfair. That is, the first thread to attempt to lock an object may have to wait arbitrarily long while many other threads obtain the lock.
The primary alternative to spin-locking is queued locking. When a thread fails to obtain a lock on an object, it places itself on a queue of threads waiting for that object, and then suspends itself. When the thread that owns the lock releases the lock, it checks if any threads are enqueued on the object. If so, it removes the first thread from the queue, locks the object on behalf of the waiting thread, and resumes the waiting thread.
Unlike spin-locking, queued locking is fair. Performance is good except when objects are locked for short periods of time and there is contention for them; in that case the overhead of enqueueing and suspending becomes a factor. However, when objects are locked for longer periods of time and/or when contention is low, queued locking is generally more efficient than spin-locking.
The basic problem with queued locking has to do with the management of the queues. The queues for a shared object are themselves shared objects (even while the object is locked). Therefore, some sort of mechanism is required to assure mutual exclusion on the object queues.
Furthermore, there is a race condition inherent in the lock release policy: one thread may attempt to enqueue for the object at the same time that the owning thread is releasing the lock.
The simplest way to solve both of these problems is to use a global spin-lock to guard the short critical sections for lock acquisition, release, and enqueueing. Every object now contains not only a lock field but also a queue field.
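The following C sketch illustrates this arrangement under several assumptions: the global guard is a C11 `atomic_flag` spin-lock, the queue is a singly linked list of hypothetical `thread_rec` records, and suspending or resuming a thread is left to the caller because it depends on the thread package in use. All names are invented for illustration.

```c
#include <stdatomic.h>
#include <stddef.h>

typedef struct thread_rec {
    int id;                     /* nonzero thread identifier */
    struct thread_rec *next;    /* link in an object's wait queue */
} thread_rec;

typedef struct {
    int lock;                   /* 0 = unlocked, otherwise the owner's id */
    thread_rec *queue_head;     /* first waiting thread, or NULL */
    thread_rec *queue_tail;     /* last waiting thread, or NULL */
} object_t;

/* One global spin-lock guards the lock and queue fields of every object. */
static atomic_flag global_guard = ATOMIC_FLAG_INIT;

static void guard_enter(void)
{
    while (atomic_flag_test_and_set(&global_guard)) { /* spin */ }
}

static void guard_leave(void)
{
    atomic_flag_clear(&global_guard);
}

/* Returns 1 if self now owns obj; returns 0 if self was placed on the
   queue, in which case the caller should suspend itself. */
int lock_or_enqueue(object_t *obj, thread_rec *self)
{
    int acquired = 0;
    guard_enter();
    if (obj->lock == 0) {
        obj->lock = self->id;
        acquired = 1;
    } else {
        self->next = NULL;
        if (obj->queue_tail != NULL)
            obj->queue_tail->next = self;
        else
            obj->queue_head = self;
        obj->queue_tail = self;
    }
    guard_leave();
    return acquired;
}

/* Releases obj. If a thread is waiting, the lock is handed to the first
   waiter, which is returned so the caller can resume it; else NULL. */
thread_rec *unlock_and_hand_off(object_t *obj)
{
    guard_enter();
    thread_rec *next = obj->queue_head;
    if (next != NULL) {
        obj->queue_head = next->next;
        if (obj->queue_head == NULL)
            obj->queue_tail = NULL;
        obj->lock = next->id;   /* lock on behalf of the waiting thread */
    } else {
        obj->lock = 0;
    }
    guard_leave();
    return next;
}
```

Because enqueueing and releasing both run under the same guard, a thread cannot enqueue itself in the window between the owner's queue check and its release of the lock. The price is that even the uncontended path must acquire and release the global guard.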
Unfortunately, locking an unlocked object (the most common case) has now become significantly slower and more complex. There is also a global lock for which there could be significant contention as the number of threads increases (that is, the solution does not scale).
However, this problem can be solved with some extra hardware support, in particular an atomic CompareAndSwapDouble machine instruction that compares and swaps two adjacent words in a single atomic step. Such hardware support is available on Intel Pentium processors in the form of the CMPXCHG8B instruction.
With CompareAndSwapDouble, an atomic operation can be performed which simultaneously releases the lock and makes sure that the queue of waiting threads is empty.
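A sketch of such a release operation is shown below. As a stand-in for CompareAndSwapDouble it uses a C11 64-bit compare-and-swap over two adjacent 32-bit words, which is comparable in width to what CMPXCHG8B operates on. Representing the queue as a 32-bit head value (0 meaning empty) and the names `object_t`, `pack_words`, and `release_if_no_waiters` are assumptions made for illustration.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Two adjacent 32-bit words held in one 64-bit atomic:
   high word = lock owner (0 = unlocked), low word = queue head (0 = empty). */
typedef struct {
    _Atomic uint64_t lock_and_queue;
} object_t;

uint64_t pack_words(uint32_t owner, uint32_t queue_head)
{
    return ((uint64_t)owner << 32) | queue_head;
}

/* Releases the lock only if thread owns it AND no thread is queued.
   Returns false, leaving both words unchanged, if a waiter is present;
   the caller would then fall back to a slower hand-off path. */
bool release_if_no_waiters(object_t *obj, uint32_t thread)
{
    uint64_t expected = pack_words(thread, 0);
    return atomic_compare_exchange_strong(&obj->lock_and_queue,
                                          &expected, pack_words(0, 0));
}
```

Since the owner and the queue head are compared and replaced together, a waiter cannot slip onto the queue between the emptiness check and the release.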
However, there are a number of serious drawbacks to using CompareAndSwapDouble: (1) it is slower than the single-word CompareAndSwap operation, (2) it requires that the lock and the queue be adjacent in memory, reducing flexibility and potential for space optimization, and (3) it is not available on many processors.