1. Field of the Invention
The present invention relates to mutual exclusion locks for synchronizing access to shared data, and, more particularly, to fault tolerant mutual exclusion locks for synchronizing access to shared data.
2. Description of the Related Art
In shared memory computing systems, processes in multi-process programs communicate by reading and writing shared data objects located in a shared memory. FIG. 1 illustrates one embodiment of a shared memory computing system 100, known to one skilled in the art. As illustrated, the shared memory computing system 100 may execute one or more processes 100. The one or more processes 100 can access one or more shared address spaces 110, which are also located in the shared memory computing system 100. Each of the one or more shared address spaces 110 contains one or more shared data objects 115. The one or more shared data objects 115 can be protected by one or more mutual exclusion locks 120.
Updating a shared data object often involves multiple steps. A process may be interrupted in the middle of such a sequence of steps, and, if the sequence is not protected by some mechanism, updates to the shared data object by multiple processes may occur concurrently and corrupt the shared data object. For example, suppose a shared counter initially holds the value 10 and two processes each attempt to increment it. Both processes read the value 10, and then each writes the value 11 to the shared counter. The shared counter is now corrupted because, after two increments, the correct value should be 12.
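The counter example above can be replayed step by step. The following is a minimal sketch rather than a definitive implementation: it runs the problematic interleaving in a single thread so that the outcome is deterministic, and the names `shared_counter` and `lost_update_demo` are hypothetical, chosen only for illustration.

```c
/* Hypothetical shared counter; in a real system it would live in a
   shared address space and be updated by separate processes. */
static int shared_counter = 10;

/* Replays the unprotected interleaving: both "processes" read the
   counter before either one writes it back. */
int lost_update_demo(void)
{
    int seen_by_a = shared_counter;   /* process A reads 10 */
    int seen_by_b = shared_counter;   /* process B reads 10 */

    shared_counter = seen_by_a + 1;   /* process A writes 11 */
    shared_counter = seen_by_b + 1;   /* process B writes 11, overwriting A's update */

    return shared_counter;            /* one of the two increments is lost */
}
```

Calling `lost_update_demo()` once leaves the counter at 11 even though two increments were attempted.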
Conventionally, mutual exclusion locks are used to guarantee exclusive access to a shared data object by one process at a time. Several such locking methods are known and are widely used. Some are more suitable for small and low-contention systems, such as the Test-and-Set lock and the Test-and-Test-and-Set lock. Others are more suitable for high-contention systems, such as queue-based locks. One or more shared data objects may be protected by one or more mutual exclusion locks. To update a shared data object, for example, a shared counter protected by a lock, a process must first acquire the lock associated with the shared data object, execute a sequence of operations on the shared data object, and then release the lock. The sequence of operations in the case of the shared counter is reading the shared counter's value and then writing a new value that is one more than the previously read value (i.e., incrementing the shared counter by one). A mutual exclusion lock guarantees that the lock cannot be held by more than one process at the same time, and hence that the protected shared data object is operated on by only one process at a time. A process is said to “hold a lock” if the process has acquired the lock but has not released it yet.
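The acquire, operate, release sequence for the shared counter can be sketched as follows. This is an illustrative sketch under stated assumptions, not a definitive implementation: it uses the C11 `atomic_flag` type as a simple Test-and-Set lock, and the names `counter_lock`, `shared_counter`, and `locked_increment` are hypothetical.

```c
#include <stdatomic.h>

/* Hypothetical lock and counter; the names are illustrative only. */
static atomic_flag counter_lock = ATOMIC_FLAG_INIT;
static int shared_counter = 10;

/* Increments the counter while holding the lock and returns the new value. */
int locked_increment(void)
{
    /* Acquire: spin until test-and-set reports that the flag was clear. */
    while (atomic_flag_test_and_set_explicit(&counter_lock,
                                             memory_order_acquire)) {
        /* busy-wait: another process currently holds the lock */
    }

    /* Operate on the shared data object: read, then write value + 1. */
    int value = shared_counter;
    shared_counter = value + 1;
    int result = shared_counter;

    /* Release: clear the flag so that another process can acquire it. */
    atomic_flag_clear_explicit(&counter_lock, memory_order_release);
    return result;
}
```

Because the read and the write both occur while the lock is held, two calls starting from 10 yield 11 and then 12, regardless of how the callers are scheduled.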
However, while holding a mutual exclusion lock, a process may fail for a variety of reasons, such as accidental or intentional termination by a human, lack of system resources, heuristic deadlock recovery mechanisms, etc. In such cases, without mechanisms for detecting and recovering from such a situation, the associated shared data object may remain locked indefinitely. Often human detection of the situation is needed and sometimes the only solution is restarting the system or the program.
Conventional locks in the prior art do not detect and recover from process failures. FIG. 2A illustrates a possible implementation 200 of the operation of the Test-and-Test-and-Set lock, used on the vast majority of current shared memory systems. A process that needs to acquire the lock bit executes the lock acquire routine 210. The lock acquire routine 210 reads (at 215) the value of the lock bit into a private register. If the lock bit is busy (at 220), the routine continues to read (at 215) the value of the lock bit until the lock bit is not busy. Conventionally, the lock bit is busy if its value is one. If the lock bit is not busy (at 220), the process executes (at 225) a Test-and-Set operation on the lock bit. The Test-and-Set (TAS) operation is supported in hardware in one way or another on almost all current processors.
Referring now to FIG. 2B, the TAS operation 230 atomically (i.e., without interleaving access by other processes) reads (at 235) the value of a shared variable (in this case a lock bit). If the lock bit is clear (at 240), the TAS operation sets (at 245) the lock bit and returns (at 245) a value of zero, indicating the lock bit was clear. If the lock bit is not clear (at 240), the TAS operation returns (at 250) a value of one, indicating the lock bit was not clear.
Referring again to FIG. 2A, if the TAS operation 230 is not successful (i.e., the TAS operation 230 returns a value of one) (at 255), the lock acquire routine 210 restarts at step 215. If the TAS operation 230 is successful (i.e., the TAS operation 230 returns a value of zero) (at 255), the lock acquire routine 210 proceeds to operate (at 270) on the shared data object protected by the lock bit.
Referring now to FIG. 2C, a process that needs to release the lock bit executes the lock release routine 275. In the illustrated embodiment, the lock release routine 275 clears (at 280) the lock bit. Once the lock bit is cleared (at 280), the lock bit can be acquired by the same or another process.
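The routines of FIGS. 2A-2C can be sketched in C11 as follows. This is a minimal sketch under stated assumptions, not the implementation shown in the figures: the function names are hypothetical, the lock bit is modeled as an `atomic_bool`, and the TAS operation is modeled with `atomic_exchange`, following the convention that it returns the previous value of the lock bit (zero if the bit was clear, one if it was busy).

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical lock bit: false means clear, true means busy. */
static atomic_bool lock_bit = false;

/* TAS operation (cf. FIG. 2B): atomically sets the bit and returns its
   previous value, 0 if it was clear (success) or 1 if it was busy. */
int test_and_set(atomic_bool *bit)
{
    return atomic_exchange(bit, true) ? 1 : 0;
}

/* Lock acquire routine (cf. FIG. 2A). */
void lock_acquire(atomic_bool *bit)
{
    for (;;) {
        /* Steps 215/220: read the lock bit until it is not busy. */
        while (atomic_load(bit)) {
            /* busy-wait */
        }
        /* Steps 225/255: attempt TAS; zero means the lock was acquired. */
        if (test_and_set(bit) == 0)
            return;
        /* TAS failed because another process won the race; start over. */
    }
}

/* Lock release routine (cf. FIG. 2C, step 280): clear the lock bit. */
void lock_release(atomic_bool *bit)
{
    atomic_store(bit, false);
}
```

A process would call `lock_acquire(&lock_bit)`, operate on the protected shared data object, and then call `lock_release(&lock_bit)`. Spinning on a plain read before attempting TAS is what distinguishes this lock from a simple Test-and-Set lock.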
It is apparent from the above description that a conventional lock such as the Test-and-Test-and-Set lock cannot recover if a process fails while holding it. Without external intervention, other processes may wait forever in the lock acquire routine 210 for a lock that will never be released. Current locks do not detect and recover from process failures that lead to a deadlock situation, where processes wait for an event that will never happen. As such, a fault tolerant mutual exclusion lock is needed to solve this problem.
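The failure scenario can be illustrated with a short sketch. The code below is hypothetical and serves only to demonstrate the problem: `acquire_then_fail` models a process that sets the lock bit and then terminates without releasing it, and `bounded_acquire` is an illustrative helper that gives up after a fixed number of attempts so that the demonstration terminates. The lock acquire routine 210 described above has no such bound and would spin indefinitely.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Models a process that acquires the lock and then fails: the lock bit
   is set, and no release ever follows. */
void acquire_then_fail(atomic_bool *bit)
{
    atomic_exchange(bit, true);
    /* the process terminates here without clearing the lock bit */
}

/* Illustrative helper (an assumption, not part of the conventional lock):
   tries to acquire the lock at most max_attempts times. Returns true if
   the lock was acquired and false if every attempt found it busy. */
bool bounded_acquire(atomic_bool *bit, unsigned long max_attempts)
{
    for (unsigned long i = 0; i < max_attempts; i++) {
        /* Test first, then test-and-set only if the bit appears clear. */
        if (!atomic_load(bit) && !atomic_exchange(bit, true))
            return true;
    }
    return false;
}
```

After `acquire_then_fail` has run on a lock bit, `bounded_acquire` on the same bit returns false no matter how large `max_attempts` is, because nothing in the lock itself ever clears the bit.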