FIG. 1 illustrates a block diagram of a shared memory computer system 10 that includes a plurality of processes P1, P2, . . . , Pn-1, Pn operating simultaneously and in parallel under the control of a virtual operating system 14. In such shared memory computing environments, each process is capable of directly addressing memory, indicated generally as 12, and particularly, the processes may simultaneously access a memory resource 13, creating the possibility of collision. It is understood that processes may likewise simultaneously access a hardware or other shared resource. Thus, such a system requires a mechanism for ensuring mutually exclusive access to the shared resource, e.g., 13.
A primitive concept for providing mutual exclusion in the virtual operating system 14 and preventing conflict between two or more processors attempting to access the shared memory location is to provide a semaphore, which is a mutual exclusion construct or variable having a binary value of either "0" or "1". However, operating system implemented semaphores may require context switching, leading to overhead on the order of thousands of instructions. As a result, such semaphores are usually quite expensive to implement. If semaphores are frequently acquired and released, and if most acquisitions are uncontested (i.e., the semaphore is not held, and no other process is trying to acquire it), then the overhead of operating system semaphores can dominate the cost of computation.
Hardware supported semaphore implementations have been developed that guarantee mutual exclusion. For instance, semaphore implementations based on hardware-provided primitives such as test-and-set or compare-and-swap have been developed that are several orders of magnitude faster than operating system semaphores in that they consist of instructions that atomically read and then write to a single memory location. If a number of processors simultaneously attempt to update the same location, each processor will wait its turn. For critical sections operating on the shared memory structure, a lock is needed to provide mutual exclusion, and the atomic instructions are used to arbitrate between simultaneous attempts to acquire the lock. If the lock is busy, the processor attempting to acquire the lock can either relinquish its desire to obtain the lock so it can do other work, or it can wait or "spin" until the lock is released. In particular, an implementation in which a process repeatedly tries to acquire the lock in a tight loop is called a spin lock, and the activity of retrying is known as "busy waiting" or simply "spinning".
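By way of illustration only, the elementary spin lock described above may be sketched as follows. Because Python exposes no user-level atomic instruction, the sketch assumes a hypothetical AtomicFlag helper whose internal threading.Lock stands in for the atomicity that a hardware test-and-set instruction would provide. The names AtomicFlag, SpinLock, SPIN_PAUSE, and run_counter_demo are likewise assumptions of this sketch and are not taken from any cited implementation.

```python
import threading
import time

# Assumption: a brief pause between retries keeps the demo fast under CPython;
# a hardware spin lock would simply retry in a tight loop.
SPIN_PAUSE = 0.0001


class AtomicFlag:
    """Emulated test-and-set word (hypothetical helper, not a hardware API)."""

    def __init__(self):
        self._value = 0
        self._guard = threading.Lock()  # stands in for hardware atomicity

    def test_and_set(self):
        # Atomically read the old value and write 1 to the same location.
        with self._guard:
            old = self._value
            self._value = 1
            return old

    def clear(self):
        with self._guard:
            self._value = 0


class SpinLock:
    """Elementary spin lock: every contender retries on one shared flag."""

    def __init__(self):
        self._flag = AtomicFlag()

    def try_acquire(self):
        # Non-blocking attempt; on failure the caller may go do other work.
        return self._flag.test_and_set() == 0

    def acquire(self):
        # Busy waiting: retry until the value read back was 0 (lock was free).
        while not self.try_acquire():
            time.sleep(SPIN_PAUSE)

    def release(self):
        self._flag.clear()


def run_counter_demo(num_threads=4, increments=200):
    """Increment a shared counter under the spin lock; return the final total."""
    lock = SpinLock()
    total = [0]

    def worker():
        for _ in range(increments):
            lock.acquire()
            total[0] += 1  # critical section
            lock.release()

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return total[0]
```

In this sketch, try_acquire corresponds to the case in which a processor relinquishes its attempt when the lock is busy, while acquire corresponds to spinning until the lock is released.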
In elementary spin lock algorithms, all processors operating in a multi-processor system frequently access and attempt to write to a single lock control variable to obtain access to the shared memory location. On most modern computer architectures, each processor will also attempt to cache the control variable, i.e., spin on a location in its own cache. Since each update (or possibly even each attempted update that uses a synchronization instruction) will lead to cache invalidation messages being sent to all other processors, such elementary algorithms can overburden the caching system and hence, are not viable for processor-scalable architectures, i.e., shared memory multiprocessors of arbitrary size.
Many alternative software implemented elementary spin lock algorithms have been devised, and the reader is directed to Anderson, Thomas E., "The Performance of Spin Lock Alternatives for Shared Memory Multiprocessors", I.E.E.E. Transactions on Parallel and Distributed Systems, Vol. 1, No. 1, pp. 6-16, January 1990, and Graunke, Gary, et al., "Synchronization Algorithms for Shared Memory Multiprocessors", I.E.E.E. Computer, Vol. 23, No. 6, pp. 60-69, June 1990, for an assessment of performance characteristics. One spin lock, in particular, is an array-based queuing spin lock that comprises an explicit FIFO queue of spinning processors, each of which spins on its own lock flag in a separate cache block associated with the process. When one processor finishes executing on the shared resource, it de-queues itself and sets the flag of the next processor in the queue that is waiting for exclusive access to the resource, i.e., it passes ownership of the lock. The array-based queuing spin lock makes use of a variety of hardware supported instructions.
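A minimal sketch of an array-based queuing spin lock of the kind described above is given below for illustration. It assumes a fetch-and-increment primitive, emulated here by a hypothetical AtomicCounter helper, and it assumes that no more processes contend for the lock at one time than there are slots in the array. The names ArrayQueueLock, MUST_WAIT, HAS_LOCK, SPIN_PAUSE, and run_queue_demo are assumptions of this sketch.

```python
import threading
import time

MUST_WAIT, HAS_LOCK = 0, 1
SPIN_PAUSE = 0.0001  # assumption: brief pause so the demo stays fast in CPython


class AtomicCounter:
    """Emulated fetch-and-increment (hypothetical helper)."""

    def __init__(self):
        self._value = 0
        self._guard = threading.Lock()  # stands in for hardware atomicity

    def fetch_and_increment(self):
        with self._guard:
            old = self._value
            self._value += 1
            return old


class ArrayQueueLock:
    """Array-based queuing lock for at most `slots` concurrent contenders."""

    def __init__(self, slots):
        self._slots = slots
        # One flag per slot; on real hardware each would occupy its own cache block.
        self._flags = [MUST_WAIT] * slots
        self._flags[0] = HAS_LOCK  # the first arrival may proceed immediately
        self._next_slot = AtomicCounter()

    def acquire(self):
        # Take the next place in the FIFO queue and spin only on that slot's flag.
        my_slot = self._next_slot.fetch_and_increment() % self._slots
        while self._flags[my_slot] != HAS_LOCK:
            time.sleep(SPIN_PAUSE)
        self._flags[my_slot] = MUST_WAIT  # reset the slot for later reuse
        return my_slot

    def release(self, my_slot):
        # Pass ownership by setting the flag of the next slot in the queue.
        self._flags[(my_slot + 1) % self._slots] = HAS_LOCK


def run_queue_demo(num_threads=4, increments=100):
    """Increment a shared counter under the queue lock; return the final total."""
    lock = ArrayQueueLock(slots=num_threads)
    total = [0]

    def worker():
        for _ in range(increments):
            slot = lock.acquire()
            total[0] += 1  # critical section
            lock.release(slot)

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return total[0]
```

Each waiter in this sketch spins on a flag that only its predecessor writes, in contrast to an elementary spin lock where all contenders retry on a single control variable.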
One protocol, called the MCS-lock protocol, was developed to enable each lock acquisition and release with only a small number of accesses to remote memory locations. A detailed description of this protocol is to be found in John M. Mellor-Crummey and Michael L. Scott, "Algorithms for Scalable Synchronization on Shared-Memory Multiprocessors", A.C.M. Transactions on Computer Systems, 9(1):21-65, February 1991 (hereinafter the "MCS algorithm"), the contents and disclosure of which are wholly incorporated by reference herein.
In the MCS algorithm, the spin lock acquisition queue is constructed dynamically by means of a swap instruction, which is an atomically implemented function provided in the multiprocessing system architecture that exchanges the contents of a register with memory. In this algorithm, each processor of the multiprocessing system spins on a different variable in a different cache frame, with the first process in the queue holding the lock, and the next process in the queue acquiring the lock when the lock holding processor releases it. Because only O(1) cache-invalidation messages are sent by any acquisition or release of the lock in non-error cases, adding new processors does not significantly increase bus and invalidation traffic; thus, the scalability property is obtained. FIG. 1 illustrates the multiprocessing system including the MCS-algorithm spin-lock acquisition and release features 20. In the MCS-algorithm, an atomically implemented compare-and-swap function may be used during lock release to enable the releasing processor to determine if it is the only processor in the queue, and, if so, to remove itself from the queue upon release of the lock. Specifically, compare-and-swap compares the contents of a memory location against a first given value, returns a condition code to the user to indicate whether they are equal or not, and if they are equal, replaces the contents of the memory location with a second given value. For example, if the contents of the memory location are not equal to the first given value, a NULL value may be returned.
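The semantics of the swap and compare-and-swap functions described above may be illustrated by the following sketch. The AtomicCell class is a hypothetical helper that emulates an atomically updated memory location with an internal threading.Lock, since Python provides no such instruction directly; in this sketch the condition code is assumed to be a Boolean, and contents are compared by identity as a stand-in for comparing addresses.

```python
import threading


class AtomicCell:
    """Emulated atomically updated memory location (hypothetical helper)."""

    def __init__(self, value=None):
        self._value = value
        self._guard = threading.Lock()  # stands in for hardware atomicity

    def load(self):
        with self._guard:
            return self._value

    def swap(self, new):
        # Atomically store `new` and return the previous contents.
        with self._guard:
            old = self._value
            self._value = new
            return old

    def compare_and_swap(self, expected, new):
        # Replace the contents with `new` only if they are currently `expected`;
        # the Boolean result plays the role of the condition code.
        with self._guard:
            if self._value is expected:
                self._value = new
                return True
            return False


# Usage: a lock tail that is empty (None) or refers to the last queue entry.
tail = AtomicCell(None)
node_a, node_b = object(), object()

previous = tail.swap(node_a)                        # queue was empty
alone = tail.compare_and_swap(node_a, None)         # node_a was the only entry
tail.swap(node_a)
tail.swap(node_b)                                   # node_b enqueues behind node_a
still_alone = tail.compare_and_swap(node_a, None)   # fails: tail is now node_b
```

In the usage shown, previous is None, alone is True, and still_alone is False, which mirrors how a releasing processor can tell whether another processor has joined the queue behind it.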
FIG. 2A illustrates a queue node Qnode 15 that is associated with each process P1, P2, . . . , Pn of the multiprocessing system and that is a readable/writable data record or structure comprising a memory register or memory location located in the processor cache (not shown). In the MCS-algorithm, the Qnode 15 is constructed to contain a field having a locked flag 21, the contents of which represent the status of the spin lock for that process, with a value indicating whether the process owns the lock (OWNED), whether the process is spinning (WAITING) prior to lock acquisition, or whether the lock has been released (RELEASED), i.e., the lock has been transferred to the next process in the queue. As shown in FIG. 2A, the Qnode 15 also contains a next pointer 23, which is the queue link and comprises an address location of the next member of the queue structure that will hold the lock.
FIG. 2B illustrates the lock acquisition queue structure 30, which is a linked list of one or more Qnodes 15.sub.1, 15.sub.2, . . . , 15.sub.n corresponding to one or more processors P1, . . . , Pn, one of which holds (owns) the lock and has exclusive access to the shared resource, and the remainder of which are spinning (waiting) on the lock as indicated by the value in their locked flags. Usually, the first process Qnode, e.g., 15.sub.1, of the queue 30 holds the lock and the subsequent processes desiring to acquire the lock are spinning on the lock, with the next pointer of each Qnode, e.g., 23.sub.1, pointing to the address of the locked flag, e.g., 21.sub.2, of the next Qnode, e.g., 15.sub.2, of the queue that is to hold the lock. To add a new processor to the queue that desires lock ownership, a spin lock acquisition function is implemented by the processing system 10 which utilizes the swap function atomically to add the new processor to the queue. As shown in FIG. 2B, the MCS algorithm also makes use of a lock structure 24 which is represented by a single variable called a lock tail 25 that always points to the last node, e.g., 15.sub.n, on the queue as shown by the broken arrow and may be atomically updated by invocation of the swap and/or compare-and-swap functions.
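The Qnode, lock tail, and acquisition and release steps described above may be sketched as follows. This is a simplified illustration rather than the published MCS code: the atomic swap and compare-and-swap functions are emulated by a hypothetical AtomicCell helper, the next field refers to the successor Qnode object rather than to the address of its locked flag, and the names MCSLock, SPIN_PAUSE, and run_mcs_demo are assumptions of this sketch. The flag values WAITING, OWNED, and RELEASED follow the description above.

```python
import threading
import time

WAITING, OWNED, RELEASED = "WAITING", "OWNED", "RELEASED"
SPIN_PAUSE = 0.0001  # assumption: brief pause so the demo stays fast in CPython


class AtomicCell:
    """Emulated atomic memory location (hypothetical helper)."""

    def __init__(self, value=None):
        self._value, self._guard = value, threading.Lock()

    def load(self):
        with self._guard:
            return self._value

    def swap(self, new):
        with self._guard:
            old, self._value = self._value, new
            return old

    def compare_and_swap(self, expected, new):
        with self._guard:
            if self._value is expected:
                self._value = new
                return True
            return False


class QNode:
    """Per-process queue node holding a locked flag and a next link."""

    def __init__(self):
        self.locked = RELEASED
        self.next = None


class MCSLock:
    def __init__(self):
        self.tail = AtomicCell(None)  # lock tail: last node on the queue

    def acquire(self, node):
        node.next = None
        node.locked = WAITING
        pred = self.tail.swap(node)       # atomically become the last node
        if pred is None:
            node.locked = OWNED           # queue was empty: lock acquired
            return
        pred.next = node                  # link in behind the predecessor
        while node.locked != OWNED:       # spin on this node's own flag
            time.sleep(SPIN_PAUSE)

    def release(self, node):
        if node.next is None:
            # No visible successor: try to empty the queue.
            if self.tail.compare_and_swap(node, None):
                node.locked = RELEASED
                return
            # A successor has swapped the tail but not yet linked itself in.
            while node.next is None:
                time.sleep(SPIN_PAUSE)
        node.next.locked = OWNED          # pass ownership to the successor
        node.locked = RELEASED


def run_mcs_demo(num_threads=4, increments=100):
    """Increment a shared counter under the lock; return the final total."""
    lock, total = MCSLock(), [0]

    def worker():
        node = QNode()                    # one Qnode per process
        for _ in range(increments):
            lock.acquire(node)
            total[0] += 1                 # critical section
            lock.release(node)

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return total[0]
```

Note that the release path in this sketch waits for a successor that has swapped the lock tail but has not yet set the predecessor's next field, which is the same interval between the swap and the link assignment that makes the protocol sensitive to process termination.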
The MCS-algorithm is vulnerable to process failure in that if a process terminates while waiting for the lock, then once it receives the lock it will never release it. Similarly, while a process is owning or releasing the lock, the death of the process will prevent ownership from being passed on. Particularly difficult is the window of time between an initial swap in the lock acquisition code and a subsequent assignment which links the new node into the queue by filling the next field of the predecessor process. If a process has died or terminated after executing the swap, and before setting the next field, then the queue will become "broken" at that point. Thus, in pathological cases involving multiple process failures occurring in this same window, the queue becomes fragmented into separate lists.
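The fragmentation described above may be reproduced with a small sequential simulation, given here for illustration only. The sketch replays enqueue operations one at a time, so no real atomic instruction is needed; a process that terminates inside the window is modeled by the hypothetical dies_before_link parameter, and the names LockTail, enqueue, reachable_from, and build_fragmented_queue are assumptions of this sketch.

```python
class QNode:
    """Queue node reduced to a name and a next link for this simulation."""

    def __init__(self, name):
        self.name = name
        self.next = None


class LockTail:
    """Lock tail variable; the swap is trivially atomic in a sequential replay."""

    def __init__(self):
        self.node = None

    def swap(self, new):
        old, self.node = self.node, new
        return old


def enqueue(tail, node, dies_before_link=False):
    pred = tail.swap(node)        # step 1: swap the new node into the lock tail
    if dies_before_link:
        return                    # process terminates inside the window
    if pred is not None:
        pred.next = node          # step 2: fill the predecessor's next field


def reachable_from(node):
    """Names of the nodes reachable by following next links from `node`."""
    names = []
    while node is not None:
        names.append(node.name)
        node = node.next
    return names


def build_fragmented_queue():
    """Enqueue A..E, with B and D terminating between the swap and the link."""
    tail = LockTail()
    nodes = {name: QNode(name) for name in "ABCDE"}
    enqueue(tail, nodes["A"])
    enqueue(tail, nodes["B"], dies_before_link=True)
    enqueue(tail, nodes["C"])
    enqueue(tail, nodes["D"], dies_before_link=True)
    enqueue(tail, nodes["E"])
    return tail, nodes


tail, nodes = build_fragmented_queue()
fragments = [reachable_from(nodes[name]) for name in ("A", "B", "D")]
```

After this replay, fragments holds three separate lists, namely A alone, B followed by C, and D followed by E, while the lock tail refers to E. Nothing reachable from the lock holder A leads to the remaining nodes.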
In attempting to recover this fragmented list, it is difficult to distinguish a next field which has not been set due to a terminated process from one which has not been set due to a very slow process.
In view of this drawback, there is a need to provide, in a multiprocessor system, a recoverable lock, i.e., one that does not become permanently unavailable even if one or more of the processes accessing the lock terminates. An implementation of a recoverable lock would afford the system the ability to determine when a process terminates and to make the lock available again. The capability of recovering a spinning lock is particularly useful for servers such as transaction processors which consist of several processes which are often in continuous operation. The ability to determine the process having exclusive access of a standard spinning lock in spite of any sequence of process failures is a key requirement for recoverability. If the process having exclusive access can be reliably determined, then the shared data guarded by the lock can be returned to use if that process has terminated. However, if that process cannot be determined, a lock held by a terminated process is not easily distinguishable from a lock held by a very slow process.
In view of this, it would be highly desirable to provide a recoverable spin lock protocol that maintains the integrity of the spinning lock queue structure should there be a failure of the process having exclusive access to the lock, or a termination or failure of one or more of the spinning processes.