The present invention relates to a locking system and method for use in a multi-node distributed clustering product.
Multi-processing systems are commonly configured in a cluster of related nodes to ensure high availability. A clustered system is a collection of processing elements that is capable of executing a parallel, cooperating application. Each processing element in a cluster is an independent functional unit, such as a symmetric multiprocessor server, which is coupled with the other cluster elements through one or more networks. One type of cluster system is described in U.S. Pat. No. 5,117,352 entitled “MECHANISM FOR FAIL-OVER NOTIFICATION” issued to Louis Falek on May 26, 1992 and assigned to Digital Equipment Corporation.
In a clustered environment, there is often a need for one node to provide backup upon failure of another node. For example, in a three-node cluster, an application may be in service on node A, with node B configured as the highest priority backup node. If node A crashes, then node B begins to bring the application in service automatically. If a system administrator simultaneously attempts to bring the application in service on node C, then there is the possibility of the application being brought into service on nodes B and C simultaneously.
To prevent the application from being brought into service simultaneously on two nodes, many multi-processing systems possess either a quorum device or some other mechanism to create a single, global cluster configuration database. For these systems, it is sufficient for each node to obtain a single lock on the central cluster configuration database itself. All updates to the cluster configuration are serialized, so all nodes in the cluster have the same view of the cluster configuration, ensuring that only one node will attempt to bring an application into service.
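The serialization described above can be illustrated with a minimal sketch. It is not taken from any particular product: the `ClusterConfig` class, its `try_bring_in_service` method, and the `race` helper are hypothetical names, and an in-process `threading.Lock` merely stands in for the single lock on a central cluster configuration database. Two nodes (B and C) race to bring the application into service, and the single lock ensures that exactly one of them succeeds.

```python
import threading


class ClusterConfig:
    """Toy stand-in for a single, global cluster configuration database.

    Hypothetical illustration only: a real system would keep this state on
    a quorum device or an equivalent shared store, not in process memory.
    """

    def __init__(self):
        self._lock = threading.Lock()   # the single lock on the database
        self.in_service_on = None       # node currently running the application

    def try_bring_in_service(self, node):
        # Every update is serialized through the one lock, so the check and
        # the update below cannot interleave with another node's attempt.
        with self._lock:
            if self.in_service_on is None:
                self.in_service_on = node
                return True
            return False


def race(nodes):
    """Let several nodes attempt to bring the application up at once."""
    config = ClusterConfig()
    results = {}

    def attempt(node):
        results[node] = config.try_bring_in_service(node)

    threads = [threading.Thread(target=attempt, args=(n,)) for n in nodes]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return config, results


# Node B (automatic failover) and node C (administrator request) race.
config, results = race(["B", "C"])
```

Which node wins depends on thread scheduling, but exactly one entry in `results` is `True` and `config.in_service_on` names that node.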
Other types of clustered systems, such as systems running LifeKeeper (trademark of NCR Corp., Dayton, Ohio), possess a distributed system for storing cluster configuration information. Accordingly, each node keeps its own view of the cluster configuration (e.g., which nodes are currently servicing an application, which nodes or communication paths are alive, etc.). Clustered systems possessing such a distributed system for storing cluster configuration information may use a distributed locking system to prevent two or more nodes from making changes to the cluster configuration simultaneously. U.S. Pat. No. 5,828,876, Fish et al., issued on Oct. 27, 1998, assigned to NCR Corporation and entitled “File System For A Clustered Processing System” describes a distributed system and is hereby incorporated by reference.
However, current distributed locking systems may allow a starvation problem, typical in distributed software, in which a thread is indefinitely prevented from acquiring a cluster-wide lock. The chance of a starvation problem occurring increases with the number of nodes in the cluster. Additionally, current distributed locking systems may fail to handle a time value in a unit smaller than a millisecond and may fail to take into account many configuration features of the clustered system.
Accordingly, there is a need for an improved distributed locking system and method which avoids the problems discussed above.
In accordance with the teachings of the present invention, an improved distributed locking system and method for a clustered system having a distributed system for storing cluster configuration information is provided. One aspect of the present invention allows a process or thread in a high availability solution to obtain a distributed lock on all relevant nodes in a clustered system. Another aspect of the present invention allows more than one thread to obtain a lock and perform a critical operation on different nodes concurrently.
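The two aspects summarized above can be sketched as follows. This is an illustration under stated assumptions, not the claimed implementation: the `Node` class and the `acquire_all` and `release_all` functions are hypothetical, the nodes are simulated in a single process, and the all-or-nothing acquisition with rollback is one common technique chosen for the example. A thread obtains the lock only if every relevant node grants it, and threads whose node sets do not overlap can hold their locks concurrently.

```python
import threading


class Node:
    """Hypothetical cluster node that tracks its own lock holder locally."""

    def __init__(self, name):
        self.name = name
        self._guard = threading.Lock()  # protects this node's local lock state
        self.holder = None              # owner currently holding the lock here

    def try_lock(self, owner):
        with self._guard:
            if self.holder is None:
                self.holder = owner
                return True
            return False

    def unlock(self, owner):
        with self._guard:
            if self.holder == owner:    # only the holder may release
                self.holder = None


def acquire_all(nodes, owner):
    """Lock every relevant node, or lock none of them.

    Nodes are visited in a fixed (name) order, and any partial acquisition
    is rolled back on failure. Both choices are assumptions of this sketch.
    """
    taken = []
    for node in sorted(nodes, key=lambda n: n.name):
        if node.try_lock(owner):
            taken.append(node)
        else:
            for held in reversed(taken):
                held.unlock(owner)
            return False
    return True


def release_all(nodes, owner):
    for node in nodes:
        node.unlock(owner)


a, b, c = Node("A"), Node("B"), Node("C")
got_t1 = acquire_all([a, b], "t1")     # locks nodes A and B
got_t2 = acquire_all([c], "t2")        # disjoint node set, held concurrently
got_t3 = acquire_all([b, c], "t3")     # overlaps both holders, so it is refused
release_all([a, b], "t1")
got_t3_retry = acquire_all([b], "t3")  # node B is free again
```

A caller that is refused would normally retry. Bounding or ordering those retries so that no thread waits indefinitely is the starvation concern raised earlier, and this sketch does not address it.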