Within a computer system, a cluster is a collection of processing elements that is capable of executing a parallel, cooperating application. Each processing element in a cluster is an independent functional unit, such as a symmetric multiprocessor server, which is coupled with the other cluster elements through one or more networks, e.g., LANs, WANs, I/O buses. One type of cluster system is described in The VAX/VMS Distributed Lock Manager, by W. E. Snaman, Jr. and D. W. Theil, published in Digital Technical Journal, September 1987, and in U.S. Pat. No. 5,117,352, entitled "MECHANISM FOR FAIL-OVER NOTIFICATION" issued to Louis Falek on May 26, 1992 and assigned to Digital Equipment Corporation.
A parallel cooperating application in the context of a cluster executes on multiple cluster nodes and processes a shared object such as a database. Such an application requires a lock manager to synchronize and coordinate its activities on the shared object. Specifically, the parallel application defines a set of locks, each of which controls a portion or portions of the shared object(s) that the parallel application will process. All parallel instances of the application agree on the interpretation of the set of locks as defined. When an instance of the parallel application needs to access, e.g., read or modify, a portion of the shared object, it must obtain a lock from the Lock Manager that grants it the access privileges relevant to its desired operation on that portion of the shared object. Since the set of locks needs to be accessible from within any of the instances, it must be a global entity, and the lock manager by definition needs to be a global or clusterwide resource.
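The locking discipline described above can be illustrated with a minimal sketch. It assumes only two access privileges, shared (read) and exclusive (modify), and models the clusterwide lock table as a single in-process dictionary. The names LockTable, request, and release are hypothetical and do not come from any particular lock manager.

```python
from collections import defaultdict

# Assumed access privileges; a real lock manager may define more modes.
SHARED, EXCLUSIVE = "shared", "exclusive"


class LockTable:
    """Toy, single-process stand-in for a clusterwide lock table."""

    def __init__(self):
        # lock name -> {application instance id: granted mode}
        self.holders = defaultdict(dict)

    def request(self, instance, name, mode):
        """Grant the lock if it is compatible with the other holders.

        Returns True when granted and False when refused. Queuing of
        refused requests is deliberately left out of this sketch.
        """
        others = {i: m for i, m in self.holders[name].items() if i != instance}
        if mode == SHARED:
            # Readers may coexist with other readers only.
            compatible = all(m == SHARED for m in others.values())
        else:
            # A writer requires that no other instance holds the lock.
            compatible = not others
        if compatible:
            self.holders[name][instance] = mode
        return compatible

    def release(self, instance, name):
        """Drop whatever lock the instance holds on the named portion."""
        self.holders[name].pop(instance, None)
```

In this sketch, two instances can hold a shared lock on the same portion of the database at once, while an exclusive request is refused until every other holder has released the lock.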
A typical example of such an application is the Oracle Parallel Server. A typical clustered system configuration running the Oracle Parallel Server application, using a Distributed Lock Manager, is depicted in FIG. 1. The system, as shown, includes multiple processor units 101 interconnected through a network 103, such as an Ethernet or Fiber Distributed Data Interface, and connected through a shared SCSI bus 105 to one or more database storage units 107.
The need for parallel applications on today's open systems arises from two basic requirements:
Increased throughput of the application, and
High availability of the application.
A clustered system must accordingly be designed such that no system element or component represents a single point of failure for the entire cluster. If the Lock Manager executed on only one node of the cluster, or on a piece of dedicated hardware, then a failure of that node or hardware would adversely affect all instances of the parallel application, since the application cannot operate without the services of a Lock Manager. If, on the other hand, the Lock Manager is distributed, then the surviving nodes can be designed to recover the lock database upon a node failure, allowing the instances of the parallel application on those nodes to continue processing.
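One way the surviving nodes might recover the lock database is sketched below. The sketch assumes that every node keeps a local record of the locks held by its own application instances, so that the survivors can pool those records after a failure. This is an illustrative assumption, not a recovery procedure taken from the text, and the function name rebuild_lock_database is hypothetical.

```python
def rebuild_lock_database(local_records, failed_node):
    """Rebuild the lock database from the records of surviving nodes.

    local_records maps a node id to {lock name: mode} for the locks held
    by application instances on that node (an assumed data layout).
    Locks held only by the failed node are dropped, since the instance
    that held them is gone.
    """
    rebuilt = {}
    for node, held in local_records.items():
        if node == failed_node:
            continue  # the failed node contributes nothing
        for name, mode in held.items():
            rebuilt.setdefault(name, {})[node] = mode
    return rebuilt
```

Because each survivor contributes only what it already knows locally, no single node's memory is required for the rebuild, which is the property that removes the single point of failure.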
A Distributed Lock Manager (DLM) should also be capable of scaling its throughput as nodes are added to the cluster. Since a DLM is not confined to a single node or a subset of nodes, it can take advantage of the additional processing power that each added node contributes to the cluster.
Further, a DLM should allow for even distribution of lock management overhead across each functional element of the cluster on which the parallel application is executing. This way, no single node or subset of nodes is disproportionately burdened with the responsibility of lock management.
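Even distribution of lock management overhead can be approximated by hashing each lock name to the node that manages it. The sketch below is one common way to do this, offered as an assumption rather than as the scheme of any particular DLM. The function name master_node and the node identifiers are hypothetical.

```python
import hashlib


def master_node(lock_name, nodes):
    """Pick the node responsible for managing the named lock.

    A cryptographic digest is used instead of Python's built-in hash()
    so that every node computes the same answer for the same name.
    """
    digest = hashlib.sha256(lock_name.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(nodes)
    # Sort so the result does not depend on the order nodes are listed in.
    return sorted(nodes)[index]
```

Every node can evaluate master_node locally, so no central directory is consulted to find a lock's manager. When a node joins or leaves, calling the function with the new node list redistributes lock management across the current membership, although this simple modulo scheme reassigns many locks at once.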