The present invention relates to managing overlapping requests for resources on a computer system using locks, and more particularly to techniques to speed reconfiguration of locks among nodes of a distributed lock manager (DLM).
Computer systems are used to process data and control devices. A computer system resource is a portion of memory or a device utilized by the computer system. When several processes running simultaneously on a computer system share a resource there may be contention for that shared resource during overlapping periods of time. In such a situation a computer system management process must accumulate requests for the resource and grant them as the resource becomes available to the requesting processes. Consequently, mechanisms have been developed which control access to resources.
For example, database servers are processes that use resources while executing database transactions. Even though resources may be shared between database servers, many resources may not be accessed in certain ways by more than one process at any given time. More specifically, resources such as data blocks of a storage medium or tables stored on a storage medium may be concurrently accessed in some ways (e.g. read) by multiple processes, but accessed in other ways (e.g. written to) by only one process at a time.
One mechanism for controlling access to resources is referred to as a lock. A lock is a data structure that indicates that a particular process has been granted certain rights with respect to a resource. There are many types of locks. Some types of locks may be shared on the same resource by many processes; while other types of locks prevent any other locks from being granted on the same resource.
The entity responsible for granting locks on resources is referred to as a lock manager. In a single node computer system, a lock manager will typically consist of one or more processes on the node. In a multiple-node system, such as a multi-processing machine or a local area network, a lock manager may include processes distributed over numerous nodes. A lock manager that includes components that reside on two or more nodes is referred to as a distributed lock manager (DLM).
FIG. 1 is a block diagram of a computer system 100. A computer system 100 typically includes at least one processor 104, an internal communications bus 102 and a fast but volatile main memory 106. More permanent storage is provided by a read only memory (ROM) 108 and one or more non-volatile storage devices 110. In modern distributed computer systems, the computer system 100 is connected via a network link 120 to a local network 122 and one or more other computer systems such as host 124. The computer system can also be connected to the internet 128 either directly or through an internet service provider (ISP) 126. Over the internet, the computer system 100 can communicate with one or more other computer systems such as server 130.
FIG. 2 is a block diagram of a multiple-node computer system 200 which utilizes a conventional distributed lock manager for a distributed database. Each node has stored therein a database server and a portion of a distributed lock management system 296. Specifically, the illustrated system includes four nodes 202, 212, 222 and 232 on which reside database servers 204, 214, 224 and 234, respectively, and lock manager units 206, 216, 226 and 236, respectively. Database servers 204, 214, 224 and 234 have access to the same database 260. The database 260 resides on a disk 250 that contains multiple blocks of data. Disk 250 generally represents one or more persistent storage devices which may be on any number of machines, including but not limited to the machines that contain nodes 202, 212, 222 and 232.
A communication mechanism 270 allows processes on nodes 202, 212, 222 and 232 to communicate with each other and with the disks that contain portions of database 260. The specific communication mechanism 270 between the nodes and disk 250 will vary based on the nature of system 200. For example, if the nodes 202, 212, 222 and 232 correspond to workstations on a network, the communication mechanism 270 will be different than if the nodes 202, 212, 222 and 232 correspond to clusters of processors and memory within a multi-processing machine.
Before any of database servers 204, 214, 224 and 234 can access a resource shared with the other database servers, it must obtain the appropriate lock on the resource from the distributed lock management system 296. The resource may be part of the database, like resource 261 which may be, for example, one or more blocks of disk 250 on which data from database 260 is stored. The resource may be on a particular piece of equipment 270. For example, the device resource 271 may be a print buffer on a printer or a scan register on a scanner.
Distributed lock management system 296 stores data structures, herein called resource locking objects (RLO), such as master RLO 208 and shadow RLO 209 on node 202, that indicate the locks held by database servers 204, 214, 224 and 234 on the resources shared by the database servers. If one database server requests a lock on a resource while another database server has a lock on the resource, the distributed lock management system 296 must determine whether the requested lock is consistent with the granted lock, i.e., can be granted simultaneously with the lock already granted, as in the case of two read locks on a block of storage currently residing in memory. If the requested lock is not consistent with the granted lock, such as when both are exclusive locks for the same resource, as is typical during writes to a database, then the requester must wait until the database server holding the granted lock releases the granted lock.
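The consistency test described above can be sketched as follows. This is a minimal illustration only, assuming just two lock modes (shared and exclusive); the names SHARED, EXCLUSIVE, is_consistent and can_grant are hypothetical and do not come from the described system, which may support additional lock types.

```python
# Hypothetical lock modes; the text names only read (shared) and exclusive locks.
SHARED = "shared"
EXCLUSIVE = "exclusive"


def is_consistent(requested, granted):
    """Return True if the requested lock can coexist with one granted lock.

    Assumption: two locks are consistent only when both are shared,
    as with two read locks on the same block.
    """
    return requested == SHARED and granted == SHARED


def can_grant(requested, granted_locks):
    """Return True if the request is consistent with every granted lock.

    With no granted locks, any request can be granted immediately.
    """
    return all(is_consistent(requested, g) for g in granted_locks)
```

Under these assumptions, a shared request succeeds while only shared locks are held, and any request that conflicts with a held lock must wait until that lock is released.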
According to one conventional approach, a lock management system 296 includes one lock manager unit for each node that contains a database server and maintains one master resource locking object (RLO) for every resource managed by the lock management system 296. The master RLO for a particular resource stores, among other things, an indication of all locks that have been granted on or requested for the particular resource. The master RLO for each resource resides with only one of the lock manager units 206, 216, 226 and 236. For example, the master RLO for resource 261 resides with only one of the lock manager units, such as master RLO 238 residing with lock manager unit 236.
A node is referred to as the "master node" (or simply "master") of the resources whose master RLOs are managed by the lock manager unit that resides on the node. In the above example, the master RLO 238 for resource 261 is managed by lock manager unit 236, so node 232 is the master of resource 261.
In typical systems, a hash function is employed to randomly select the particular node that acts as the master node for a given resource. For example, system 200 includes four nodes, and therefore may employ a hash function that produces four values: 0, 1, 2 and 3, or four ranges of values 0-5, 6-10, 11-15 and 16-20. Each value, or range, is associated with one of the four nodes. The node that will serve as the master for a particular resource in system 200 is determined by applying the hash function to the name of the resource. For example, using the hash value ranges, all resources that have names that hash to 0-5 are mastered on node 202; all resources that have names that hash to 6-10 are mastered on node 212; etc. In this example, the resource name of resource 261 supplied as input to a hash function produces a value, e.g., 17, in the range 16-20 and is thus mastered on node 232.
When a process on a node attempts to access a resource, the same hash function is applied to the name of the resource to determine the master of the resource, and a lock request is sent to the master node for that resource. The lock manager unit on the master node for the resource controls the allocation and release (or "de-allocation") of locks for the associated resource. The hashing technique described above tends to distribute the resource mastering responsibility evenly among existing nodes.
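The lookup described above can be sketched as follows, using the four hash value ranges and node numerals from the example. The choice of CRC-32 reduced modulo 21 is an assumption made only so that the sketch is deterministic and produces values in 0-20; the text does not specify the hash function, and the names HASH_RANGES, hash_value, master_for_value and master_node are hypothetical.

```python
import zlib

# Hash value ranges from the example, each associated with one master node.
HASH_RANGES = [
    ((0, 5), 202),
    ((6, 10), 212),
    ((11, 15), 222),
    ((16, 20), 232),
]


def hash_value(resource_name):
    """Map a resource name to a value in 0..20 (assumed hash: CRC-32 mod 21)."""
    return zlib.crc32(resource_name.encode("utf-8")) % 21


def master_for_value(value):
    """Return the node associated with the range that contains the hash value."""
    for (low, high), node in HASH_RANGES:
        if low <= value <= high:
            return node
    raise ValueError("hash value %d falls outside every range" % value)


def master_node(resource_name):
    """Return the master node to which a lock request for the resource is sent."""
    return master_for_value(hash_value(resource_name))
```

Because every node applies the same function to the same name, each node arrives at the same master without consulting the others. A hash value of 17, as in the example for resource 261, falls in the range 16-20 and therefore selects node 232.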
In networked computer systems, some or all of the processes that are holding and requesting locks on a particular resource may be on different nodes than the master node that contains the resource locking object that corresponds to the resource. For example, the process desiring a lock and the lock resource may reside within different nodes of a multi-processor machine, or on different workstations in a local area network. Consequently, all of the messages that pass between the lock-requesting processes and the lock manager unit must be transmitted between nodes over the network. The computational power that must be expended to facilitate such inter-node messages is significant relative to the power required for intra-node communication. In addition, inter-node communication is generally slower than intra-node communication. Further, the inter-node traffic thus generated reduces the throughput available for other types of inter-node traffic, which reduction may be significant when the inter-node traffic is between workstations on a network.
In a related patent application, U.S. Ser. No. 08/669,689, DLM message traffic between nodes is reduced by introducing shadow RLOs 209, 219, 229 and 239 on the four nodes 202, 212, 222 and 232, respectively. One or more shadow RLOs for any given resource may be spread over one or more nodes, effectively turning the master resource locking object (MRLO) into a distributed locking object. For example, resource 261, which has a master RLO 238 on node 232, has shadow RLOs 209, 219 and 229 on nodes 202, 212 and 222, respectively, to handle lock requests for resource 261 by the corresponding database servers on those same nodes. Each of the nodes that has a shadow RLO may be used to perform lock operations at that node related to the resource associated with the shadow RLO. For example, node 202 can be used to perform lock operations on node 202 related to resource 261 using shadow RLO 209, even though the master RLO for resource 261 is master RLO 238 on node 232. The shadow RLO must communicate with the master RLO over the communication mechanism 270, but this communication can be managed according to the above patent application to minimize traffic. Besides reducing message traffic among nodes, by distributing the processing load required to perform lock management for the resource among the several shadow RLOs, this processing load is less likely to overburden the master node than in lock management systems in which all lock operations for a resource must be performed at the single master node. Without shadow RLOs, the master of several popular resources can be overburdened.
If a node leaves the system, the system is reconfigured to reflect the current cluster of available active nodes. However, the hash function assigning resources to master nodes becomes obsolete when the number of nodes changes. For example, if node 232 leaves the system, resources that hash to the hash value range 16-20 have no node available to serve as master. In a process called "conventional re-mastering," a new hash function is employed which maps resource names to master nodes using only the available nodes, and all global resource information from all the nodes that still have open locks for the resources mastered by the departing nodes must be transmitted to the new master or masters. The DLM process of changing the resource-to-master node assignments is herein referred to as "re-mapping." The DLM process including both the re-mapping and the resulting message traffic transferring lock information is referred to herein as "re-mastering." The process of removing nodes from the system is referred to herein as "reconfiguring" the system; and it involves many steps in addition to re-mastering by the DLM.
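The effect of replacing the hash function when a node departs can be sketched as follows. This is an illustration under stated assumptions, not the conventional system's actual function: the master is taken to be the CRC-32 of the resource name modulo the number of active nodes, and the names master_node and remap are hypothetical.

```python
import zlib


def master_node(resource_name, active_nodes):
    """Assumed hash: CRC-32 of the name, reduced modulo the number of active nodes."""
    h = zlib.crc32(resource_name.encode("utf-8"))
    return active_nodes[h % len(active_nodes)]


def remap(resource_names, old_nodes, new_nodes):
    """Return {name: (old_master, new_master)} for resources whose master changes.

    Each entry implies message traffic, because lock information for that
    resource must be sent to its new master.
    """
    moved = {}
    for name in resource_names:
        old = master_node(name, old_nodes)
        new = master_node(name, new_nodes)
        if old != new:
            moved[name] = (old, new)
    return moved
```

For example, calling remap with old_nodes [202, 212, 222, 232] and new_nodes [202, 212, 222] reports every resource formerly mastered on node 232. With a modulo-based function such as the one assumed here, it may also report resources whose old master is still active, which adds to the re-mastering traffic.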
While the conventional DLM systems described above have advantages, they still have some significant drawbacks. For example, the message traffic associated with sending all global information for all resources with open locks to the new masters can significantly impair system performance. Experience with conventional re-mastering shows that it can occupy more than fifty percent of the total DLM reconfiguration time.
As an additional disadvantage, to ensure that locks are properly granted, the conventional system suspends all lock operations during reconfiguration until all resources have new masters assigned. The suspension of lock operations temporarily halts some database functions and adversely affects database performance. The suspension of lock operations is called herein "freezing" lock requests.
Another disadvantage of the conventional system is that hash functions tend to distribute mastering tasks evenly over available nodes, but other considerations may make a non-uniform distribution of master RLO across the nodes optimal. The conventional system does not provide a means to achieve the non-uniform optimal distribution of master RLOs.
As one example of non-uniform but optimal distribution of master RLOs, one node may be used as standby to act when another node goes down; such a node should not serve as a master until the other node fails. As another example, one node may have more processing power than other nodes and can handle more master RLOs than the other nodes. In still another example, one node may experience better performance when serving as a master node than another node serving as master for particular resources. The first node is said to have lock affinity for those particular resources.
As another example of non-uniform but optimal distribution of master RLOs, a node may actually open more locks than expected from an even distribution of locks. Such excessive use of open locks may put the node in danger of exceeding the memory originally allocated for the RLOs. As a master node of a resource, the node needs to allocate one RLO and a number, M+N, of lock structures associated with the RLO, where N is the number of local locks and M is the number of other nodes which have open locks on the resource. For example, if node 232 is the master of resource 261, node 232 must allocate memory for one Master RLO for resource 261; and, if node 232 has 5 locks on resource 261 and if all the other nodes have open locks on resource 261, then node 232 must also allocate memory for 8 open locks. If another node is made new master of this resource, this old master node can free the memory used by M lock structures. In the above example, the node can free the memory consumed by 3 lock structures.
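The memory accounting in the preceding example can be written out as a short calculation. The function names below are hypothetical; the quantities N (local locks) and M (other nodes with open locks on the resource) are those defined above.

```python
def master_lock_structures(local_locks, other_nodes_with_locks):
    """Lock structures a master allocates for one resource: M + N."""
    return other_nodes_with_locks + local_locks


def freed_when_remastered(other_nodes_with_locks):
    """Lock structures the old master can free once another node is master: M."""
    return other_nodes_with_locks


# Node 232 holds N = 5 local locks on resource 261, and the M = 3 other
# nodes (202, 212 and 222) all have open locks on it.
n_local = 5
m_other = 3
allocated = master_lock_structures(n_local, m_other)  # 8 lock structures
freed = freed_when_remastered(m_other)                # 3 lock structures
```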
If a node joins the system, the conventional system is not automatically reconfigured to reflect the additional available nodes. The problem with this arrangement is that masters tend to accumulate on the original nodes and the full benefits of distributed processing are not achieved. When the system is eventually reconfigured, for example when one of the original nodes goes down, a great deal of message traffic must be passed to move data from the old master RLOs to the new master RLOs on both the added nodes and the original nodes, further degrading performance during reconfiguration.
What is needed is an improved DLM which can be reconfigured more quickly and flexibly than reconfiguration using the conventional DLM and without freezing all lock requests.
Techniques are provided for improving DLM performance and decreasing the time required for reconfiguration by spreading re-mastering tasks over one or more re-mastering events.
According to one aspect of the invention, techniques for optimizing a distributed lock manager (DLM) over a cluster of one or more active nodes for management of locks on shared resources include a system hash map initialized to establish a mapping between a plurality of hash value ranges and one or more master nodes. The management of the locks is initially distributed based on that mapping. The cluster is monitored to gather data during a time interval, including data that identifies how much resource usage is made of resources hashed to each hash value range of the plurality of hash value ranges. It is determined whether a re-mastering event condition is satisfied based on one or more factors. The factors include the resource usage. If the re-mastering event condition is satisfied, a re-mastering event is performed. A re-mastering event includes re-mapping the system hash map by replacing data that maps a replacement range set to an old set of corresponding master nodes with data that maps that replacement range set to a new set of corresponding master nodes. The replacement range set includes one or more hash value ranges of the plurality of hash value ranges. Lock information is transferred from the old set of one or more master nodes to the new set of one or more master nodes.
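The steps of this aspect can be sketched as follows. The sketch assumes a system hash map held as a dictionary from hash value range index to master node, and a re-mastering event condition that fires when one master carries more than a chosen share of the observed resource usage. The condition, the max_share threshold and all function names are hypothetical; the text leaves the exact condition open.

```python
def usage_per_node(hash_map, usage):
    """Sum the monitored usage of each hash value range onto its master node."""
    totals = {}
    for rng, node in hash_map.items():
        totals[node] = totals.get(node, 0) + usage.get(rng, 0)
    return totals


def condition_satisfied(hash_map, usage, max_share=0.5):
    """Assumed condition: some master carries more than max_share of all usage."""
    totals = usage_per_node(hash_map, usage)
    grand_total = sum(totals.values())
    return grand_total > 0 and max(totals.values()) / grand_total > max_share


def remaster_event(hash_map, replacement):
    """Re-map the replacement range set and list the lock transfers it implies.

    replacement maps each range in the replacement range set to its new master.
    Returns (new_map, transfers); each transfer is (range, old_master, new_master).
    The input map is left unmodified.
    """
    new_map = dict(hash_map)
    transfers = []
    for rng, new_master in replacement.items():
        old_master = hash_map[rng]
        if old_master != new_master:
            new_map[rng] = new_master
            transfers.append((rng, old_master, new_master))
    return new_map, transfers
```

Only the ranges named in the replacement range set change masters, so lock information moves for those ranges alone; this is how re-mastering work can be spread over more than one event.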
According to another aspect of the invention, techniques for distributing, over a cluster of one or more active nodes, management of locks on shared resources include setting a weight for each node that may be included in the cluster. A system hash map is initialized that establishes a mapping between a plurality of hash value ranges and one or more master nodes. Management of said locks is initially distributed based on that mapping. The cluster is monitored to gather data during a time interval, including data that identifies a number of hash value ranges of the plurality of hash value ranges that are mapped to each master node of the one or more master nodes during the time interval. It is determined whether a re-mastering event condition is satisfied based on one or more factors. The factors include a first weight associated with a first active node of the cluster and the number of hash value ranges mapped to the first active node. If the re-mastering event condition is satisfied, a re-mastering event is performed. A re-mastering event includes re-mapping the system hash map by replacing data that maps a replacement range set to an old set of corresponding master nodes with data that maps the replacement range set to a new set of corresponding master nodes. The replacement range set includes one or more hash value ranges of the plurality of hash value ranges. Then lock information is transferred from the old set of one or more master nodes to the new set of one or more master nodes.
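One way to picture a weight-based condition is sketched below. It assumes that each node's target number of hash value ranges is proportional to its weight and that the condition is satisfied when any node's actual count differs from its target by more than a tolerance. The proportional rule, the tolerance parameter and the function names are assumptions made for illustration and are not stated in the text.

```python
def target_range_counts(weights, total_ranges):
    """Target number of ranges per node, proportional to its weight (assumed rule)."""
    total_weight = sum(weights.values())
    if total_weight == 0:
        return {node: 0.0 for node in weights}
    return {node: total_ranges * w / total_weight for node, w in weights.items()}


def ranges_per_node(hash_map):
    """Count how many hash value ranges are currently mapped to each master."""
    counts = {}
    for node in hash_map.values():
        counts[node] = counts.get(node, 0) + 1
    return counts


def condition_satisfied(hash_map, weights, tolerance=0.5):
    """Assumed condition: some node's range count is off target by more than tolerance."""
    targets = target_range_counts(weights, len(hash_map))
    counts = ranges_per_node(hash_map)
    return any(
        abs(counts.get(node, 0) - targets[node]) > tolerance for node in weights
    )
```

Under these assumptions, a standby node can be given a weight of 0 so that any range mapped to it triggers a re-mastering event, and a more powerful node can be given a larger weight so that it masters proportionally more ranges.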