The present invention relates to computer systems and, more specifically, to techniques for mastering resources within computer systems.
Database servers use resources while executing transactions. Even though resources may be shared between database servers, many resources may not be accessed in certain ways by more than one process at any given time. For example, resources such as data blocks of a storage medium or tables stored on a storage medium may be concurrently accessed in some ways (e.g. read) by multiple processes, but accessed in other ways (e.g. written to) by only one process at a time. Consequently, mechanisms have been developed which control access to resources.
One such mechanism is referred to as a lock. A lock is a data structure that indicates that a particular process has been granted certain rights with respect to a resource. There are many types of locks. Some types of locks may be shared on the same resource by many processes, while other types of locks prevent any other locks from being granted on the same resource.
The entity responsible for granting locks on resources is referred to as a lock manager. In a single node database system, a lock manager will typically consist of one or more processes on the node. In a multiple-node system, such as a multi-processing machine or a local area network, a lock manager may include processes distributed over numerous nodes. A lock manager that includes components that reside on two or more nodes is referred to as a distributed lock manager.
FIG. 1 is a block diagram of a multiple-node computer system 100. Each node has stored therein a database server and a portion of a distributed lock management system 132. Specifically, the illustrated system includes three nodes 102, 112 and 122 on which reside database servers 104, 114 and 124, respectively, and lock manager units 106, 116 and 126, respectively. Database servers 104, 114 and 124 have access to the same database 120. The database 120 resides on a disk 118 that contains multiple blocks of data. Disk 118 generally represents one or more persistent storage devices which may be on any number of machines, including but not limited to the machines that contain nodes 102, 112 and 122.
A communication mechanism allows processes on nodes 102, 112, and 122 to communicate with each other and with the disks that contain portions of database 120. The specific communication mechanism between the nodes and disk 118 will vary based on the nature of system 100. For example, if the nodes 102, 112 and 122 correspond to workstations on a network, the communication mechanism will be different than if the nodes 102, 112 and 122 correspond to clusters of processors and memory within a multiprocessing machine.
Before any of database servers 104, 114 and 124 can access a resource shared with the other database servers, it must obtain the appropriate lock on the resource from the distributed lock management system 132. Such a resource may be, for example, one or more blocks of disk 118 on which data from database 120 is stored.
Lock management system 132 stores data structures that indicate the locks held by database servers 104, 114 and 124 on the resources shared by the database servers. If one database server requests a lock on a resource while another database server has a lock on the resource, the distributed lock management system 132 must determine whether the requested lock is consistent with the granted lock. If the requested lock is not consistent with the granted lock, then the requester must wait until the database server holding the granted lock releases the granted lock.
According to one approach, lock management system 132 maintains one master resource object for every resource managed by lock management system 132, and includes one lock manager unit for each node that contains a database server. The master resource object for a particular resource stores, among other things, an indication of all locks that have been granted on or requested for the particular resource. The master resource object for each resource resides within only one of the lock manager units 106, 116 and 126.
The node on which a lock manager unit resides is referred to as the xe2x80x9cmaster nodexe2x80x9d (or simply xe2x80x9cmasterxe2x80x9d) of the resources whose master resource objects are managed by that lock manager unit. Thus, if the master resource object for a resource R1 is managed by lock manager unit 106, then node 102 is the master of resource R1.
In typical systems, a hash function is employed to select the particular node that acts as the master node for a given resource. Specifically, a hash function is applied to the resource name to produce a value. All of the resource names that hash to the same value belong to the same xe2x80x9cbucketxe2x80x9d. Each node is then assigned to be the master for all resources whose names belong to a given bucket.
For example, system 100 includes three nodes, and therefore may employ a 3-bucket hash function that produces the values: 0, 1 and 2. Each bucket is associated with one of the three nodes. The node that will serve as the master for a particular resource in system 100 is determined by applying the hash function to the name of the resource. All resources that have names that hash to the bucket associated with the value 0 are mastered on node 102. All resources that have names that hash to the bucket associated with the value 1 are mastered on node 112. All resources that have names that hash to the bucket associated with the value 2 are mastered on node 122.
When a process on a node wishes to access a resource, a hash function is applied to the name of the resource to determine the master of the resource, and a lock request is sent to the master node for that resource. The lock manager on the master node for the resource controls the allocation and deallocation of locks for the associated resource.
When the master node of a resource grants a lock on the resource to a process running on another node, the other node maintains information about the lock that its process holds on the resource. The lock information that is maintained by non-master nodes that are interested in (i.e. hold a lock on) a resource may be used during recovery in the event that the master node fails. The data structure used by a node that is not the master of a resource to track the locks on the resource that are held by local processes is referred to as a shadow resource object. For every master resource object, up to N-1 shadow resource objects may exist (where N is equal to the number of nodes in the system), since is it possible for processes on all non-master nodes to simultaneously hold non-exclusive locks on the same resource.
Changing the master of a lock resource from one node to another is referred to as xe2x80x9cremasteringxe2x80x9d the lock resource. The process of remastering a resource typically involves reconstructing a master resource object for the resource on a new master. While a resource is being remastered, the resource is generally unavailable.
At any point in time, the responsibility for mastering resources is distributed between a specific set of nodes. When that set of nodes changes, and xe2x80x9cepoch changexe2x80x9d is said to occur.
A variety of events may make it desirable or necessary to perform remastering operations. For example, when nodes fail or are shut down, then mastery of the resources currently mastered by those nodes must be shifted to other nodes. Similarly, when new nodes are added to the system, it may be desirable to shift mastery of some of the resources to the newly added nodes, thereby more evenly distributing the load associated with resource management among all of the nodes.
When resource name hashing is used to assign resources to nodes, remastering the resources in response to an epoch change typically involves changing the hash function, and then redistributing mastery responsibilities based on the resource-to-master mapping produced by the new hash function. Specifically, when the system consists of N nodes, an N-bucket hash function (a hash function that produces N values) is used to perform the master node assignment. As N changes (due to the addition, removal, or failure of nodes), so must the hash function. After a new hash function for the resource-to-master mapping has been selected, the master resource objects must be moved to their new masters. The new master for any given resource is determined by applying the new hash function to the name of the resource.
Unfortunately, the remastering that occurs in response to the adoption of a new hash function can potentially involve remastering every resource in the system. Consequently, the entire system may effectively be rendered unavailable until the remastering is completed. This is true even if the change in the system configuration is relatively small. For example, if a fifty first node is added to a fifty node system, it is possible that all resources in the system will be assigned new masters by the new hash function selected to load balance between the fifty-one nodes. A similarly drastic remastering operation may result when one node on a fifty nodes fails.
Based on the foregoing, it is clearly desirable to provide a remastering technique that reduces the remastering overhead associated with an epoch change in a multiple node system. It is further desirable to provide a remastering technique that allows a relatively balanced load distribution to be maintained after epoch changes, both as new nodes are added to the system and as nodes are removed from the system. It is also desirable to provide a remastering technique that does not effectively render the entire system unavailable for the duration of the remastering operation.
A method and system are provided for reconfiguring a multiple node system after an epoch change in a manner that reduces the overhead and system unavailability typically incurred during reconfiguration. According to one aspect of the invention, a resource-to-master mapping is established using the combination of a resource-to-bucket hash function and a bucket-to-node hash function.
According to one aspect of the invention, the resource-to-bucket hash function is not changed in response to an epoch change, while the bucket-to-node hash function does change in response to epoch changes. Consequently, the resource-to-bucket hash function is xe2x80x9cstaticxe2x80x9d, while the bucket-to-node hash function is xe2x80x9cdynamicxe2x80x9d.
Preferably, the dynamic bucket-to-node hash function is adjusted after an epoch change in a manner that load balances among the new number of nodes in the system. Further, the changes to the bucket-to-node assignments are performed in a way that minimizes the number of resources that have to be remastered.
According to another aspect of the invention, only those resources that lose their masters during an epoch change are assigned new masters during the initial reconfiguration. Load balancing is then gradually achieved by migrating resources after the system has been made available. The old masters of resources forward access requests to new masters of resources once they have transferred the master resource objects for the requested resources. In addition, techniques are disclosed for migrating resources from a node in anticipation of a planned shutdown of the node.