Most computer systems manage resources. The nature of the resources managed by a computer system may vary from system to system. For example, in database systems, the resources managed by the system may include tables, rows, and disk blocks. In file systems, the resources managed by the system may be files and folders.
Often, it is desirable to maintain resource information about resources that are managed by a system. Just as the nature of resources may vary from system to system, so may the type of resource information that is maintained about the resources. For example, in a database system, it is often necessary to regulate access to shared resources. Thus, such systems typically maintain resource information that indicates what locks have been requested and/or granted on shared resources. In other systems, the resource information may simply be values that indicate some information about the resources.
Systems that maintain information about resources typically include access structures for efficiently retrieving the resource information. Even when the resource information is stored in volatile memory, the absence of such access structures may result in unacceptably long delays, especially when accessing the resource information is in the critical path of an operation. Various types of access structures, including hash tables, b-tree indexes, and name-value lookup directories, may be used for this purpose. The term “resource index” shall be used herein to generally refer to any type of structure or mechanism used for accessing such resource information.
For increased efficiency, resource indexes (and the resource information itself) may be maintained in volatile memory so that the retrieval of the resource information does not incur the relatively long delays associated with accessing non-volatile storage. However, the techniques described hereafter may be equally applied in systems where the resource indexes and/or the resource information are wholly or partially stored in non-volatile storage.
In multiple-node systems, it is common to distribute the responsibility of maintaining the resource information among the various nodes of the system. For example, each node of a five-node system may be responsible for managing the resource information for 20% of the resources used by the system. The node that maintains the resource information for a specific resource is referred to as the “master” of that specific resource. Each node will typically maintain its own volatile resource index to efficiently access the resource information for the resources that the node masters.
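The distribution of mastership described above can be illustrated with a minimal sketch. The node names, the CRC32-based assignment policy, and the functions `master_of` and `build_indexes` are assumptions made for illustration only; an actual system may assign masters in any number of other ways.

```python
import zlib

# Hypothetical five-node system; the node names are illustrative only.
NODES = ["node0", "node1", "node2", "node3", "node4"]

def master_of(resource_name, nodes=NODES):
    """Choose a master by hashing the resource name (an assumed policy)."""
    return nodes[zlib.crc32(resource_name.encode()) % len(nodes)]

def build_indexes(resources, nodes=NODES):
    """Return one dict-based resource index per node.

    Each node's index holds the resource information only for the
    resources that the node masters.
    """
    indexes = {node: {} for node in nodes}
    for name, info in resources.items():
        indexes[master_of(name, nodes)][name] = info
    return indexes
```

In this sketch, a lookup for a given resource first computes `master_of(name)` and then consults only that node's index, so no node needs to hold information about resources it does not master.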
Even after it has been determined which nodes will master which resources, it may be desirable to change the resource-to-master assignments. An operation that changes the resource-to-master assignments is referred to as a “remastering” operation. Remastering may be necessitated for any number of reasons. One such reason, for example, is to ensure that the master node for a set of resources is the node that has access affinity for that set of resource names or keys.
One problem presented by remastering operations is what to do about the resource information and the resource indexes that are used to access it. Typically, both the resource indexes and the global resource information that the resource indexes are used to access must be rebuilt as part of the remastering operation. One approach would be to completely stop or freeze accesses (both reads and writes) to the resource indexes at the start of the remastering operation. After the remastering operation, the existing resource indexes can be deleted, and each resource index can be rebuilt based on information that is available in each node. For example, if the resource information is a locking data structure, then for each resource, each node would send the lock mode that it holds on the resource to the new master node for that resource, so that the new master can rebuild the global resource information. If the resource information is a name-value directory entry for a resource, each node would send the name-value pair to the new master node for the given resource. One disadvantage of this approach is that accesses to the resource index are blocked until the entire index is rebuilt.
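The rebuild step of the freeze-everything approach, for the case where the resource information is a locking data structure, might be sketched as follows. The data layout and the function `rebuild_all` are hypothetical, and message passing between nodes is replaced here by a simple in-memory loop.

```python
from collections import defaultdict

def rebuild_all(local_locks, new_master_of):
    """Rebuild global lock information under the new master assignments.

    local_locks:   {holder_node: {resource: lock_mode}}, the lock modes
                   that each node knows it holds (assumed layout).
    new_master_of: callable mapping a resource to its new master node.
    Returns {master_node: {resource: {holder_node: lock_mode}}}.
    """
    new_indexes = defaultdict(lambda: defaultdict(dict))
    for holder, locks in local_locks.items():
        for resource, mode in locks.items():
            # Stands in for "send the lock mode to the new master".
            master = new_master_of(resource)
            new_indexes[master][resource][holder] = mode
    return {master: dict(index) for master, index in new_indexes.items()}
```

Because the loop covers every resource before any new index is usable, the sketch also shows why accesses remain blocked for the full duration of the rebuild.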
Another approach, referred to herein as the “window-based approach”, involves dividing the resources into “windows”. The windows may correspond, for example, to different ranges of resource names. Once the windows have been established, the remastering may be performed one window at a time. At any given time, the only portion of a resource index that needs to be locked is the portion that is associated with the window of resources that is currently being remastered. Each resource index is then rebuilt one “window” at a time. The window-based approach is described in the Window-based Remastering Application.
The window-based approach described in the Window-based Remastering Application works in the following two cases:
CASE 1: the resource indexes are hash indexes, the hash index on each node uses the same hash function, and the hash tables in each node are the same size.
CASE 2: the resource indexes are hash indexes, the hash index on each node uses the same hash function, and the resource hash tables are of different sizes, but the hash table sizes are multiples of one another.
As an example of how remastering is performed in case 1, assume that there are 100 hash buckets and that the remastering is going to be performed using 5 windows. In the first window, the system freezes accesses to all resources that hash to buckets 1 . . . 20, and rebuilds this part of the hash table. In the second window, the system freezes accesses to all resources that hash to buckets 21 . . . 40, and so on. Because each node uses the same hash function, any node that has information pertaining to a resource will send the information for the resource in the same window. For example, if the resource information is a locking data structure, two nodes that have a read lock on a given resource will resend the information regarding the read lock to the new master in the same window (the window to which the resource belongs). The old master will also have frozen accesses to the old resource in this window, and will correctly delete the resource.
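The partitioning of buckets into windows in case 1 can be sketched as shown below. The sketch assumes 0-based bucket numbers and half-open ranges, which differs from the numbering used in the prose, and assumes CRC32 as the shared hash function; `bucket_of`, `windows`, and `window_of` are hypothetical helper names.

```python
import zlib

def bucket_of(resource_name, num_buckets):
    """Shared hash function: a hash value modulo the number of buckets."""
    return zlib.crc32(resource_name.encode()) % num_buckets

def windows(num_buckets, num_windows):
    """Split buckets 0..num_buckets-1 into contiguous, non-overlapping
    ranges. Yields at most num_windows ranges."""
    size = -(-num_buckets // num_windows)  # ceiling division
    return [range(start, min(start + size, num_buckets))
            for start in range(0, num_buckets, size)]

def window_of(resource_name, num_buckets, num_windows):
    """Index of the window during which this resource is remastered."""
    size = -(-num_buckets // num_windows)
    return bucket_of(resource_name, num_buckets) // size
```

Since every node evaluates the same `bucket_of` against the same table size, every node computes the same `window_of` for a given resource, which is the property that lets all holders resend their information in the same window.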
As an example of how remastering is performed in case 2, assume that the resource hash tables are of different sizes in each node, but that the sizes of the hash tables are multiples of one another. In this scenario, the node with the smallest hash table chooses the boundaries of the window, i.e. the start and end bucket numbers. Each window is constrained to be a contiguous sequence of buckets. Using the example above, if a node has 200 hash buckets and the smallest node has 100 hash buckets, then when the smallest node sets the window to be buckets 21 . . . 40 in its hash table, the node with 200 hash buckets would consider buckets 21 . . . 40 and buckets 121 . . . 140 to be in the window. Because the same hash function is used (i.e. a hash value modulo the number of hash table buckets), a resource that hashes to a bucket number between 21 and 40 in the node that has 100 hash buckets is guaranteed to hash to a bucket number either between 21 and 40 or between 121 and 140 in the node that has 200 hash buckets.
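The guarantee relied upon in case 2 follows from modular arithmetic: if a table size is a multiple of the smallest table size, then reducing a bucket number in the larger table modulo the smallest size yields the bucket number in the smallest table. A minimal sketch, assuming inclusive window bounds and a hypothetical helper named `buckets_in_window`, is:

```python
def buckets_in_window(first, last, smallest_size, local_size):
    """Buckets of a local hash table that belong to the window
    [first, last] (inclusive) chosen on the smallest hash table.

    Valid only when local_size is a multiple of smallest_size and both
    tables use bucket = hash_value % table_size.
    """
    if local_size % smallest_size != 0:
        raise ValueError("local_size must be a multiple of smallest_size")
    return [b for b in range(local_size)
            if first <= b % smallest_size <= last]
```

For a 200-bucket table and a window chosen on a 100-bucket table, the function returns the window's own bucket numbers plus the same numbers offset by 100. If the sizes are not multiples of one another, no such fixed set of local buckets exists, which is why the sketch rejects that input.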
While the approach set forth in the Window-based Remastering Application works well in the two cases described above, the constraints imposed by the approach reduce its usefulness. For example, it may be desirable for one or more of the resource indexes to be a b-tree index rather than a hash index. Even when hash indexes are used, it may be desirable to select the size of the hash table of each node based on the memory available in the node, without being restricted by the size of the hash tables used by the other nodes.
The approaches described in this section are approaches that could be pursued, but not necessarily approaches that have been previously conceived or pursued. Therefore, unless otherwise indicated, it should not be assumed that any of the approaches described in this section qualify as prior art merely by virtue of their inclusion in this section.