A “distributed system” is a system that includes multiple processing entities. Such processing entities are referred to herein as “nodes”. The nodes of a distributed system may be, for example, individual computers or processors within multi-processor computers.
A “cluster” is a distributed system that results from grouping computing resources together in such a way that they behave like a single resource. Clustering is often used for purposes of parallel processing, load balancing, and fault tolerance. One common example of a cluster is a set of computers, or “nodes,” that are configured so that they behave like a single computer.
Each computer in the cluster has shared access to a set of resources. A resource is, generally, any item that can be shared by the computers in the cluster. A resource may also be referred to as an item or object. A common example of a resource is a block of memory in which information is stored. The block of memory may be part of a node in the cluster or may be external to the cluster, such as a database block.
One example of a cluster is a database cluster. A database cluster comprises multiple nodes, each of which executes an instance of a database server that facilitates access to a shared database. Among other functions of database management, a database server governs and facilitates access to a particular database by processing requests by clients to access data in the database.
Typically, resources are assigned to master nodes, where each master node coordinates access to the resources assigned to it. A master node has a global view of the state of the shared resources that it masters at any given time and acts as a coordinator for access to the shared resource. For example, a master node coordinates and is aware of which node is currently granted a lock on the shared resource (and what type of lock) and which nodes are queued to obtain a lock on the shared resource. Typically, the master node's global view of the status of a shared resource is embodied in metadata associated with the resource.
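By way of illustration only, the per-resource metadata that a master node keeps might be sketched as follows. The class name ResourceMetadata, its fields, and its methods are hypothetical and are not taken from any particular system. The sketch also assumes, for simplicity, that only one node holds a lock at a time; it records the lock type but does not model compatibility between shared locks.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Deque, Optional, Tuple


@dataclass
class ResourceMetadata:
    """Hypothetical view a master node keeps for one shared resource."""
    holder: Optional[int] = None   # node currently granted the lock, if any
    mode: Optional[str] = None     # type of lock held, e.g. "shared" or "exclusive"
    waiters: Deque[Tuple[int, str]] = field(default_factory=deque)  # queued (node, mode)

    def request(self, node: int, mode: str) -> bool:
        """Grant the lock if it is free; otherwise queue the requester.

        Returns True if the lock was granted immediately.
        """
        if self.holder is None:
            self.holder, self.mode = node, mode
            return True
        self.waiters.append((node, mode))
        return False

    def release(self) -> None:
        """Release the lock and hand it to the next queued node, if any."""
        if self.waiters:
            self.holder, self.mode = self.waiters.popleft()
        else:
            self.holder, self.mode = None, None


meta = ResourceMetadata()
granted_first = meta.request(3, "exclusive")   # lock is free, so node 3 gets it
granted_second = meta.request(5, "shared")     # node 5 must wait behind node 3
```

In this sketch, the master node alone mutates the metadata, which is what gives it the global view of lock holders and queued requesters described above.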
Clusters employing master nodes to coordinate resource sharing are sometimes described as distributed namespaces. The master nodes are said to manage a namespace of resources. The namespace describes various aspects of the resources within the cluster, such as the location and lock status of a resource (i.e. the metadata associated with each resource). Because different parts of this namespace are maintained on different master nodes, the namespace is said to be distributed across master nodes. The master nodes may therefore be considered namespace nodes. Although the concepts discussed herein will be described in terms of clusters and master nodes, it should be clear that the concepts apply equally to distributed namespaces.
Each shared resource is mapped to one master node. Various mechanisms may be used to establish the resource-to-master mapping. Techniques for using hash tables to establish the resource-to-master mapping are described in detail, for example, in U.S. Pat. No. 6,363,396. Commonly, mechanisms for establishing a resource-to-master mapping will be dependent on the number (N) of currently active master nodes (active nodes to which items are currently mapped). For example, one such mechanism employs a hashing function wherein each resource is represented by a unique number (r). The resource-to-master mapping for a particular resource is established by the function: r mod N. Thus, a resource represented by the number 128 in a cluster with 10 active master nodes would be mapped to node 8. The process for determining the node to which a resource is mapped is sometimes known as resource lookup or resolution.
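The r mod N lookup described above can be sketched as a one-line function. The function name resolve_master and its parameter names are chosen here for illustration only; the sketch assumes that master nodes are numbered 0 through N−1.

```python
def resolve_master(resource_id: int, active_masters: int) -> int:
    """Return the master node index for a resource using r mod N."""
    if active_masters <= 0:
        raise ValueError("at least one active master node is required")
    return resource_id % active_masters


# The example from the text: resource 128 with 10 active master nodes.
master = resolve_master(128, 10)
print(master)  # 8
```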
Different systems may use different parameters to generate their resource-to-node mappings. The parameter values that a system uses to generate its resource-to-node mappings are collectively referred to herein as the “mapping parameter values” of the system. Thus, the number of currently active master nodes (N) is a mapping parameter value of a system that determines resource-to-master mappings based on the function: r mod N.
Typically, the mapping parameter values used by a system change in response to changes in the state of the system. Thus, when the state of the system changes, the mapping parameter values change, and when the mapping parameter values change, so do the resulting resource-to-node mappings. For example, for systems in which N is the mapping parameter value, N can vary dynamically when nodes fail.
Relying on the number of live master nodes as a mechanism for dividing up resources between master nodes assures that no resource will be orphaned (i.e. left without a master) upon the failure of its master. The remapping of an item to a new node as a result of a new resource-to-node mapping is known as re-mastering.
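As a rough illustration of re-mastering under the r mod N mapping, the following sketch lists which resources change master when N drops from 10 to 9. The function name remastered is hypothetical, and the sketch assumes that the resulting indices denote positions among the live master nodes; how those positions correspond to physical nodes after a failure is not specified here.

```python
def remastered(resource_ids, old_n, new_n):
    """Map each resource whose master changes to (old_master, new_master)."""
    return {
        r: (r % old_n, r % new_n)
        for r in resource_ids
        if r % old_n != r % new_n
    }


# Resource 128 moves from index 8 to index 2 when N drops from 10 to 9.
moved_128 = remastered([128], 10, 9)

# Over resources 0..89, count how many are re-mastered by the same change.
moved_count = len(remastered(range(90), 10, 9))
```

Under these assumptions, 81 of the 90 resources in the second call are re-mastered, including many whose original master did not fail. This suggests why a change in N affects far more than the resources of the failed node.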
In conventional distributed systems, all nodes need to be informed of any changes in the system that affect the mapping parameter values, since the mapping parameter values dictate the resource-to-node mappings that the nodes should use. If not all nodes are using the same mapping parameter values, problems can arise. For example, if the number of remaining live nodes after a master node fails (N−1) is not propagated to all of the remaining nodes, then it is possible for different nodes to have different views of the live master nodes in the system, resulting in incorrect mappings. This can cause the cluster to break into two or more clusters (sometimes referred to as a split brain).
An example of “split brain” is as follows. In this example, node 0 may believe there are N nodes in the cluster and resolve object X to node 8; node 1 may believe that there are N−1 nodes and resolve X to node 9. If node 8 also believes that there are N nodes, it might fetch X from the database, as it is not aware that the most current version of the object has been recovered on node 9 (based on the value N−1 fed to the hash function). Having two nodes, each of which believes it has the latest version of the object, is a classic case of split-brain syndrome within the cluster.
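The divergence in this example can be reproduced with the r mod N mapping. The concrete values below (object X represented by the number 19, and N equal to 11) are not given in the example; they are assumptions chosen only so that the arithmetic yields node 8 and node 9 respectively.

```python
def resolve_master(resource_id: int, believed_n: int) -> int:
    """Resolve a resource using the caller's own belief about N."""
    return resource_id % believed_n


X = 19   # assumed numeric identifier for object X
N = 11   # assumed cluster size before the failure

view_of_node_0 = resolve_master(X, N)       # node 0 still believes there are N nodes
view_of_node_1 = resolve_master(X, N - 1)   # node 1 has learned there are N-1 nodes

# The two nodes disagree on who masters X, which is the split-brain condition.
split_brain = view_of_node_0 != view_of_node_1
```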
To prevent such problems from occurring, most clusters resort to cluster-wide synchronization to agree on the mapping parameter values. This synchronization halts all operations that require the assistance of any master node (e.g. locking a resource), including requests for resources managed by surviving master nodes, which should ideally be unaffected. This halting of operations results in undesirable downtime for the cluster. Furthermore, the cost of maintaining a consistent value of N increases non-linearly with more master nodes; clusters larger than 32 master nodes may experience minutes of downtime for all users during node failure. Exacerbating this problem is the fact that as cluster size increases, so does the probability of node failure. As a result, customers shy away from large clusters even though they provide greater computing capacity.
For the aforementioned reasons, it is highly desirable to deploy clusters in such a manner that there is zero downtime upon node failure. Therefore, there is a need for techniques to maintain reliability and consistency in a cluster without requiring a cluster-wide synchronization event.
The approaches described in this section are approaches that could be pursued, but not necessarily approaches that have been previously conceived or pursued. Therefore, unless otherwise indicated, it should not be assumed that any of the approaches described in this section qualify as prior art merely by virtue of their inclusion in this section.