Present-day computer clusters are typically geographically collocated. Such clusters also comprise a large number of nodes. Each node is associated with a corresponding server, computer, or other node device generally referred to simply as a machine. Clusters have resources such as storage devices, e.g., hard disks or other mass storage devices, as well as many types of peripheral resources (e.g., monitors, printers). In addition, the infrastructure of a typical computer cluster contains switches, routers, hubs and the like. With the aid of this infrastructure a client, e.g., a personal computer, can connect to the cluster via a wide area network (WAN) such as the Internet and take advantage of the cluster's services and resources. The most common services involve remote applications such as electronic mail.
Although the cluster is connected to the wide area network, it usually runs on its own separate local area network (LAN). The local network offers private connections between nodes. These can be used, e.g., for communications between nodes and other useful functions. Such functions include distribution of a synchronization signal (master clock), delivery of heartbeat signals between nodes to verify their status, and independent access to the various cluster resources.
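By way of illustration only, the use of heartbeat signals to verify node status can be sketched as follows. This is a minimal sketch under stated assumptions and is not taken from any of the references discussed herein: the class name HeartbeatMonitor, its methods, and the timeout value are hypothetical, and timestamps are passed in explicitly rather than read from a clock.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class HeartbeatMonitor:
    """Hypothetical helper: tracks the last heartbeat received from each node."""

    timeout: float  # seconds of silence after which a node is suspected
    last_seen: Dict[str, float] = field(default_factory=dict)

    def record(self, node: str, now: float) -> None:
        # Called whenever a heartbeat from `node` arrives over the private LAN.
        self.last_seen[node] = now

    def suspected(self, now: float) -> List[str]:
        # A node is suspected (not proven) to have failed when its most
        # recent heartbeat is older than the timeout.
        return sorted(
            node
            for node, seen in self.last_seen.items()
            if now - seen > self.timeout
        )


monitor = HeartbeatMonitor(timeout=3.0)
monitor.record("node-a", now=0.0)
monitor.record("node-b", now=2.0)
print(monitor.suspected(now=4.0))
```

Note that a missed heartbeat only indicates that a node is unreachable from the observer. It does not distinguish a failed node from a live node on the other side of a network partition.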
Until recently, cluster resources, and mass storage in particular, were typically shared between the nodes. Unfortunately, shared storage resources usually introduce a single point of failure in the cluster. Furthermore, shared storage resources are very sensitive to split brain situations in which cluster nodes may be live but lose network connectivity between them. In these situations, the nodes may independently race to take over control of the cluster and its resources. This may lead to very detrimental results, e.g., when two or even more nodes manage to mount and write to file systems concurrently.
Nodes of a cluster require coordination to ensure tolerance to node failure. For this reason, one node is usually chosen as the active, leader or master node. When the master node fails, the cluster automatically switches over to a new master in a process called failover. Clearly, it is desirable to ensure that the failover process be rapid and that any service disruption experienced by the clients be minimized. This is especially true for the more recent “high availability” clusters that strive to provide virtually uninterrupted service to many clients.
Of course, prior to the advent of computer clusters, fault tolerance in individual computers was a known issue. In particular, the idea of providing computers with redundant central processing units (CPUs), power, buses, etc. and ensuring failover between them has been described by many references. For example, U.S. Pat. No. 7,441,150 to Abe discloses a fault tolerant computer system and interrupt control method that uses primary and secondary systems.
Unfortunately, the issues involved in failover between systems of a fault tolerant computer and those involved in failover within a fault tolerant cluster are not sufficiently similar for the earlier solutions to be merely reapplied in the new context. Meanwhile, the trend in the last 20 years has been to move away from single-machine designs, such as mainframes or individual servers in which each individual component is made redundant, towards distributed systems in which individual machines are redundant and can fail.
Among a number of prior art approaches to fault-tolerance, the reader will find many protocols for solving consensus in a network of unreliable processors or computers. Consensus is the process of agreeing on one result, such as the network leader, among a group of participants. This problem becomes difficult when the participants, i.e., the individual computers or processors, or their communication medium may experience failures. One of the most effective methods to address this problem involves voting by quorum among the participating computers to elect and change their leader. The Paxos protocol is one of the best-known prior art approaches to quorum voting and the necessary execution steps. A number of the salient aspects of this protocol are addressed in U.S. Pat. No. 5,261,085 to Lamport.
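The majority rule that underlies voting by quorum can be illustrated with a short sketch. It is not an implementation of the Paxos protocol, which involves additional rounds and state. It shows only the assumption that a leader is accepted when a strict majority of the entire cluster, not merely of the responding nodes, votes for it. The function name elect_leader and the node names are hypothetical.

```python
from collections import Counter
from typing import Dict, Optional


def elect_leader(votes: Dict[str, str], cluster_size: int) -> Optional[str]:
    """Return the candidate backed by a strict majority of the cluster.

    `votes` maps each voting node to the candidate it supports. Nodes that
    are unreachable simply do not appear in `votes`. Returns None when no
    candidate reaches a quorum.
    """
    if not votes:
        return None
    candidate, count = Counter(votes.values()).most_common(1)[0]
    # The quorum is measured against the full cluster size, so two disjoint
    # groups of nodes can never both reach a majority.
    return candidate if count > cluster_size // 2 else None


# Five-node cluster split into groups of three and two nodes.
majority_side = {"n1": "n1", "n2": "n1", "n3": "n1"}
minority_side = {"n4": "n4", "n5": "n4"}
print(elect_leader(majority_side, cluster_size=5))
print(elect_leader(minority_side, cluster_size=5))
```

In this sketch only the three-node group can elect a leader, while the two-node group obtains no result, which is the property that makes quorum voting relevant to split brain situations.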
The prior art also contains numerous teachings on appropriate synchronization architecture and methods in order to speed up failover and minimize service disruption in computer clusters. For example, U.S. Pat. No. 7,194,652 to Zhou et al. teaches a “high availability” system where one control processor is “active” while another control processor is kept in a “standby” mode. The standby processor is continuously provided with state information of the active processor in the form of a “standby image”. Since the standby image is synchronized to the active image a rapid transition to the active mode by the standby processor is possible when the active control processor fails. Although this approach is appropriate for failover in high availability clusters, the method and architecture taught by Zhou et al. do not address the split brain problem.
U.S. Pat. No. 7,590,886 to Moscirella et al. also addresses the issue of facilitating device redundancy in a fault-tolerant system. The system has devices in active and standby roles. A periodic advertisement with an incrementing configuration sequence number is exchanged with each of the devices in the active role and the redundancy group to ensure fault-tolerance. The state changes of the devices are propagated asynchronously. This teaching enables a fault-tolerant system but is not appropriate for a high availability cluster with many nodes, resources and large amounts of state information. In particular, in a cluster application the teachings of Moscirella et al. would not enable efficient failover and resistance to split brain situations.
In U.S. Pat. No. 7,953,890 Katkar et al. teach how to switch to a new cluster coordination resource or cluster coordinator machine. To prevent split brain situations, the coordinator is a single machine that determines what services can and cannot run at a given point in time. In this approach each node in the cluster needs to commit to use the new coordinator resource. This means that when one or more nodes are offline the cluster or a portion of it may be disabled. Furthermore, the approach applies at the level of the entire cluster, rather than at the lower level of the individual cluster nodes. Finally, since the coordinator is a single machine, efficient failover is not provided for, unless the coordinator is brought back up.
Additional teaching on the subject of failover while assisting in the prevention of split brain situations at the cluster level is found in U.S. Pat. No. 8,001,413 to Wetmore et al. In this case, the teaching is applied at the level of entire data center sites. The data centers register with a datacenter activation coordinator who determines when the datacenter activates its services. Timeouts are used to ensure that a passive/backup data center and a formerly active data center cannot both ‘go live’ simultaneously, thereby assisting in the prevention of split brain situations. Although Wetmore's teachings do address split brain situations to avoid having two data centers coming online simultaneously, they are not appropriate for automated failover between individual cluster nodes with concurrent prevention of split brain situations between these cluster nodes.
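One generic way in which timeouts can keep two sites from being live at the same time is a time-limited lease, sketched below. This is an illustrative assumption and is not represented as the mechanism of Wetmore et al.: the class Lease, its fields, and the guard interval are hypothetical, and the sketch assumes that the clocks of both sites differ by less than the guard interval.

```python
from dataclasses import dataclass


@dataclass
class Lease:
    """Hypothetical time-limited permission for one site to be active."""

    granted_at: float  # time at which the active site last renewed the lease
    duration: float    # validity period of the lease, in seconds

    def holder_may_serve(self, now: float) -> bool:
        # The active site must stop serving once its lease has run out.
        return now < self.granted_at + self.duration

    def standby_may_activate(self, now: float, guard: float) -> bool:
        # The standby site waits for lease expiry plus a guard interval
        # (assumed to cover clock differences) before going live.
        return now >= self.granted_at + self.duration + guard


lease = Lease(granted_at=0.0, duration=10.0)
for t in (5.0, 10.5, 11.0):
    print(t, lease.holder_may_serve(t), lease.standby_may_activate(t, guard=1.0))
```

For any non-negative guard interval the two conditions are never true at the same instant, at the cost of a window in which neither site serves.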
In fact, although many useful methods and protocols are available, the prior art does not provide an integrated and effective method to ensure failover and prevent split brain situations in a high availability cluster.