1. Field of the Invention
The present invention relates to clustered computing systems and, more particularly, to improved methods and apparatus for controlled takeover of services by remaining computing nodes of the clustered computing system after one or more other nodes have been shut down.
2. Description of the Related Art
In contrast to single mainframe computing models of the past, more distributed computing models have recently evolved. One such distributed computing model is known as a clustered computing system. FIG. 1 illustrates an exemplary clustered computing system 100 including computing nodes (nodes) A, B and C, storage devices (e.g., storage disks 102-104), and other computing devices 106-110 representing other devices such as scanners, printers, digital cameras, etc. For example, each of the nodes A, B and C can be a computer with its own processor and memory. The collection of nodes A, B and C, storage disks 102-104, and other devices 106-110 makes up the clustered computing system 100.
Typically, the nodes in a cluster are coupled together through a "private" interconnect with redundant pathways. As shown in FIG. 1, nodes A, B and C are coupled together through private communication channels 112 and 114. For example, the private communication channels 112 and 114 can adhere to Ethernet, ATM, or Scalable Coherent Interconnect (SCI) standards. A client 116 can communicate with the clustered computing system 100 via a network 118 (e.g., public network) using a variety of protocols such as Transmission Control Protocol (TCP), User Datagram Protocol (UDP), etc. From the point of view of the client 116, the clustered computing system 100 is a single entity that can provide the client 116 with a variety of computer-implemented services, e.g., web-hosting, transaction processing, etc. In other words, the client 116 is not aware of which particular node(s) of the clustered computing system 100 is (are) providing service to it.
The clustered computing system 100 provides a scalable and cost-efficient model where off-the-shelf computers can be used as nodes. The nodes in the clustered computing system 100 cooperate with each other to provide a distributed computing model that is transparent to users, e.g., the client 116. In addition, in comparison with single mainframe computing models, the clustered computing system 100 provides improved fault tolerance. For example, in case of a node failure within the clustered computing system 100, other nodes can take over to perform the services normally performed by the node that has failed.
Typically, nodes in the clustered computing system 100 send each other "responsive" (often referred to as "heartbeat" or activation) signals over the private communication channels 112 and 114. The responsive signals indicate whether nodes are active and responsive to other nodes in the clustered computing system 100. Accordingly, these responsive signals are periodically sent by each of the nodes so that if a node does not receive the responsive signal from another node within a certain amount of time, a node failure can be suspected. For example, in the clustered computing system 100, if nodes A and B do not receive a signal from node C within an allotted time, nodes A and B can suspect that node C has failed. In this case, if nodes A and B are still responsive to each other, a two-node sub-cluster (AB) results. From the perspective of the sub-cluster (AB), node C can be referred to as a "non-responsive" node. If node C has really failed then it would be desirable for the two-node sub-cluster (AB) to take over services from node C. However, if node C has not really failed, taking over the services performed by node C could have dire consequences. For example, if node C is performing write operations to the disk 104 and node B takes over the same write operations while node C is still operational, data corruption can result.
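The timeout-based suspicion described above can be illustrated with a minimal Python sketch. It is offered only as an illustration under stated assumptions, not as the mechanism of any particular embodiment: the class name `HeartbeatMonitor`, its methods, and the 3-second allotted time are all hypothetical, and time is passed in explicitly so the example is deterministic.

```python
from dataclasses import dataclass, field


@dataclass
class HeartbeatMonitor:
    """Tracks when a responsive (heartbeat) signal was last received per node.

    Hypothetical helper for illustration; not taken from the source text.
    """
    timeout: float  # assumed allotted time, in seconds
    last_seen: dict = field(default_factory=dict)

    def record(self, node: str, now: float) -> None:
        # Called whenever a responsive signal arrives from `node`.
        self.last_seen[node] = now

    def non_responsive(self, now: float) -> set:
        # A node is only *suspected* of failure: its signal is overdue,
        # which does not prove that the node has stopped providing services.
        return {n for n, t in self.last_seen.items() if now - t > self.timeout}


monitor = HeartbeatMonitor(timeout=3.0)
for node in ("A", "B", "C"):
    monitor.record(node, now=0.0)

# Nodes A and B keep signalling; node C goes silent.
monitor.record("A", now=2.0)
monitor.record("B", now=2.5)

suspects = monitor.non_responsive(now=4.0)
print(suspects)
```

In this sketch only node C exceeds the allotted time, so nodes A and B would form the two-node sub-cluster (AB) and treat node C as non-responsive.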
It should be noted that the fact that nodes A and B have not received responsive signals from node C does not necessarily mean that node C is not operational with respect to the services that are provided by node C. Other events can account for why responsive signals for node C have not been received by nodes A and B. For example, the private communication channels 112 and 114 may have failed. It is also possible that node C's program for sending responsive signals may have failed but node C is fully operational with respect to the services that it provides. Thus, it is possible for the clustered computing system 100 to get divided into two or more functional sub-clusters wherein the sub-clusters are not responsive to each other. This situation can be referred to as a "partition in space" or "split brain" where the cluster no longer behaves as a single cohesive entity. In such situations, it is desirable to allow at most one sub-cluster to remain active. Moreover, the one and only sub-cluster remaining active should take over the services of other sub-clusters.
One problem in taking over the services of the other sub-clusters that are being shut down is that partitions in space can occur for a brief period. In other words, if the remaining sub-cluster begins its takeover before the other sub-clusters have stopped processing and shut down, data corruption or data loss can result. Accordingly, takeover of the services by the one remaining sub-cluster needs to be synchronized with the shutdown of all other sub-clusters. However, this synchronization is problematic partly because the disjointed sub-clusters typically do not have a mechanism to communicate with each other. In view of the foregoing, there is a need for improved methods to safely take over services from other nodes in clustered computing systems.
Broadly speaking, the invention relates to improved techniques for managing operations of clustered computing systems. The improved techniques allow one sub-cluster of the clustered computing system to safely take over services of one or more other sub-clusters in the clustered computing system. Accordingly, if the clustered computing system is fragmented into two or more disjointed sub-clusters, one sub-cluster can safely take over services of the one or more other sub-clusters after the one or more other sub-clusters have been shut down. As a result, the clustered computing system can continue to safely provide services even when the clustered computing system has been fragmented into two or more disjointed sub-clusters due to an operational failure.
The invention can be implemented in numerous ways, including a system, an apparatus, a method or a computer readable medium. Several embodiments of the invention are discussed below.
As a method for taking over services by a sub-cluster of a clustered computing system from one or more other sub-clusters of the clustered computing system after the one or more other sub-clusters have been shut down, an embodiment of the present invention includes the acts of: attempting to determine whether a sub-cluster of the clustered computing system is to remain active; initiating shutdown of the sub-cluster when said attempting does not determine within a first predetermined amount of time that the sub-cluster is to remain active; delaying for a second predetermined amount of time after the first predetermined amount of time expires when said attempting determines within the first predetermined amount of time that the sub-cluster is to remain active; and taking over services of one or more other sub-clusters of the clustered computing system after said delaying for the second predetermined amount of time.
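The ordering of these acts can be made concrete with a small Python sketch. It is an illustration only, under assumptions not stated in the text: the names `Action` and `plan`, the use of seconds measured from the start of the attempt, the treatment of a determination made exactly at the deadline as "within" the first amount of time, and the choice to initiate shutdown at the moment the first amount of time expires.

```python
import enum
from typing import Optional, Tuple


class Action(enum.Enum):
    SHUTDOWN = "shutdown"
    TAKE_OVER = "take_over"


def plan(decided_at: Optional[float], t1: float, t2: float) -> Tuple[Action, float]:
    """Return the action for a sub-cluster and the earliest time it may occur.

    decided_at: time (seconds) at which the sub-cluster determined that it is
                to remain active, or None if no determination was ever made.
    t1:         first predetermined amount of time (decision deadline).
    t2:         second predetermined amount of time (delay before takeover).
    All names and units here are hypothetical.
    """
    if decided_at is None or decided_at > t1:
        # No determination within t1: this sub-cluster shuts itself down.
        return (Action.SHUTDOWN, t1)
    # Determination made in time: wait until t1 expires, then delay a
    # further t2 before taking over, giving the losers time to shut down.
    return (Action.TAKE_OVER, t1 + t2)


print(plan(decided_at=4.0, t1=10.0, t2=5.0))
print(plan(decided_at=None, t1=10.0, t2=5.0))
```

Note that in this sketch the takeover time does not depend on how early the determination was made: the surviving sub-cluster always waits out the full first amount of time plus the second, which is what lets it proceed without communicating with the sub-clusters that are shutting down.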
As another method for taking over services by a sub-cluster of a clustered computing system from one or more other sub-clusters of the clustered computing system after the one or more other sub-clusters have been shut down, another embodiment of the present invention includes the acts of: determining whether one or more computing nodes in a cluster have become one or more non-responsive nodes; starting a first timer when said determining determines that one or more of the computing nodes in the cluster have become one or more non-responsive nodes, the first timer having a first duration; attempting to determine whether a sub-cluster vote is at least a majority of the total votes available, the sub-cluster vote representing votes for a sub-cluster of one or more computing nodes, the sub-cluster representing a portion of the cluster that remains responsive; initiating shutdown of the one or more computing nodes of the sub-cluster when said attempting does not determine within the first duration of the first timer that the sub-cluster vote is at least a majority of the total votes available; starting a second timer after the first timer expires when said attempting has determined within the first duration of the first timer that the sub-cluster vote is at least a majority of the total votes available, the second timer having a second duration; and taking over services from the one or more non-responsive nodes by at least one of the computing nodes of the sub-cluster after the second timer expires.
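The majority test in this embodiment can be sketched in Python as follows. This is a hedged illustration: the function name `has_majority`, the `votes` table, and the assignment of one vote per node are assumptions made for the example and are not specified in the text above. A majority is taken here to mean strictly more than half of the total votes available.

```python
def has_majority(sub_cluster_vote: int, total_votes: int) -> bool:
    """True when the sub-cluster holds strictly more than half of all votes."""
    return 2 * sub_cluster_vote > total_votes


# Hypothetical vote assignment: one vote per computing node.
votes = {"A": 1, "B": 1, "C": 1}
total = sum(votes.values())

# Node C becomes non-responsive, leaving sub-cluster (AB) on one side
# and, from its own point of view, sub-cluster (C) on the other.
ab_vote = votes["A"] + votes["B"]
c_vote = votes["C"]

ab_stays_active = has_majority(ab_vote, total)
c_stays_active = has_majority(c_vote, total)
print(ab_stays_active, c_stays_active)
```

Because two disjoint sub-clusters cannot both hold more than half of the total votes, a test of this form lets at most one sub-cluster remain active. An even split (for example, 2 of 4 votes on each side) yields no majority on either side under this sketch.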
As a clustered computing system, one embodiment of the invention includes a cluster of computing nodes having at least two computing nodes, and an integrity protector provided with each one of the computing nodes. The integrity protector operates to determine whether a set of computing nodes in the cluster is to remain active. The set of computing nodes represents at least a portion of the cluster. In addition, the integrity protector operates to allow one or more computing nodes in the set of computing nodes to take over services of one or more other computing nodes of the clustered computing system only after the one or more other computing nodes have shut down.
As a computer readable medium including computer program code for taking over services by a sub-cluster of a clustered computing system from one or more other sub-clusters of the clustered computing system after the one or more other sub-clusters have been shut down, one embodiment of the invention includes: computer program code for attempting to determine whether a sub-cluster of the clustered computing system is to remain active; computer program code for initiating shutdown of the sub-cluster when said computer program code for attempting does not determine within a first predetermined amount of time that the sub-cluster is to remain active; computer program code for delaying for a second predetermined amount of time after the first predetermined amount of time expires when said computer program code for attempting determines within the first predetermined amount of time that the sub-cluster is to remain active; and computer program code for taking over services of one or more other sub-clusters of the clustered computing system after said computer program code for delaying has delayed for the second predetermined amount of time.
The advantages of the invention are numerous. Different embodiments or implementations may have one or more of the following advantages. One advantage is that the invention provides for controlled takeover of services in a clustered computing system. Another advantage is that controlled takeover can be achieved without requiring human intervention. Still another advantage is that the techniques of the invention prevent data corruption or data loss from occurring during takeover of services from other nodes that are being shut down. Yet another advantage is that cost-effective and not overly complicated implementations are possible.
Other aspects and advantages of the invention will become apparent from the following detailed description, taken in conjunction with the accompanying drawings, illustrating by way of example the principles of the invention.