1. Field of the Invention
The present invention relates to computer systems and, more particularly, to improved methods and apparatus for managing operations of clustered computer systems.
2. Description of the Related Art
In contrast to single mainframe computing models of the past, more distributed computing models have recently evolved. One such distributed computing model is known as a clustered computing system. FIG. 1 illustrates an exemplary clustered computing system 100 including computing nodes (nodes) A, B and C, storage devices (e.g., storage disks 102-104), and other computing devices 106-110 such as scanners, printers, digital cameras, etc. For example, each of the nodes A, B and C can be a computer with its own processor and memory. The collection of nodes A, B and C, storage disks 102-104, and other devices 106-110 makes up the clustered computing system 100.
Typically, the nodes in a cluster are coupled together through a “private” interconnect with redundant pathways. As shown in FIG. 1, nodes A, B and C are coupled together through private communication channels 112 and 114. For example, the private communication channels 112 and 114 can adhere to Ethernet, ATM, or Scalable Coherent Interface (SCI) standards. A client 116 can communicate with the clustered computing system 100 via a network 118 (e.g., public network) using a variety of protocols such as Transmission Control Protocol (TCP), User Datagram Protocol (UDP), etc. From the point of view of the client 116, the clustered computing system 100 is a single entity that can provide the client 116 with a variety of computer-implemented services, e.g., web-hosting, transaction processing, etc. In other words, the client 116 is not aware of which particular node or nodes of the clustered computing system 100 are providing service to it.
The clustered computing system 100 provides a scalable and cost-efficient model where off-the-shelf computers can be used as nodes. The nodes in the clustered computing system 100 cooperate with each other to provide a distributed computing model that is transparent to users, e.g., the client 116. In addition, in comparison with single mainframe computing models, the clustered computing system 100 provides improved fault tolerance. For example, in case of a node failure within the clustered computing system 100, other nodes can take over to perform the services normally performed by the node that has failed.
Typically, nodes in the clustered computing system 100 send each other “responsive” (often referred to as “heartbeat” or activation) signals over the private communication channels 112 and 114. The responsive signals indicate whether nodes are active and responsive to other nodes in the clustered computing system 100. Accordingly, these responsive signals are periodically sent by each of the nodes so that if a node does not receive the responsive signal from another node within a certain amount of time, a node failure can be suspected. For example, in the clustered computing system 100, if nodes A and B do not receive a signal from node C within an allotted time, nodes A and B can suspect that node C has failed. In this case, if nodes A and B are still responsive to each other, a two-node sub-cluster (AB) results. From the perspective of the sub-cluster (AB), node C can be referred to as a “non-responsive” node. If node C has really failed then it would be desirable for the two-node sub-cluster (AB) to take over services from node C. However, if node C has not really failed, taking over the services performed by node C could have dire consequences. For example, if node C is performing write operations to the disk 104 and node B takes over the same write operations while node C is still operational, data corruption can result.
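By way of illustration only, the time-out behavior described above can be sketched as follows. This is a minimal sketch and not a description of any particular implementation; the class name HeartbeatMonitor, its methods, and the 5-second allotted time are assumptions introduced for the example. Note that the sketch only reports suspicion; as discussed above, a missing responsive signal does not establish that a node has actually failed.

```python
from dataclasses import dataclass, field


@dataclass
class HeartbeatMonitor:
    """Hypothetical helper: tracks when each peer last sent a responsive signal."""
    timeout: float  # allotted time in seconds (assumed value, not from the text)
    last_seen: dict = field(default_factory=dict)

    def record(self, node: str, now: float) -> None:
        # Called whenever a responsive (heartbeat) signal arrives from `node`.
        self.last_seen[node] = now

    def suspected(self, now: float) -> set:
        # A peer is only *suspected* of failure; it may still be operational.
        return {n for n, t in self.last_seen.items() if now - t > self.timeout}


# Node A's view of its peers B and C.
monitor = HeartbeatMonitor(timeout=5.0)
monitor.record("B", now=0.0)
monitor.record("C", now=0.0)
monitor.record("B", now=4.0)        # B keeps signaling; C goes silent
print(monitor.suspected(now=6.0))   # {'C'}
```

In this sketch node A continues to regard node B as responsive while node C becomes a “non-responsive” node, which corresponds to the two-node sub-cluster (AB) described above.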
It should be noted that the fact that nodes A and B have not received responsive signals from node C does not necessarily mean that node C is not operational with respect to the services that are provided by node C. Other events can account for why responsive signals for node C have not been received by nodes A and B. For example, the private communication channels 112 and 114 may have failed. It is also possible that node C's program for sending responsive signals may have failed but node C is fully operational with respect to the services that it provides. Thus, it is possible for the clustered computing system 100 to get divided into two or more functional sub-clusters wherein the sub-clusters are not responsive to each other. This situation can be referred to as a “partition in space” or “split brain” where the cluster no longer behaves as a single cohesive entity. In this and other situations, when the clustered computing system no longer behaves as a single cohesive entity, it can be said that the “integrity” of the system has been compromised.
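To make the notion of a partition in space concrete, the following minimal sketch groups nodes into sub-clusters according to which pairs of nodes can still exchange responsive signals. The function name sub_clusters and the representation of working links as node pairs are assumptions made for this example only.

```python
def sub_clusters(nodes, links):
    """Group nodes into sets whose members can still reach one another.

    `links` is a list of (node, node) pairs that can currently exchange
    responsive signals (a hypothetical representation for illustration).
    """
    neighbours = {n: set() for n in nodes}
    for a, b in links:
        neighbours[a].add(b)
        neighbours[b].add(a)

    seen, groups = set(), []
    for start in nodes:
        if start in seen:
            continue
        # Depth-first walk over working links starting from `start`.
        group, stack = set(), [start]
        while stack:
            n = stack.pop()
            if n in group:
                continue
            group.add(n)
            stack.extend(neighbours[n] - group)
        seen |= group
        groups.append(sorted(group))
    return groups


# All private links working: one cohesive cluster.
print(sub_clusters(["A", "B", "C"], [("A", "B"), ("B", "C"), ("A", "C")]))
# Links to node C lost: sub-cluster (AB) and node C no longer see each other.
print(sub_clusters(["A", "B", "C"], [("A", "B")]))
```

More than one group in the result corresponds to the “split brain” situation described above, in which each sub-cluster may still be fully functional.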
In addition to partitions in space, there are other potential problems that need to be addressed in managing the operation of clustered computing systems. For example, another potential problem associated with operating clustered computing systems is referred to as a “partition in time” or “amnesia.” As is known to those skilled in the art, partitions in time can occur when a clustered computing system is operated with cluster configurations that vary over time. To facilitate understanding, consider the situation where the clustered computing system 100 of FIG. 1 is operating without node C (only nodes A and B have been started and are operational). In this situation, if a configuration change is made to the clustered computing system 100, configuration information which is typically kept for each node is updated. Typically, such configuration information is stored in a Cluster Configuration Repository (CCR). With respect to FIG. 1, each of the nodes A, B and C has a CCR 120, 122 and 124, respectively. In this case, configuration information for nodes A and B is updated by updating information stored in the CCR 120 and CCR 122 of nodes A and B, respectively. However, since node C is not operating in this example, the configuration information for node C would not be updated. Typically, when node C comes up again, the previously updated configuration information is communicated by other nodes (e.g., A or B) to node C so that the information stored in the CCR 124 can be updated. However, if node C comes up by itself (prior to having its node configuration information updated and in a cluster configuration that does not include either of the nodes A and B), the configuration information for node C does not get updated and, thus, is incorrect. In this situation, node C does not have the updated configuration information and the clustered computing system 100 can be said to be partitioned “in time”.
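The scenario above can be illustrated with a minimal sketch in which each CCR carries a version counter. The version counter, the function names apply_change and stale_nodes, and the setting name example_setting are assumptions introduced for this example; the description above does not state how a CCR records that it has been updated.

```python
# One CCR per node; the "version" field is an assumed mechanism for illustration.
ccrs = {node: {"version": 1, "settings": {}} for node in ("A", "B", "C")}


def apply_change(ccrs, running, key, value):
    """Apply a configuration change only to the nodes that are currently running."""
    for node in running:
        ccrs[node]["settings"][key] = value
        ccrs[node]["version"] += 1


def stale_nodes(ccrs):
    """Return the nodes whose CCR is older than the newest CCR in the cluster."""
    newest = max(c["version"] for c in ccrs.values())
    return sorted(n for n, c in ccrs.items() if c["version"] < newest)


# A configuration change is made while only nodes A and B are operational.
apply_change(ccrs, running=["A", "B"], key="example_setting", value="new")
print(stale_nodes(ccrs))  # ['C']
```

The comparison in stale_nodes is only possible when node C can see the CCR of node A or node B. If node C starts by itself, it has nothing to compare against and proceeds with the outdated information, which is the partition in time described above.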
To address potential problems such as partitions in time and space associated with operation of clustered computing systems, various solutions have been proposed and implemented in conventional approaches. Unfortunately, conventional approaches have relied on solutions that often require significant human intervention. For example, to avoid a partition in space, a human operator would have to intervene to determine whether a non-responsive node is in fact no longer operating. Similarly, human intervention would be required to keep track of the different cluster configurations that are used, to ensure that partitions in time do not occur.
Another problem is that conventional approaches often require many incongruent solutions, each implemented to account for a different potential problem that may arise in the operation of clustered computing systems. For example, conventionally it is common to use a particular solution for partitions in time and a different solution for partitions in space. In other words, the conventional approaches do not provide techniques that can be implemented as a consistent integral solution to avoid the various operational problems encountered in clustered computing systems.
In view of the foregoing, there is a need for improved methods for managing the operations of clustered computing systems.