1. Technical Field
The present invention relates generally to a distributed data processing system and in particular to a method and apparatus for managing a server system within a distributed data processing system. Still more particularly, the present invention relates to a method and apparatus for handling network communication failures among servers within a distributed data processing system.
2. Description of Related Art
Multiple computers may be employed to increase performance of a computing site or to avoid problems associated with single computer failures. These computers are used to form a cluster, which is also referred to as a clustered computer system. An individual computer within a cluster is referred to as a cluster server, cluster member, or cluster node.
Generally, cluster nodes communicate with each other over a network. If a network communication failure occurs, the cluster may be partitioned into two or more parts. If cluster servers in a partition are unable to determine the status of cluster servers outside of the partition, continued application processing may result in a condition referred to as split-brain operation. To a subset A of cluster nodes, it is unclear whether the node(s) in some other subset B are actually operational or are simply unable to communicate with subset A. Such a situation is dangerous, as it can result in corruption of data maintained by the cluster or incorrect processing results.
For example, if a clustered computer system containing two cluster nodes is partitioned by severing the links used for cluster communication between the nodes, each node will be unable to determine the state or status of the other. Further, any mutual exclusion mechanisms which depend on the severed link(s) will be inoperable or will yield incorrect results. This can result in both nodes deciding that it is proper to control a resource which is only safely controlled by one node at a time. Such a condition can result in corrupted data or incorrect processing results. A common example of such a resource is a file system residing on a disk connected to both nodes.
Corruption of a shared database is the most common manifestation of split-brain operation, though any mutually-accessible resource may be affected. More specifically, split-brain operation may be defined as a condition involving two or more computers in which mutually-accessible resources are not under the control of any mutual exclusion mechanism.
Clearly, to avoid a split-brain condition, mutual exclusion mechanisms must be preserved. Traditionally, high-availability systems have relied on various methods to minimize the probability of a split-brain condition. These include such things as redundant communication links and deadman timers. Each of these mechanisms has its strengths and weaknesses. Because of this, it is common for multiple links and methods to be used concurrently.
Redundant communication links are commonly used for split-brain prevention. These include such things as secondary network links, asynchronous (TTY) links, or device-bus links (of which target-mode SCSI is an example). A common use of a redundant link is to provide what is known as a heartbeat capability. Generally, a heartbeat operation is nothing more than an ongoing sequence of messages from one communication endpoint (a sender) to one or more other endpoints (receivers) which indicate to the receiver(s) that the sender is operational. These messages are commonly referred to as “I'm alive” messages. A heartbeat exchange occurs when these communication endpoints pass heartbeat messages bi-directionally, indicating the “liveness” of all participating endpoints. In the event of a primary communication failure, this heartbeat mechanism over the redundant link(s) permits an endpoint to know that another endpoint remains active despite an inability to participate in normal cluster communication. Generally, this information is used as a fail-safe to ensure that resource control errors of the type described earlier do not occur.
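For illustration only, the receiving side of such a heartbeat mechanism can be sketched as follows. This is a minimal sketch, not the mechanism of any particular clustering product: the class name `HeartbeatMonitor`, the `timeout_s` parameter, and the injectable `clock` are assumptions introduced here.

```python
import time


class HeartbeatMonitor:
    """Tracks the last "I'm alive" message seen from each peer endpoint."""

    def __init__(self, timeout_s, clock=time.monotonic):
        self.timeout_s = timeout_s  # hypothetical liveness window, in seconds
        self.clock = clock          # injectable so the sketch can be tested
        self.last_seen = {}         # peer id -> time of last heartbeat

    def record(self, peer):
        """Call whenever a heartbeat message arrives from `peer`."""
        self.last_seen[peer] = self.clock()

    def is_alive(self, peer):
        """True if `peer` sent a heartbeat within the timeout window."""
        seen = self.last_seen.get(peer)
        return seen is not None and self.clock() - seen <= self.timeout_s
```

Note that a `False` result only means that no recent heartbeat was received. It does not establish that the peer has actually failed.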
If a redundant communication link is only used as a heartbeat mechanism, then it provides the cluster node with only enough information to determine that an unsafe condition may exist in which it would be potentially dangerous to take over certain resources. A heartbeat alone may not indicate the exact nature of the condition or reveal information sufficient to recover from it. However, it is sufficient to assure that a cluster node can recognize the existence of an unsafe condition with respect to resource control and take no action which might compromise resource integrity. This is the approach commonly taken: if an unsafe condition with respect to a cluster node is seen, do not attempt to take over any processing resources which may already be under control of that node. It is better to do nothing than risk the consequences of a mistake.
For example, assume a two node system sharing a disk. The disk contains a database which may only be controlled by one node at a time. A mutual exclusion mechanism in the form of a lock manager operates over a primary network link to assure that only one node updates the database at a time. A heartbeat mechanism operates over a secondary network link. Should the primary link be disabled, negotiation for database access through the mutual exclusion mechanism will also be disabled. However, should the secondary link remain active and heartbeat communication continue to be received, a cluster node will at least be able to recognize the fact that the other cluster node remains active and it would be unsafe to acquire control of the database. This example should only be viewed as illustrative. The mechanisms described are also applicable to clusters of greater than two nodes.
It should be pointed out that while use of a redundant heartbeat link can allow a node to recognize the existence of an unsafe condition, it cannot guarantee recognition of a safe condition. Referring to the previous example, if both the primary and secondary links were to fail, a cluster node would not be able to determine the true nature of the failure. One possibility is that the communication links are intact but the other node has itself failed and is no longer sending messages. Another is that the links have both failed and the other node remains operational but unable to communicate that fact. This points out the essential problem in preventing split-brain operation. It is impossible to guarantee safety of operation against shared resources in the absence of a functioning mutual exclusion mechanism. The best one can do is minimize the probability of accessing such resources under unsafe conditions.
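The reasoning of the two-node example can be summarized in a short sketch. The names `Condition` and `assess` are hypothetical and serve only to enumerate the three cases discussed above.

```python
from enum import Enum


class Condition(Enum):
    NORMAL = "negotiate access through the lock manager"
    UNSAFE = "peer is alive; do not take over its resources"
    UNKNOWN = "peer failed or both links failed; cannot tell which"


def assess(primary_up, heartbeat_seen):
    """Classify what one node can conclude about the other node."""
    if primary_up:
        # The mutual exclusion mechanism is still reachable.
        return Condition.NORMAL
    if heartbeat_seen:
        # Primary link lost, but the secondary link shows the peer is active.
        return Condition.UNSAFE
    # Both links silent: a failed peer and a double link failure look alike.
    return Condition.UNKNOWN
```

The `UNKNOWN` case is the one that no heartbeat arrangement can resolve on its own, since the node has no evidence either way.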
Because of this need to minimize the probability of interpreting an unsafe condition as safe, it is often important not only to utilize multiple links concurrently, but also for those links to be of different types. Further, for each type, the hardware, processing algorithm and operating system code path (communication stack) should be as different as possible. This reduces the possibility of encountering single points of failure within the hardware or operating system.
Generally, primary communication among cluster nodes occurs using higher performance network links, such as Ethernet, FDDI, or Token-Ring. Often, backup links utilizing one of these or a similar mechanism are used to provide cluster communication should the main link fail. Such backup links are helpful as secondary links for split-brain prevention; however, they may not be as reliable as other link types if they share code paths in common with the primary link(s). An example of this would be the TCP/IP communications stack in the operating system. Further, should a backup link take over primary communication, it is no longer useful as a secondary link.
One or more secondary links for split-brain prevention should be of a different type than the primary, both in hardware and operating system code path. For illustrative purposes, there are two commonly used secondary communication mechanisms of note for split-brain prevention: asynchronous (TTY) links and target-mode SCSI.
Use of an asynchronous TTY link to provide a redundant heartbeat connection is a common feature of most failover High-Availability (HA) clustering implementations. When the link transport is done using a different communications stack than regular cluster communication and the associated process(es) run at an appropriate priority, this can be a very reliable method of split-brain avoidance, especially when some amount of cluster state (for example, the list of applications a node thinks it “owns”) is also passed along in the heartbeat messages.
Topology issues arise with async links when the cluster expands beyond two nodes. Suddenly we are faced with having either to provide N-1 connections per node or to use some sort of ring topology with two connections per node. We also have the issue of needing to reconfigure the link topology when nodes are added or removed (especially so in the N-1 connections case).
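The scaling difference between the two topologies can be made concrete with a little arithmetic. In the sketch below, the function names are hypothetical. The totals follow from the per-node counts given above, since each point-to-point link is shared by two nodes.

```python
def full_mesh_links(n):
    """Total links when every node has N-1 connections (each link counted once)."""
    return n * (n - 1) // 2


def ring_links(n):
    """Total links in a ring with two connections per node.

    A ring needs at least three nodes; two nodes need only a single link.
    """
    return n if n >= 3 else max(n - 1, 0)


for n in (2, 4, 8, 16):
    print(n, full_mesh_links(n), ring_links(n))
```

With N-1 connections per node the total number of links grows quadratically with cluster size, while the ring grows linearly.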
Another problem that occurs as the cluster size grows is one of maintaining proper communication synchronization. For example, with more nodes, more heartbeat messages are in-process simultaneously, increasing the difficulty in maintaining heartbeat timings.
Finally, there are subtle portability issues associated with TTY code in general: async implementations vary widely in their behavior and are particularly susceptible to driver/hardware idiosyncrasies.
Target-mode SCSI is another redundant link alternative which has been used in HA failover cluster implementations. From a high-level perspective, one can think of it as being similar in use to async TTY heartbeat links, except that all parties are connected via a common device bus, in this case SCSI. The communication is, however, point-to-point as in async TTY. Basically, the SCSI bus is used as a “back-channel” communication path between nodes connected to the bus. In addition to any system-to-device communication over the bus (such as to a disk), there are also system-to-system heartbeat exchanges.
Target-mode SCSI depends on the same hardware/driver support required for shared SCSI disk. As long as all cluster nodes require shared disk for their application, this approach does not require anything additional for it to work, other than an appropriate heartbeat daemon at each node and of course the operating system support to allow such communication on the bus.
One issue with target-mode SCSI in high-volume disk I/O environments is that node-to-node communication can often be delayed by bus contention, resulting in “false positives” (deciding incorrectly that an endpoint is non-operational) if proper safeguards are not followed (adequate time-outs, etc.). As the number of active point-to-point links over the bus increases, the problem becomes more of a factor.
Deadman timers are another method for preventing split-brain operation. Basically, a deadman timer is a one-way heartbeat mechanism, rather than an exchange among two or more end-points. A deadman has a control point which receives messages, and a sending point which provides messages. If the control point does not receive a message from the sending point within some established time period, it will assume that the sending point is non-operational and will take corrective action. Many deadman mechanisms utilize hardware assists.
For example, there are computer systems containing Service Processors, which operate deadman timers. These processors are capable of stopping or restarting the main processor. Should the main processor fail to provide a message to the deadman timer within a given time period, the service processor will consider the computer system to be non-operational and may effect a shutdown, restart, or other appropriate action. This may prevent corruption of data in a clustered computer system when a node becomes unable to respond and participate in cluster operation.
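A deadman timer of this kind can be sketched in software as follows. This is a minimal sketch under stated assumptions: the names `DeadmanTimer`, `kick`, `poll`, and `on_expire` are hypothetical, and a real service processor would implement the equivalent logic in hardware or firmware.

```python
import time


class DeadmanTimer:
    """One-way heartbeat: the control point acts if the sender goes silent."""

    def __init__(self, period_s, on_expire, clock=time.monotonic):
        self.period_s = period_s    # hypothetical maximum silence allowed
        self.on_expire = on_expire  # corrective action, e.g. restart the node
        self.clock = clock          # injectable so the sketch can be tested
        self.deadline = clock() + period_s
        self.expired = False

    def kick(self):
        """Sending point reports in; push the deadline out."""
        if not self.expired:
            self.deadline = self.clock() + self.period_s

    def poll(self):
        """Control point checks the deadline; fires the action only once."""
        if not self.expired and self.clock() > self.deadline:
            self.expired = True
            self.on_expire()
        return self.expired
```

In this sketch the main processor would call `kick()` periodically, while the control point calls `poll()` on its own schedule.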
Though existing methods can provide a high degree of split-brain prevention, certain problems remain. First, the mechanisms are often not directly tied to the critical shared resource(s). The more closely a failure of the split-brain prevention mechanism is tied to a failure of the critical shared resource, the better one can assure that a split-brain condition cannot occur, at least with respect to that resource. Second, without N to N connectivity for split-brain prevention, it is difficult, often impossible, for a cluster node to recognize more than the fact that an unsafe condition is present. Also, certain multiple-node failure scenarios remain problematic.
A clustered computer system, or more simply “a cluster,” is a set of computers (also called servers or nodes) that are connected to communication networks and often shared devices to allow the set of computers to interact in a coherent manner. The nodes are said to be cluster members, cluster servers, or cluster nodes. The network allows the nodes to send and receive messages.
For the purposes of this invention, the nodes in the cluster are also connected to one or more shared storage resources, typically shared disk devices. During normal operation, programs running on each node will read and write data to the shared device(s). These data accesses from different nodes must be coordinated to prevent unintended or uncontrolled overlays of the data. This coordination is often achieved by sending messages among the nodes over the network and utilizing an appropriate mutual exclusion mechanism, for example, a lock manager.
If a cluster node fails, it can no longer write data to the shared device(s). Therefore, it cannot affect the integrity of the data on the shared device(s). Other cluster nodes can continue to access the shared device(s) and maintain the integrity of that data. This is possible because the non-failing nodes or “surviving nodes” can continue to coordinate their data accesses by communicating over the network. If the computer network fails, the normal coordination mechanism is disrupted, and the integrity of the shared data is jeopardized.
A failure of the network may cause two or more groups of nodes to be isolated from each other, where members of one group cannot communicate over the network with members of any other group. These different groups can no longer effectively coordinate their accesses to the shared data. Indeed, one group may believe that the other group has terminated altogether. These different groups of network-connected nodes are called cluster partitions. If nodes in more than one cluster partition were to continue writing data to the shared disks, the data may easily become corrupted. Such a condition is known as split-brain operation.
To prevent the loss of data integrity from split-brain operation, it is necessary to prevent multiple cluster partitions from continuing to access the shared data. It is equally important to prevent a single node that is not connected to the cluster network from starting up and accessing the shared data, in effect forming its own cluster partition.
The present invention provides a method in a computer for handling such partitions of a clustered computer system. The invention provides a mechanism for prevention of split-brain operation in the event of a network communication failure between any subset of the cluster nodes.
Further, the invention provides enhancement over existing split-brain prevention mechanisms in that it permits each cluster node to determine the true membership of the cluster in the absence of primary cluster communication and to resolve the network partition optimally. This allows cluster nodes to not only determine whether an unsafe condition exists, but also to effect actions which will correctly bring about a safe condition and allow resource control to be established and processing to continue among a subset of the cluster nodes.
The present invention includes a computer implemented method for preventing split-brain operation. It includes the ability to both recognize the existence of a partitioned cluster condition and to resolve the partition to permit continued operation, the method including the steps of: maintaining cluster state information on a shared storage device, such as a disk; utilizing this data to determine the cluster communication connectivity as seen by each cluster node; making a determination of the desired cluster membership in the event of a network partition; and effecting the desired cluster membership by voluntarily leaving the cluster or taking other action as required.
In the preferred embodiment detailed in the following more particular description of the invention, software components running on each node are used to detect node or network failures within the cluster. Portions of a shared disk are assigned to be used as a secondary communication link among the cluster nodes. Data in these portions of the shared disk will identify the cluster nodes and indicate each node's ability to communicate with other nodes over the network. When a node or network communication failure is detected, each node will independently write new data to the disk, read the data written by the other nodes, calculate statistics about any cluster partitions that have been formed, and should a cluster partition be identified, decide on an action the node should take to resolve it.
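By way of illustration only, the read, calculate, and decide steps might be sketched as follows. This is not the claimed implementation. The record format (a mapping from node id to the set of nodes that node reports it can reach), the function names, and in particular the resolution rule are all assumptions made for this example; the text above does not specify how the surviving partition is chosen.

```python
def find_partitions(records):
    """Group nodes into partitions from the records read off the shared disk.

    records: node id -> set of node ids that node reports it can reach.
    Two nodes are treated as connected only if each reports reaching the other.
    Returns a list of frozensets, one per partition.
    """
    unvisited = set(records)
    partitions = []
    while unvisited:
        start = unvisited.pop()
        group, frontier = {start}, [start]
        while frontier:
            node = frontier.pop()
            for peer in records[node]:
                if peer in unvisited and node in records.get(peer, ()):
                    unvisited.discard(peer)
                    group.add(peer)
                    frontier.append(peer)
        partitions.append(frozenset(group))
    return partitions


def decide(me, records):
    """Return "stay" or "leave" for node `me` (integer node ids assumed)."""
    partitions = find_partitions(records)
    # ASSUMPTION: the largest partition survives, and ties go to the partition
    # holding the lowest node id. The source does not state the actual rule.
    winner = max(partitions, key=lambda p: (len(p), -min(p)))
    return "stay" if me in winner else "leave"
```

Under this assumed rule, nodes that read the same records apply the same deterministic calculation, so nodes in different partitions would reach compatible decisions without exchanging network messages.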
It is therefore an object of the present invention to provide the ability for cluster nodes in a clustered computer system to determine the existence of a safe or unsafe condition for control of shared resources, even in the event of such degenerate cases as an N-way failure of primary cluster communication.
It is yet another object of the invention to permit successful resolution of a network partition disabling cluster communication among subsets of the cluster nodes by establishing which subset(s) may safely control shared resources.
The foregoing and other objects, features and advantages of the invention will be apparent from the following more particular description of a preferred embodiment of the invention, as illustrated in the accompanying drawings wherein like reference numbers represent like parts of the invention.