1. Technical Field
The present invention relates to remotely monitoring server activity. More particularly, the invention concerns enabling servers at a site to determine if servers at a remote peer site are operational, even when private and public communication links between the sites are not functioning.
2. Description of Related Art
In high performance computing systems, two or more servers (also called nodes) are often connected to form a cluster. During normal operation, only one server in a cluster can be in control of the cluster at any given time, and the server that is in control is referred to as “owning” the cluster at that time. Whichever server gains control of a common resource (for example, a quorum disk) owns the cluster. An undesirable erroneous condition called “split brain” results if two or more servers each believe they are in control of the cluster at the same time and operate as if they have read or write access to the same data at the same time. The split brain condition can result if a first server erroneously concludes that a second server is not operational when the second server is in fact operational. However, the split brain condition can be avoided by providing communications between the servers in a cluster. For example, server clustering software may use TCP/IP to transmit messages, referred to as “heartbeat communications”, between the nodes in a cluster to determine whether all of the nodes are operational. The heartbeat communications may be transmitted over a private communications link, which may be a private Ethernet link, between the servers in a cluster. The servers in a cluster may also be connected via a public communications link, which may be an Ethernet link, that connects the servers to one or more client computers that may be running application programs. The public Ethernet link may be used as a backup communications link for transmitting heartbeat communications between the servers in a cluster. If heartbeat communications are disrupted due to malfunctions of both the private and public Ethernet links, a common disk resource may be used to determine which node owns cluster operations in systems in which the servers are not separated by significant distances.
For example, clustering solutions such as Microsoft Cluster Server can use a common disk resource, called a quorum disk, to determine which node owns cluster operations. However, if the servers are separated by significant distances, for example several, tens, or hundreds of kilometers, there is no common disk resource, and there is no way to determine which node owns cluster operations if both the private and public Ethernet links fail.
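By way of illustration only, the heartbeat fallback and quorum disk arbitration described above may be sketched as follows. The sketch is written in Python and is a simplified model under stated assumptions, not the implementation of any particular clustering product. All names (`Link`, `QuorumDisk`, `peer_status`, `decide_role`) are hypothetical, and the `peer_alive` argument stands in for whatever the remote server would actually answer on the wire.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Link:
    """Hypothetical model of a communications link between two servers."""
    name: str
    up: bool = True

    def send_heartbeat(self, peer_alive: bool) -> bool:
        # A heartbeat is answered only if the link works AND the peer is running.
        return self.up and peer_alive


class QuorumDisk:
    """Hypothetical common disk resource: the first server to reserve it owns the cluster."""

    def __init__(self) -> None:
        self.owner: Optional[str] = None

    def try_reserve(self, server: str) -> bool:
        if self.owner is None:
            self.owner = server
        return self.owner == server


def peer_status(private: Link, public: Link, peer_alive: bool) -> str:
    """Try the private link first, then the public link as a backup."""
    for link in (private, public):
        if link.send_heartbeat(peer_alive):
            return "alive"
    # Silence on both links: the peer may be down, or only the links may be down.
    return "unknown"


def decide_role(server: str, private: Link, public: Link,
                peer_alive: bool, quorum: Optional[QuorumDisk]) -> str:
    """Decide what a server may conclude after one heartbeat attempt."""
    if peer_status(private, public, peer_alive) == "alive":
        return "peer-reachable"   # normal operation, no arbitration needed
    if quorum is None:
        return "undetermined"     # no common disk resource (e.g. remote sites)
    return "owner" if quorum.try_reserve(server) else "standby"
```

In this model, if both links are marked down and a single `QuorumDisk` is shared, the first server to call `decide_role` becomes the owner and the second becomes the standby. Passing `None` for the quorum disk, as would be the case for widely separated sites, leaves the role undetermined.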
As an example, there is no common disk resource when data is mirrored from a primary site to a remote secondary site. Many computer users, for example businesses and governments, update and store data at a primary site, and maintain, at a secondary site that is physically remote from the primary site, a backup copy of the data that can be used for recovery from a disaster at the primary site. Making a backup copy of data at a remote secondary site is referred to as data mirroring, data shadowing, data duplexing, or remote copying. The servers at the primary site may be separated from the servers at the secondary site by significant distances, for example hundreds of kilometers. The International Business Machines (IBM) Peer-to-Peer Remote Copy (PPRC) facility is an example of a widely used synchronous remote data mirroring system.
The significant distance between remote data mirroring sites makes it unlikely that a secondary site will be harmed if there is an equipment failure or other disaster that damages or destroys data at the primary site, thereby permitting data to be recovered from the secondary site in the event of an equipment failure or disaster at the primary site. However, the distance between the primary and secondary sites precludes using a common disk resource for determining which node owns cluster operations, if both the private and public Ethernet links fail. Consequently, if the servers in a cluster are separated by significant distances, such as in a remote data mirroring system, and if both the private and public Ethernet links between the servers are not functioning, undesirable split brain operation can result because there is no way for servers at one site to determine if servers at another site are operational.
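The failure mode described above may be illustrated with a short Python sketch. It assumes a deliberately naive policy, in which a site assumes ownership whenever it hears nothing from its peer, and no common disk resource. The function names and the policy itself are hypothetical and are shown only to make the ambiguity concrete.

```python
from typing import List


def naive_takeover(peer_heard: bool) -> bool:
    """Hypothetical policy: claim ownership whenever the peer is silent."""
    return not peer_heard


def sites_claiming_ownership(site_a_up: bool, site_b_up: bool,
                             links_up: bool) -> List[str]:
    """Return the sites that would claim the cluster under the naive policy."""
    # A site hears its peer only if the peer is running and some link works.
    a_hears_b = site_b_up and links_up
    b_hears_a = site_a_up and links_up

    claims = []
    if site_a_up and naive_takeover(a_hears_b):
        claims.append("A")
    if site_b_up and naive_takeover(b_hears_a):
        claims.append("B")
    return claims
```

Under these assumptions, a genuine failure of site B leaves only site A claiming ownership. A failure of both links while both sites remain operational causes both sites to claim ownership, which is the split brain condition. From the point of view of site A, the two cases are indistinguishable.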