1. The Field of the Invention
This invention relates to computer clustering systems and, in particular, to methods for improving the availability and reliability of computer clustering system resources and data in the event of a loss of communication between computer clustering system servers.
2. Description of Related Art
A typical computer cluster includes two or more servers and one or more network devices in communication with each other across a computer network. During normal operation of a computer cluster, the servers provide the network devices with computer resources and a place to store and retrieve data. In current computer cluster configurations, the computer cluster data is stored on a shared computer disk that can be accessed by any of the network servers.
A typical computer cluster is illustrated in FIG. 1, which shows two network servers 110 and 120 in communication with network devices 130, 140, and 150 across computer network 101. Network server 110 and network server 120 communicate with shared disk 104 across communication lines 105 and 106, respectively.
When using a computer cluster, it is often desirable to provide continuous availability of computer cluster resources, particularly where a computer cluster supports a number of user workstations, personal computers, or other network client devices. It is also often desirable to maintain uniform data between different file servers attached to a computer clustering system and to maintain continuous availability of this data to client devices. To achieve reliable availability of computer cluster resources and data, it is necessary for the computer cluster to be tolerant of software and hardware problems or faults. This is generally accomplished by providing redundant computers and mass storage devices, such that a backup computer or disk drive is immediately available to take over in the event of a fault.
A technique currently used for implementing reliable availability of computer cluster resources and data using a shared disk configuration as shown in FIG. 1 involves the concept of quorum. Quorum relates to a state in which one network server controls a specified minimum number of network devices such that the network server has the right to control the availability of computer resources and data in the event of a disruption of service from any other network server. The manner in which a particular network server obtains quorum can be conveniently described in terms of each server and other network devices casting “votes”. For instance, in the two server computer cluster configuration of FIG. 1, network server 110 and network server 120 each cast one vote to determine which network server has quorum. If neither network server obtains a majority of the votes, shared disk 104 then casts a vote such that one of the two network servers 110 and 120 obtains a majority, with the result that quorum is obtained by one of the network servers in a mutually understood and acceptable manner. Only one network server has quorum at any time, which ensures that only one network server will assume control of the entire network if communication between the network servers 110 and 120 is lost.
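The voting described above can be illustrated with a minimal sketch. It is offered only as an example under stated assumptions and is not taken from any particular cluster implementation: the function name `quorum_holder`, the representation of votes as a mapping from voter to candidate, and the `disk_tiebreak` callable standing in for shared disk 104 are all hypothetical. The sketch also assumes that a majority is counted over all voters, namely the servers plus the shared disk.

```python
from collections import Counter
from typing import Callable, Dict, Optional


def quorum_holder(server_votes: Dict[str, str],
                  disk_tiebreak: Callable[[], str]) -> Optional[str]:
    """Return the server that obtains quorum, or None if no server does.

    server_votes maps each voting server to the server it votes for.
    disk_tiebreak is a hypothetical stand-in for the shared disk; it is
    consulted only when the server votes alone do not yield a majority.
    """
    if not server_votes:
        raise ValueError("at least one server must cast a vote")

    # Assumption: the electorate is every server plus the shared disk.
    total_voters = len(server_votes) + 1
    majority = total_voters // 2 + 1

    tally = Counter(server_votes.values())
    leader, count = tally.most_common(1)[0]
    if count >= majority:
        return leader  # a majority exists without the disk voting

    # No majority yet: the shared disk casts the deciding vote.
    tally[disk_tiebreak()] += 1
    leader, count = tally.most_common(1)[0]
    return leader if count >= majority else None


# Two-server configuration of FIG. 1: each server votes for itself,
# so the shared disk must break the tie.
votes = {"server_110": "server_110", "server_120": "server_120"}
winner = quorum_holder(votes, disk_tiebreak=lambda: "server_120")
```

In this example each server holds one of three possible votes, so neither reaches the required two votes until the disk votes, at which point exactly one server holds quorum.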
The use of quorum to attempt to make network servers available in the event of a disruption will now be described. Server 110 can detect a loss of communication with server 120 for two general reasons. The first is an event, such as a crash, at server 120, after which server 120 is no longer capable of providing network resources to clients. The second is a disruption in the communication infrastructure of network 101 between the two servers, with server 120 continuing to be capable of operating within the network. If server 110 can no longer communicate with server 120, its initial operation is to determine whether it has quorum. If server 110 determines that it does not have quorum, it then attempts to get quorum by sending a command to shared disk 104 requesting the disk to cast a vote. If shared disk 104 does not vote for server 110, server 110 shuts itself down to avoid operating independently of server 120. In this case, server 110 assumes that network server 120 is operating with quorum, and server 120 continues to control the computer cluster. However, if shared disk 104 votes for network server 110, server 110 takes quorum and control of the computer cluster and continues operation under the assumption that network server 120 has malfunctioned.
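The sequence of operations performed by a server upon losing communication with its peer can be sketched as follows. This is an illustrative example only: the names `Outcome`, `on_communication_loss`, and `request_disk_vote` are hypothetical, and the disk command is modeled as a simple callable that returns whether the disk voted for the requesting server. The description above does not state what a server does when it already holds quorum; the sketch assumes that it simply retains control.

```python
from enum import Enum
from typing import Callable


class Outcome(Enum):
    RETAIN_CONTROL = "retain_control"  # server already had quorum (assumed case)
    TAKE_CONTROL = "take_control"      # disk voted for this server
    SHUT_DOWN = "shut_down"            # disk did not vote for this server


def on_communication_loss(has_quorum: bool,
                          request_disk_vote: Callable[[], bool]) -> Outcome:
    """Decide what a server does after it can no longer reach its peer.

    request_disk_vote is a hypothetical stand-in for the command sent to
    the shared disk; it returns True if the disk votes for this server.
    """
    # Initial operation: determine whether this server has quorum.
    if has_quorum:
        return Outcome.RETAIN_CONTROL

    # Without quorum, ask the shared disk to cast a vote.
    if request_disk_vote():
        # The disk voted for this server: take quorum and continue,
        # on the assumption that the peer has malfunctioned.
        return Outcome.TAKE_CONTROL

    # The disk did not vote for this server: assume the peer holds
    # quorum and shut down to avoid operating independently of it.
    return Outcome.SHUT_DOWN


# Server 110 lacks quorum and the disk declines to vote for it.
result = on_communication_loss(has_quorum=False, request_disk_vote=lambda: False)
```

Note that every path for a server without quorum depends on a response from the shared disk, so the sketch has no outcome for the case in which the disk itself is unavailable.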
While the use of quorum to enable one of a plurality of network servers to continue providing network resources in the event of a disruption in the network is often satisfactory, the use of a shared disk places the entire network and the data stored on the disk at risk of being lost. For instance, if shared disk 104, rather than one of the network servers 110 and 120, malfunctions, neither of the servers can operate, and the data may be permanently lost. Moreover, in a shared disk configuration, the computer cluster servers are typically placed in close proximity to each other. This creates the possibility that a natural disaster or power failure may take down the whole computer cluster.