1. Technical Field
This invention generally relates to data processing, and more specifically relates to networked computer systems.
2. Background Art
Since the dawn of the computer age, computer systems have become indispensable in many fields of human endeavor, including engineering design, machine and process control, and information storage and access. In the early days of computers, organizations such as banks, industrial companies, and government agencies would purchase a single computer that satisfied their needs, but by the early 1950s many companies had multiple computers, and the need to move data from one computer to another became apparent. Computer networks were then developed to allow computers to work together.
Networked computers are capable of performing tasks that no single computer could perform. In addition, networks allow low-cost personal computer systems to connect to larger systems to perform tasks that such low-cost systems could not perform alone. Most companies in the United States today have one or more computer networks. The topology and size of the networks may vary according to the computer systems being networked and the design chosen by the system administrator. It is very common, in fact, for companies to have multiple computer networks. Many large companies have a sophisticated blend of local area networks (LANs) and wide area networks (WANs) that effectively connect most computers in the company to each other.
With multiple computers hooked together on a network, it soon became apparent that networked computers could be used to complete tasks by delegating different portions of the task to different computers on the network, which can then process their respective portions in parallel. In one specific configuration for shared computing on a network, the concept of a computer “cluster” has been used to define groups of computer systems on the network that can work in parallel on different portions of a task.
Clusters of computer systems have also been used to provide high-reliability services. The high reliability is provided by allowing services on a server that fails to be moved to a server that is still alive. This type of fault-tolerance is very desirable for many companies, such as those that do a significant amount of e-commerce. In order to provide high-reliability services, there must be some mechanism in place to detect when one of the servers in the cluster becomes inoperative. One known way to determine whether all the servers in a cluster are operative is to have each server periodically issue a message to the other servers indicating that the server that sent the message is still alive and well. These types of messages are commonly referred to in the art as “heartbeats” because as long as the messages continue (i.e., as long as the heart is still beating), we know the server is still alive.
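The heartbeat scheme described above can be sketched as follows. This is a minimal illustration only, not a description of any particular cluster product: the class name `HeartbeatMonitor`, the `timeout` parameter, and the server names are hypothetical, and timestamps are passed in explicitly so that the example is deterministic.

```python
from typing import Dict, List


class HeartbeatMonitor:
    """Tracks the most recent heartbeat time for each server (hypothetical helper)."""

    def __init__(self, timeout: float) -> None:
        # Assumed policy: a server is suspected once it has been silent
        # for longer than `timeout` seconds.
        self.timeout = timeout
        self.last_seen: Dict[str, float] = {}

    def record_heartbeat(self, server: str, now: float) -> None:
        """Note that `server` reported itself alive at time `now`."""
        self.last_seen[server] = now

    def silent_servers(self, now: float) -> List[str]:
        """Return the servers whose last heartbeat is older than the timeout."""
        return sorted(
            server
            for server, seen in self.last_seen.items()
            if now - seen > self.timeout
        )


monitor = HeartbeatMonitor(timeout=3.0)
monitor.record_heartbeat("server_a", now=0.0)
monitor.record_heartbeat("server_b", now=0.0)
monitor.record_heartbeat("server_a", now=2.0)  # server_b misses this round
suspects = monitor.silent_servers(now=4.0)     # only server_b has been silent too long
```

Note that the monitor can only report silence. Whether a silent server has actually failed is a separate question that the heartbeat alone does not answer.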
In the prior art, when a server stops providing a heartbeat, a server in the cluster that is designated as a manager assumes that the silent server has failed. As a result, the manager must provide the resources that were on the failed server on another server in the cluster. Note, however, that the absence of a heartbeat does not always mean a server is dead. For example, a server may not provide a heartbeat because it is temporarily unresponsive due to thrashing, swapping, network floods, etc. If the server is not giving heartbeats but is still alive, there exists the possibility that the server may once again become responsive and start providing heartbeats. If the manager has already assumed the server has failed, and has provided the server's services on another server, we now have two servers that try to provide the same services. This creates a problem in administering the cluster. One way to deal with this problem is to monitor data for a service to make sure that two servers don't try to access the same data for the same service. However, this is complex and inefficient. Without a mechanism for assuring that services in a computer cluster are not duplicated when a server failure is detected, the computer industry will continue to suffer from inadequate and inefficient ways of handling a failed server in a computer cluster.
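The duplicated-service problem described above can be illustrated with a small simulation. This is a sketch under stated assumptions: the functions `fail_over` and `duplicated_services`, the service name `orders`, and the server names are hypothetical, and the simulation assumes the manager has no way to stop services on a server it believes to be dead.

```python
from typing import Dict, Set


def fail_over(assignments: Dict[str, Set[str]], suspect: str, backup: str) -> None:
    """Start every service hosted on `suspect` on `backup` as well.

    Assumption: the manager treats `suspect` as failed and therefore does
    not stop its services, so they remain assigned to `suspect`.
    """
    assignments[backup] |= assignments[suspect]


def duplicated_services(
    assignments: Dict[str, Set[str]], responsive: Set[str]
) -> Set[str]:
    """Return the services offered by more than one responsive server."""
    counts: Dict[str, int] = {}
    for server in responsive:
        for service in assignments[server]:
            counts[service] = counts.get(service, 0) + 1
    return {service for service, n in counts.items() if n > 1}


assignments = {"server_a": {"orders"}, "server_b": set()}

# server_a stops sending heartbeats (for example, because it is thrashing),
# so the manager assumes it has failed and moves its services.
fail_over(assignments, suspect="server_a", backup="server_b")
while_silent = duplicated_services(assignments, responsive={"server_b"})

# server_a becomes responsive again and is still running its services.
after_recovery = duplicated_services(
    assignments, responsive={"server_a", "server_b"}
)
```

While `server_a` is silent, no duplication is visible. Once it recovers, the `orders` service is offered by both servers, which is the situation the preceding paragraph identifies as a problem.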