In many computer network configurations, a network resource can be accessed and shared by several computers. As used herein, the term “network resource” refers to any device or application used by computers (also referred to as “nodes”) in the network. A network resource may, for example, be a printer or a mass storage device. Often, the network resource includes at least one controller to manage input and/or output operations at the network resource. The controller may, for example, coordinate data transfers and provide data caching for the network resource.
Because of the controller's critical importance in the operation of network resources, it is common to find at least one backup controller in addition to an active controller within fault-tolerant network resources. For instance, many Redundant Array of Inexpensive Disks (RAID) systems include one active controller and one backup controller.
Although there may be more than one controller present at a network resource, a key principle for ensuring efficient operation of the network resource is that there be only one controller actively used by a node group. When two or more controllers are active at the same time, problems such as disk thrashing or delayed data transmission may occur. To avoid these and other problems, a node group will often coordinate communications with a network resource such that all the node group members utilize the same active controller.
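By way of illustration only, the single-active-controller principle can be expressed as a simple check over the routes chosen by the members of a node group. The following is a minimal sketch; the function names, node names, and controller names are hypothetical and are not taken from any particular system.

```python
from typing import Dict, Set


def controllers_in_use(routes: Dict[str, str]) -> Set[str]:
    """Return the distinct controllers that the group's nodes send I/O to.

    `routes` maps a (hypothetical) node name to the controller it uses.
    """
    return set(routes.values())


def is_coordinated(routes: Dict[str, str]) -> bool:
    """True when every node in the group uses the same active controller."""
    return len(controllers_in_use(routes)) <= 1


# All members agree on one controller: the desired state.
coordinated = {"node1": "ctrl_a", "node2": "ctrl_a", "node3": "ctrl_a"}

# One member has switched on its own: two controllers are now active,
# which is the condition associated with thrashing or delayed transfers.
split = {"node1": "ctrl_a", "node2": "ctrl_b", "node3": "ctrl_a"}

print(is_coordinated(coordinated))  # True
print(is_coordinated(split))        # False
```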
It is possible that one or more nodes in a node group may lose access to the active controller of a network resource due to cabling or other issues. In conventional systems, the node that notices the outage must typically stop all data transfer to and from the network resource, inform the node group of the outage, and wait until the node group selects a new active controller. Such conventional solutions emphasize performance over availability, as they avoid the performance penalty associated with flushing the cache on the active controller. Nevertheless, queuing data until all nodes in the group choose a backup controller requires large memory reserves at each node and can lead to data loss. Furthermore, if consensus cannot be reached on a new controller, the system is forced to fail the input/output operations.
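The conventional procedure described above, and the two ways in which it can fail, may be sketched as follows. This is a simplified model under stated assumptions, not an implementation of any actual system: the class and function names are hypothetical, consensus is modeled as unanimous agreement among the votes, and the memory reserve is modeled as a fixed number of queued writes.

```python
from collections import deque
from typing import Dict, List, Optional


def select_controller(votes: Dict[str, str]) -> Optional[str]:
    """Return the agreed controller, or None if the nodes do not all agree.

    Assumption: consensus means every node votes for the same controller.
    """
    choices = set(votes.values())
    return choices.pop() if len(choices) == 1 else None


class ConventionalNode:
    """Hypothetical node that follows the conventional outage procedure."""

    def __init__(self, queue_capacity: int) -> None:
        self.queue_capacity = queue_capacity  # memory reserved for queued writes
        self.pending: deque = deque()         # writes held back during the outage
        self.dropped = 0                      # writes lost once the reserve is full
        self.io_halted = False

    def notice_outage(self) -> None:
        # Stop all data transfer to and from the network resource.
        self.io_halted = True

    def write(self, block: bytes) -> str:
        if not self.io_halted:
            return "sent"
        if len(self.pending) < self.queue_capacity:
            self.pending.append(block)
            return "queued"
        # The reserve is exhausted while the group is still deciding.
        self.dropped += 1
        return "dropped"

    def resume(self, new_controller: Optional[str]) -> List[bytes]:
        """Act on the group's decision once it is known."""
        if new_controller is None:
            # No consensus was reached: the queued I/O has to be failed.
            self.pending.clear()
            raise IOError("no consensus on a new controller; I/O failed")
        flushed = list(self.pending)
        self.pending.clear()
        self.io_halted = False
        return flushed
```

In this model, a node with a reserve of two writes that receives three writes during an outage queues the first two and drops the third, and a call to `resume(None)` fails every queued write. Both outcomes correspond to the drawbacks noted above: the need for large memory reserves at each node and the forced failure of input/output operations when no controller can be agreed upon.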