1. Field of the Invention
The invention relates to fault detection and recovery in networks, including token-passing ring networks.
2. Description of Related Art
A token-passing ring is a means for interconnecting a network of computer systems. The machines are connected in series, with the output of one machine feeding the input of the next. Each machine represents a communication node of the system, and the network takes the form of a unidirectional communications ring. Any node may originate a message, which is then passed from one node to the next until it arrives at its destination node. Normally, if the ring is broken anywhere, the entire ring goes down and the machines on the ring can no longer communicate with each other.
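By way of illustration only, the forwarding behavior described above may be sketched as follows. The function name `deliver`, the node numbering, and the representation of a break as a set of failed link indices are assumptions made for this sketch and do not describe any particular ring implementation.

```python
def deliver(num_nodes, source, destination, broken_links=frozenset()):
    """Forward a message around a unidirectional ring (illustrative model).

    Nodes are numbered 0..num_nodes-1, and node i transmits only to node
    (i + 1) % num_nodes. broken_links names failed links by the index of
    the sending node. Returns the list of nodes visited, or None when the
    ring is down.
    """
    # Assumption: as stated in the text, a break anywhere takes the whole
    # ring down, so no message is delivered while any link has failed.
    if broken_links:
        return None

    path = [source]
    current = source
    while current != destination:
        current = (current + 1) % num_nodes  # pass to the next node in series
        path.append(current)
    return path


if __name__ == "__main__":
    print(deliver(5, 3, 1))                    # intact ring
    print(deliver(5, 0, 2, broken_links={4}))  # one failed link
```

In this model a message from node 3 to node 1 on a five-node ring travels through nodes 4 and 0, while a single failed link between any pair of nodes prevents all delivery.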
In installations of more than a few nodes, the ring is typically divided into subrings. These subrings are connected together at one or more central locations, where manual switches allow each subring to be switched in and out of the network. Large networks may have a hierarchy in which subrings are themselves divided into smaller subrings. When the network breaks, an operator must switch the malfunctioning subring off the main network. This restores service to all the machines except those on the malfunctioning subring.
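The effect of switching a malfunctioning subring out of the network may be sketched as follows. This is a simplified model under stated assumptions: the `Subring` class, its `broken` and `switched_in` fields, and the `served_nodes` function are hypothetical names introduced for illustration, and the model covers a single level of subrings rather than a hierarchy.

```python
from dataclasses import dataclass


@dataclass
class Subring:
    """One subring attached to the main ring through a manual switch."""
    name: str
    nodes: list
    broken: bool = False       # True when the subring has a fault
    switched_in: bool = True   # position of the manual switch


def served_nodes(subrings):
    """Return the nodes that currently have network service.

    Assumption: a fault on any switched-in subring breaks the whole ring,
    so no node is served until that subring is switched out.
    """
    active = [s for s in subrings if s.switched_in]
    if any(s.broken for s in active):
        return []
    return [node for s in active for node in s.nodes]


if __name__ == "__main__":
    ring = [
        Subring("A", ["a1", "a2"]),
        Subring("B", ["b1"], broken=True),
        Subring("C", ["c1"]),
    ]
    print(served_nodes(ring))      # fault still switched in: nobody is served
    ring[1].switched_in = False    # operator flips the switch for subring B
    print(served_nodes(ring))      # service restored except on subring B
```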
When the ring is operating, a token (a message used to control access to the network) is passed around the ring from node to node. Only the node that has possession of the token may initiate communication. A network failure can cause the token to be lost and thereby disrupt service for network users. Recovery is slow, because it takes time to find out that the network is down, locate the malfunctioning subring, and then flip the appropriate switch.
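The access rule described above may be sketched as follows. The function `run_token_ring` and its parameters are hypothetical, and the sketch assumes that a node sends at most one message each time it holds the token; actual token-passing protocols define their own holding rules.

```python
def run_token_ring(pending, rounds=1, lose_token_at=None):
    """Circulate a token and let only the token holder transmit.

    pending holds one list of queued messages per node. lose_token_at, if
    given, is the index of a node at which the token is lost (a stand-in
    for a network failure). Returns (sent, token_alive), where sent is a
    list of (node, message) pairs in transmission order.
    """
    queues = [list(q) for q in pending]  # copy so the caller's data is untouched
    sent = []
    num_nodes = len(queues)
    holder = 0                            # assumption: node 0 starts with the token
    for _ in range(rounds * num_nodes):
        if holder == lose_token_at:
            return sent, False            # token lost: no node may transmit again
        if queues[holder]:
            # Only the node in possession of the token initiates communication.
            sent.append((holder, queues[holder].pop(0)))
        holder = (holder + 1) % num_nodes  # pass the token to the next node
    return sent, True


if __name__ == "__main__":
    print(run_token_ring([["a"], [], ["b", "c"]], rounds=2))
    print(run_token_ring([["a"], [], ["b"]], lose_token_at=1))
```

Once the token is lost in this model, the messages still queued at later nodes are never sent, which corresponds to the disruption described above.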
Users commonly run jobs overnight, and in some cases jobs that take several days. If the network fails and is down for an extended period of time, the lapse in communication between nodes working together can cause one of these jobs to abort. This is highly undesirable, since the job will usually have to be started over again.