In modern computer systems, computers may communicate with each other and with other computing equipment over various types of data networks. Routable data networks are configured to route data packets (or frames) from a source network node to one or more destination network nodes. As used herein, the term “routable protocol” refers to a communications protocol that contains a network address as well as a device address, allowing data to be routed from one network to another. Examples of routable protocols are SNA, OSI, TCP/IP, XNS, IPX, AppleTalk, and DECnet. A “routable network” is a network in which communications are conducted in accordance with a routable protocol. One example of a routable network is the Internet, in which data packets are routed in accordance with the Internet Protocol (IP). In a routable data network, when a network routing device (or router) receives a data packet, the device examines the data packet in order to determine how the data packet should be forwarded. Similar forwarding decisions are made as necessary at one or more intermediate routing devices until the data packet reaches a desired destination node.
Network routers typically maintain routing tables that specify network node addresses for routing data packets from a source network node to a destination network node. When a data packet arrives at a router, an address contained within the packet is used to retrieve an entry from the routing table that indicates the next hop (or next node) along a desired route to the destination node. The router then forwards the data packet to the indicated next hop node. The process is repeated at successive router nodes until the packet arrives at the desired destination node. A data packet may take any available path to the destination network node. In accordance with IP, data packets are routed based upon a destination address contained in the data packet. The data packet may contain the address of the source of the data packet, but usually does not contain the addresses of the devices in the path from the source to the destination. These addresses typically are determined by each routing device in the path based upon the destination address and the available paths listed in the routing tables.
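By way of illustration only, the next-hop lookup described above may be sketched as follows. The sketch assumes a longest-prefix-match rule, which is a common convention for IP forwarding but is not stated in the description above; the table contents, the addresses, and the function name next_hop are hypothetical.

```python
import ipaddress

# Hypothetical routing table: destination prefix -> next-hop address.
# All prefixes and addresses here are illustrative assumptions.
ROUTING_TABLE = {
    ipaddress.ip_network("10.0.0.0/8"): "192.0.2.1",
    ipaddress.ip_network("10.1.0.0/16"): "192.0.2.2",
    ipaddress.ip_network("0.0.0.0/0"): "192.0.2.254",  # default route
}


def next_hop(destination, table=ROUTING_TABLE):
    """Return the next hop for a destination address, or None if no route matches.

    Assumption: when several prefixes contain the destination, the most
    specific one (longest prefix) is chosen.
    """
    addr = ipaddress.ip_address(destination)
    matches = [net for net in table if addr in net]
    if not matches:
        return None
    best = max(matches, key=lambda net: net.prefixlen)
    return table[best]
```

In this sketch each router consults only its own table and the destination address of the packet, which mirrors the hop-by-hop forwarding described above: no router needs to know the complete path in advance.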
Distributed computer systems typically include a number of recovery functions that increase the reliability of process execution. For example, various checkpointing and restoration (or rollback) techniques have been developed for recovering from hardware and software failures. In accordance with these techniques, the information that is needed to re-execute a process in a given state is stored at a number of checkpoints during the operation of the process. If the execution of the process is interrupted (e.g., the process has failed or is hung), the state of the interrupted process is rolled back to the checkpoint state immediately preceding the interruption, and the process is re-executed from that checkpoint state.
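A minimal sketch of checkpointing and rollback, under simplifying assumptions, is given below. The process state is held in memory as a small dictionary, a checkpoint is taken every fixed number of steps, and an interruption is simulated by raising an exception; the class name CheckpointedTask and its parameters are hypothetical and do not correspond to any particular system.

```python
import copy


class CheckpointedTask:
    """Sums a list of numbers, checkpointing its state at regular intervals."""

    def __init__(self, items, checkpoint_every=2):
        self.items = list(items)
        self.checkpoint_every = checkpoint_every
        self.state = {"index": 0, "total": 0}
        # The initial state serves as the first checkpoint.
        self.checkpoint = copy.deepcopy(self.state)
        self.rollbacks = 0

    def save_checkpoint(self):
        # Store everything needed to re-execute from the current state.
        self.checkpoint = copy.deepcopy(self.state)

    def rollback(self):
        # Restore the checkpoint state immediately preceding the interruption.
        self.state = copy.deepcopy(self.checkpoint)
        self.rollbacks += 1

    def run(self, fail_once_at=None):
        """Process all items; optionally simulate one interruption at an index."""
        pending_failure = fail_once_at
        while self.state["index"] < len(self.items):
            i = self.state["index"]
            try:
                if i == pending_failure:
                    pending_failure = None  # fail only once
                    raise RuntimeError("simulated interruption")
                self.state["total"] += self.items[i]
                self.state["index"] = i + 1
                if self.state["index"] % self.checkpoint_every == 0:
                    self.save_checkpoint()
            except RuntimeError:
                self.rollback()
        return self.state["total"]
```

For example, running the task over the items 1 through 5 with a simulated interruption at index 3 rolls the state back to the checkpoint taken after the second item, re-executes from there, and still produces the same total as an uninterrupted run.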
Various heartbeat fault detection schemes also have been developed. In a relatively small network environment in which the topology and membership information are known, such heartbeat-based fault detection schemes typically involve heartbeat monitors that are installed at each node of the distributed system to probe the health of each associated node. In general, heartbeat monitors require constant monitoring of all nodes of the system. A heartbeat monitor may probe the health of an associated node process by, for example, detecting if the node process has failed, monitoring the node log file for any indication of process failure, exchanging messages with the node process, or making calls to the node system manager to determine if the system is operating properly. If a heartbeat monitor detects that a particular node process has failed, it may attempt to restart the process or notify a network management system (or console) of the failure, or both.
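One possible form of heartbeat monitoring by message exchange is sketched below. The sketch assumes a timeout model in which a node is treated as failed when no heartbeat message has been received from it within a fixed interval; that model, the class name HeartbeatMonitor, and the on_failure callback (standing in for a restart attempt or a notification to a management console) are illustrative assumptions rather than features of any specific scheme.

```python
import time


class HeartbeatMonitor:
    """Tracks the last heartbeat from each known node and reports silent nodes."""

    def __init__(self, timeout, clock=time.monotonic, on_failure=None):
        self.timeout = timeout          # seconds of silence tolerated
        self.clock = clock              # injectable clock, useful for testing
        self.on_failure = on_failure or (lambda node: None)
        self.last_seen = {}             # node name -> time of last heartbeat

    def register(self, node):
        # Membership is assumed to be known in advance.
        self.last_seen[node] = self.clock()

    def heartbeat(self, node):
        # Called whenever a heartbeat message arrives from the node.
        self.last_seen[node] = self.clock()

    def check(self):
        """Return the nodes whose last heartbeat is older than the timeout."""
        now = self.clock()
        failed = [
            node for node, seen in self.last_seen.items()
            if now - seen > self.timeout
        ]
        for node in failed:
            # Hypothetical hook: restart the process, notify a console, or both.
            self.on_failure(node)
        return failed
```

Because check must be invoked repeatedly for every registered node, the sketch also illustrates why such monitors entail constant monitoring of all nodes of the system.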
Still other network recovery schemes have been proposed.