A computer network is a collection of interconnected computing devices that can exchange data and share resources. In a packet-based network, such as the Internet, the computing devices communicate data by dividing the data into small blocks called packets, which are individually routed across the network from a source device to a destination device. The destination device extracts the data from the packets and assembles the data into its original form. Dividing the data into packets enables the source device to resend only those individual packets that may be lost during transmission.
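By way of illustration only, the packetization and reassembly described above may be sketched as follows. The sequence-numbered tuple format and the names `packetize`, `missing`, and `reassemble` are hypothetical and are not drawn from any particular protocol; the sketch merely shows why only the lost packets need to be resent.

```python
from typing import Dict, List, Tuple

# Hypothetical packet format: (sequence number, payload bytes).
Packet = Tuple[int, bytes]


def packetize(data: bytes, size: int) -> List[Packet]:
    """Divide data into sequence-numbered blocks of at most `size` bytes."""
    return [
        (seq, data[offset:offset + size])
        for seq, offset in enumerate(range(0, len(data), size))
    ]


def missing(received: Dict[int, bytes], total: int) -> List[int]:
    """Return the sequence numbers the destination has not yet received."""
    return [seq for seq in range(total) if seq not in received]


def reassemble(received: Dict[int, bytes], total: int) -> bytes:
    """Assemble the payloads into the original data, in sequence order."""
    gaps = missing(received, total)
    if gaps:
        raise ValueError(f"cannot reassemble, missing packets: {gaps}")
    return b"".join(received[seq] for seq in range(total))
```

In this sketch, a destination that received every packet except one would report a single missing sequence number, and the source would resend only that packet rather than the entire data.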
Certain devices, referred to as routers, maintain routing information that describes available routes through the network. Each route defines a path between two locations on the network. Upon receiving an incoming packet, the router examines information within the packet and forwards the packet in accordance with the routing information.
In order to maintain an accurate representation of a network, routers maintain control-plane peering sessions through which they exchange routing or link state information that reflects the current topology of the network. In addition, these routers typically send periodic packets to each other via the session to communicate the state of the devices. These periodic packets are sometimes referred to as “keepalives” or “hellos.” For example, a first router may send a packet to a second router every five seconds to verify that the second router is still operational. The first router may require or otherwise expect the second router to respond to the packet within a certain amount of time. When a response packet is not received in the allotted time frame, the first router may conclude that a network failure has occurred, such as failure of the second router or failure of the link connecting the two routers. Consequently, the first router may update its routing information to exclude that particular link, and may issue a number of routing protocol update messages to neighboring routers indicating the topology change.
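As a minimal sketch of the timeout behavior just described, the following tracks when each peer last responded and declares a peer failed once the allotted time has elapsed. The class name `PeerMonitor`, its fields, and the use of caller-supplied timestamps are assumptions made for illustration; no particular routing protocol or router implementation is implied.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class PeerMonitor:
    """Hypothetical keepalive bookkeeping for one router's peers."""

    timeout: float  # assumed allotted response time, in seconds
    last_heard: Dict[str, float] = field(default_factory=dict)
    down: Set[str] = field(default_factory=set)

    def on_response(self, peer: str, now: float) -> bool:
        """Record a response from `peer`.

        Returns True if the peer had been declared failed, meaning the
        routing information must be updated again to restore it.
        """
        self.last_heard[peer] = now
        if peer in self.down:
            self.down.remove(peer)
            return True
        return False

    def check(self, now: float) -> List[str]:
        """Return peers newly declared failed because no response
        arrived within the allotted time."""
        newly_failed = [
            peer
            for peer, heard in self.last_heard.items()
            if peer not in self.down and now - heard > self.timeout
        ]
        self.down.update(newly_failed)
        return newly_failed
```

Each peer returned by `check` corresponds to a point at which the first router would exclude a link from its routing information and notify its neighbors. Note that the sketch cannot distinguish a genuine failure from a peer that is merely slow to respond.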
However, a number of non-failure conditions may prevent the second router from responding to the first router within the required response time. Failure to respond due to these and other conditions can result in significant network thrashing and other problems. As one example, the computing resources of the second router may be consumed by heavy network traffic loads. In other words, with the increased amount of network traffic on the Internet, for example, many conventional routers have become so busy performing other functions, such as route resolution, that they cannot respond to periodic packets quickly enough. Furthermore, during certain procedures, such as software upgrades or patches, the second router may not be able to respond to the periodic packets while it switches from a primary routing engine to a secondary or backup routing engine. If the time during which the second router cannot respond exceeds the allotted time that the first router will wait for a response, the first router will conclude that the second router has failed, even though the interruption is most likely only temporary in these circumstances.
For example, a router may undergo a system software upgrade that causes a switch from a primary routing engine to a secondary routing engine, and the switch may require a significant period of time, e.g., five seconds. This time period for the switchover may exceed an allowable response time to a periodic packet received from a peer routing device. By the time the router has switched to the backup routing engine and is therefore able to respond to the periodic packet, the neighboring router may already have mistakenly concluded that the router or link has failed. Consequently, the neighboring router may update its routing information to exclude the “failed” router. Furthermore, the neighboring router may send update messages to its own neighboring routers indicating the failure, causing those routers to perform route resolution in similar fashion. Shortly thereafter, the “failed” router may complete the switch, and the backup routing engine (acting as the new primary routing engine) is then able to send its neighboring router a response packet indicating that it is operational while a software upgrade of the primary routing engine is performed. As a result, the neighboring router again updates its routing information to include the router and sends another update message to its neighbors, causing the neighboring routers to once again perform route resolution. The unnecessary route resolution and update messages cause the network routers to thrash, creating significant network delays.
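The sequence above can be illustrated with a small simulation, offered only as a sketch under stated assumptions. The function name `topology_events`, the event labels "withdraw" and "restore", and the specific timestamps are hypothetical; the function reports the topology changes a neighboring router would generate from the times at which it received responses.

```python
from typing import List, Tuple


def topology_events(
    response_times: List[float], timeout: float, end: float
) -> List[Tuple[float, str]]:
    """Return (time, event) pairs seen by a neighbor.

    A "withdraw" is generated when no response has arrived within
    `timeout` seconds of the previous one; a "restore" is generated when
    responses resume afterward. `end` is the time observation stops.
    """
    events: List[Tuple[float, str]] = []
    times = sorted(response_times)
    last = times[0]
    for t in times[1:]:
        if t - last > timeout:
            # The neighbor gave up waiting, then heard from the router again.
            events.append((last + timeout, "withdraw"))
            events.append((t, "restore"))
        last = t
    if end - last > timeout:
        # Still silent when observation stops: withdrawn and not restored.
        events.append((last + timeout, "withdraw"))
    return events


# Assumed scenario: responses every second, then a five-second switchover gap.
responses = [0.0, 1.0, 2.0, 7.0, 8.0]
print(topology_events(responses, timeout=3.0, end=9.0))
print(topology_events(responses, timeout=6.0, end=9.0))
```

With the assumed three-second timeout, the five-second gap yields one withdrawal followed by one restoration, each of which would trigger a round of update messages and route resolution among neighboring routers. With an assumed six-second timeout, the same gap yields no events at all, which shows that the thrashing in this scenario stems from the temporary silence rather than from an actual failure.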