Most business organizations satisfy their computing needs using computer networks, i.e., large numbers of individual computers and other networked devices interconnected for the exchange of information. These networks themselves are typically interconnected to facilitate the exchange of information (such as e-mail and data files) within and between business organizations.
One example of a network of interconnected networks is presented in FIG. 1. An individual computer network 1001, 1002, 1003, 1004 (generically 100) typically includes one or more computers or other end system devices connected by a shared or switched network that uses a medium access control (MAC) protocol such as Ethernet or token ring. The networks 100 are representative of local area networks (LANs) that exist in business enterprises. The end system devices in computer network 100 may include, for example, mainframe computers, minicomputers, personal computers, network computers, printers, file servers, and network-enabled office equipment. Each end system device on a network 100 is associated with one or more addresses that can identify it as the source of data it has sent or as the destination of data it is intended to receive.
The networks 1001, 1002, 1003, 1004 connect with a series of routers 1041, 1042, 1043, 1044 (generically 104) at network connections. These network connections are typically dedicated point-to-point circuits (e.g., PPP, HDLC, T1) or virtual circuits (e.g., Frame Relay, ATM, MPLS). Two or more routers 104 may be interconnected using telecommunications lines that span greater distances to form a wide area network (WAN) 1101, 1102 (generically 110). A router 104 is a specialized computer that, generally speaking, directs the exchange of data among devices on different computer networks 100 or between other routers 104 as shown in FIG. 1. For example, router 1041 takes data from a first network 1001 and forwards it to a second router 1044 and a third router 1042 towards network 1002. Some routers 104 are multiprotocol routers that execute several routing protocols independently (e.g., OSPF, IS-IS, RIP). The router 104 maintains information in a routing table identifying which groups of addresses are associated with particular network connections. Note that routers 104 can also exist within networks 100 interconnected with local area network technology (e.g., Ethernet, token ring).
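The routing table described above can be illustrated with a minimal sketch. The address prefixes, the connection names, and the longest-prefix-match rule used here are assumptions chosen for illustration; the text itself states only that groups of addresses are associated with particular network connections.

```python
import ipaddress

# Hypothetical routing table: each group of addresses (a prefix) is
# associated with the name of a network connection. All values are
# illustrative assumptions, not taken from FIG. 1.
ROUTING_TABLE = {
    ipaddress.ip_network("10.1.0.0/16"): "connection-a",
    ipaddress.ip_network("10.1.2.0/24"): "connection-b",
    ipaddress.ip_network("0.0.0.0/0"): "default-connection",
}

def lookup(destination):
    """Return the connection for the most specific prefix containing destination."""
    addr = ipaddress.ip_address(destination)
    matches = [net for net in ROUTING_TABLE if addr in net]
    if not matches:
        return None  # no group of addresses covers this destination
    # Assumption: the longest (most specific) matching prefix wins.
    best = max(matches, key=lambda net: net.prefixlen)
    return ROUTING_TABLE[best]
```

Under these assumptions, a destination covered by both the /16 and the /24 group is sent out the connection associated with the /24 group, and any destination matching neither falls through to the default connection.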
In normal operation, a router 104 is subject to various failure modes that render it unable to transfer data between its connected networks 100. These failure modes include but are not limited to physical severance of the connection between the router 104 and another router 104 or one or more of its connected networks 100, an error in the software internal to the router 104, physical damage to the router 104, loss of power, or another hardware problem with the router 104. If a router 104 fails in a way that is amenable to recovery (e.g., a software failure or a transient hardware failure), the router 104 can reset and reinitialize itself for operation.
During the reinitialization process, neighboring routers 1041, 1042, and 1043 typically detect the failure of router 1044. Each neighboring router 104 will then typically broadcast a route update message to the other routers 104. For example, a BGP router will provide an update including a withdrawn path identifying the failed router 104 as well as a replacement routing path circumventing the failed router 104. Unfortunately, this transmission of control traffic consumes time and bandwidth otherwise available to carry data between networks 100. If update messages from different routers 104 are sent at the same time, significant amounts of network bandwidth may be commandeered for the transmission of control traffic.
In response to these update messages, routing paths are recomputed to circumvent router 1044. For example, router 1041 may alter its routing tables to use routers 1043 and 1042 to reach network 1002, instead of using router 1044. Typically each router 104 will detect the failure and begin route recomputation at approximately the same time. This recomputation consumes processor time on the router's control plane, reducing the computational resources available for other management functions.
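The recomputation described above can be sketched as a path search that skips the failed router. The four-router topology below is an assumption made for illustration (FIG. 1 is not reproduced here), the names R1 through R4 are hypothetical stand-ins for routers 104, and breadth-first search is used only as a simple substitute for whatever algorithm a given routing protocol actually runs.

```python
from collections import deque

# Hypothetical topology: R1 reaches R2 either through R4 or through R3.
LINKS = {
    "R1": ["R4", "R3"],
    "R2": ["R4", "R3"],
    "R3": ["R1", "R2"],
    "R4": ["R1", "R2"],
}

def shortest_path(links, source, target, failed=frozenset()):
    """Breadth-first search for a fewest-hop path that avoids failed routers."""
    if source in failed or target in failed:
        return None
    parents = {source: None}  # also serves as the visited set
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if node == target:
            # Walk the parent links back to the source to rebuild the path.
            path = []
            while node is not None:
                path.append(node)
                node = parents[node]
            return path[::-1]
        for neighbor in links[node]:
            if neighbor not in parents and neighbor not in failed:
                parents[neighbor] = node
                queue.append(neighbor)
    return None  # no surviving path
```

With no failures, the search from R1 to R2 in this assumed topology goes through R4; marking R4 as failed forces the path through R3 instead. Every router that runs such a recomputation at the same moment spends control-plane processor time on it, which is the cost the passage describes.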
Moreover, it is not unusual for a router 104 to announce a new path in an update message, simultaneously receive a message from another router 104 invalidating the newly announced path, and subsequently issue another message withdrawing the path it has just announced, before computing and announcing a new path. This phenomenon is referred to as “network flap” and it potentially has several detrimental effects on network operations. As discussed above, network flap consumes computational resources and network bandwidth. Network flap can also destabilize neighboring routing domains, causing network destinations to be “unreachable” for significant periods of time, thereby disrupting data traffic.
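The announce, withdraw, and re-announce sequence described above can be made concrete with a small replay sketch. The tuple format for update messages, the destination name, and the router names are hypothetical; they do not correspond to the message format of any real routing protocol.

```python
def replay_updates(updates):
    """Apply (action, destination, path) updates in order.

    Returns the final table and, per destination, how many times its
    route changed state. A high count for one destination over a short
    period is the flapping behavior described in the text.
    """
    table = {}
    changes = {}
    for action, destination, path in updates:
        before = table.get(destination)
        if action == "announce":
            table[destination] = path
        elif action == "withdraw":
            table.pop(destination, None)
        else:
            raise ValueError(f"unknown action: {action}")
        if table.get(destination) != before:
            changes[destination] = changes.get(destination, 0) + 1
    return table, changes

# Hypothetical sequence: a path is announced, withdrawn after a neighbor
# invalidates it, and then replaced by a newly computed path.
UPDATES = [
    ("announce", "net-2", ("R1", "R3", "R2")),
    ("withdraw", "net-2", None),
    ("announce", "net-2", ("R1", "R5", "R2")),
]
```

Replaying UPDATES changes the route for the single destination three times, and each change corresponds to control traffic and recomputation on every router that receives the message.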
Therefore, there is a need for improved router recovery mechanisms that require fewer computational resources and less bandwidth than prior-art recovery mechanisms. Moreover, these mechanisms should operate without destabilizing neighboring routing domains, and should not depend on the retrofitting or modification of currently deployed network protocols. The present invention provides methods and apparatus related to such mechanisms.