Today's highly complex networks, such as the Internet, comprise thousands of routers and myriad links connecting them. This complex mesh enables almost any machine (e.g. a client or server) to reach almost any other machine, and provides great flexibility in determining the route from one machine to another. Because of the number of components involved, and their complex and delicate nature, failures in such a network are inevitable. These failures may be caused by software crashes, hardware defects, or human error (e.g. someone accidentally unplugging a card from a slot, or construction work severing a fiber or cable). Because failures are inevitable, it is important to implement one or more failure recovery mechanisms in a network so that when a failure does occur, it does not unduly disrupt network traffic or lead to catastrophic results. Overall, a failure recovery mechanism should minimize the impact that a failure has on the network. With this goal in mind, several recovery strategies have been developed and implemented in the prior art.
A first strategy, typically implemented at the physical layer of a network, involves the use of redundant links. Under this approach, two separate links are provided between any two components for which failure recovery is implemented. One link carries all of the traffic while the other remains idle. When a failure is detected on the currently active link, all of the traffic is detoured to the previously idle link, which thereafter carries all of the traffic. Because this strategy is implemented at the physical layer of the network, the switchover between links is transparent to components at the upper layers of the network. Hence, recovery from the failure is carried out seamlessly and transparently.
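The active/standby switchover described above can be sketched in a few lines. The following Python is an illustrative model only; the class and method names are hypothetical and do not correspond to any real equipment API:

```python
# Hypothetical sketch of physical-layer link redundancy: all traffic rides
# the active link, and on failure it is detoured to the previously idle one.

class RedundantLinkPair:
    def __init__(self):
        self.links = {"primary": True, "backup": True}  # True = link is up
        self.active = "primary"     # the backup stays idle until a failure

    def send(self, frame):
        if not self.links[self.active]:
            self._switch_over()
        if not self.links[self.active]:
            raise RuntimeError("both links have failed")
        return (self.active, frame)  # frame departs on the active link

    def report_failure(self, link):
        # Failure detection is modeled as an explicit notification.
        self.links[link] = False
        if link == self.active:
            self._switch_over()

    def _switch_over(self):
        # Detour all traffic to the other link, if it is still up.
        other = "backup" if self.active == "primary" else "primary"
        if self.links[other]:
            self.active = other
```

Because the detour happens entirely inside the pair object, callers (standing in for the upper layers) never need to know which physical link carried a given frame, which mirrors the transparency property described above.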
This approach has a number of significant drawbacks, however. The first is high cost. Because duplicate links must be maintained between every pair of components for which failure recovery is desired, the cost of the network in terms of links is doubled. Network links are already expensive, and doubling that cost renders this approach impracticable in many implementations. A second drawback is inefficiency: since only one of the two links is used at any one time, the best link utilization that can be achieved is 50%. A third drawback is relatively slow recovery. In an optical network, for example, where this approach is implemented at the physical SONET layer using APS (automatic protection switching), it takes approximately 50 ms (milliseconds) to complete a recovery once a failure has been detected. In terms of network traffic, 50 ms is a fairly long time, especially since all traffic directed to the failed link during the recovery period is lost. Given these shortcomings, the redundant links approach does not provide satisfactory results.
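To make the inefficiency and loss figures concrete, here is a small illustrative calculation. The 10 Gb/s line rate is an assumed example figure, not taken from the text above:

```python
# Illustrative arithmetic for the redundant-link drawbacks.
# The 10 Gb/s line rate is an assumed example, not from the source text.

LINE_RATE_BPS = 10_000_000_000   # assumed 10 Gb/s per link

# Efficiency: two links are provisioned but only one ever carries traffic,
# so usable capacity is at most half of what was paid for.
efficiency = 1 / 2               # 0.5, i.e. 50% at best

# Loss during a 50 ms APS switchover: everything sent toward the failed
# link during the recovery period is dropped.
RECOVERY_S = 0.050               # ~50 ms recovery once failure is detected
bytes_lost = int(LINE_RATE_BPS * RECOVERY_S / 8)   # 62,500,000 bytes
```

At the assumed rate, a single 50 ms switchover drops roughly 62.5 MB of traffic, which illustrates why even a "fast" physical-layer recovery is costly.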
Another approach that has been implemented involves the use of topology information at the routing layer of a network. Under this approach, whenever a router detects a failure adjacent to itself (e.g. a link failure or a router failure), the router: (1) updates its topology information and forwarding tables to route around the failed link or router, so that the failed component is no longer referenced or used; and (2) broadcasts information pertaining to the failure to all of its adjacent routers. This information broadcast may be made using a routing protocol such as a link state protocol, for example IS-IS (Intermediate System to Intermediate System) or OSPF (Open Shortest Path First), or a path vector protocol such as BGP (Border Gateway Protocol). Upon receiving the failure information, each adjacent router in turn: (1) updates its topology information and forwarding tables to route around the failed component; and (2) broadcasts the failure information to all of its adjacent routers. As the failure information propagates from router to router in this manner, the topology information for the entire network eventually converges to the point where none of the routers in the network references or sends traffic to the failed component anymore. Once that convergence takes place, the failed component is no longer used, and recovery from the failure is complete.
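The two-step reaction described above can be sketched as follows: route around the failure locally, then flood the news so that neighbors do the same. This is an illustrative Python model with hypothetical names, not an implementation of IS-IS, OSPF, or BGP; real protocols add sequence numbers, hold-down timers, and reliable flooding:

```python
# Illustrative model of routing-layer failure recovery (hypothetical names).
# On detecting an adjacent failure, a router (1) removes the failed link
# from its topology view and recomputes its forwarding table, and (2) floods
# the failure notice to its neighbors, which repeat both steps.
import copy
import heapq

class Router:
    def __init__(self, name, topology, fleet):
        self.name = name
        self.topology = topology   # this router's own view: node -> {neighbor: cost}
        self.fleet = fleet         # name -> Router; stands in for the physical links
        self.next_hop = {}         # forwarding table: destination -> first hop

    def recompute(self):
        # Dijkstra over this router's current view, recording first hops.
        dist = {self.name: 0}
        first = {}
        heap = [(0, self.name, None)]
        while heap:
            d, node, hop = heapq.heappop(heap)
            if d > dist.get(node, float("inf")):
                continue           # stale heap entry
            if hop is not None:
                first[node] = hop
            for nbr, cost in self.topology.get(node, {}).items():
                nd = d + cost
                if nd < dist.get(nbr, float("inf")):
                    dist[nbr] = nd
                    heapq.heappush(heap, (nd, nbr, nbr if node == self.name else hop))
        self.next_hop = first

    def on_link_failure(self, a, b):
        if self.topology.get(a, {}).pop(b, None) is None:
            return                 # already knew about this failure; stop flooding
        self.topology.get(b, {}).pop(a, None)
        self.recompute()           # step 1: route around the failed link
        for nbr in self.topology.get(self.name, {}):
            self.fleet[nbr].on_link_failure(a, b)   # step 2: flood the news

# Demo: a triangle of routers A, B, C; the A-B link then fails.
topo = {"A": {"B": 1, "C": 1}, "B": {"A": 1, "C": 1}, "C": {"A": 1, "B": 1}}
fleet = {}
for n in topo:
    fleet[n] = Router(n, copy.deepcopy(topo), fleet)
for r in fleet.values():
    r.recompute()
fleet["A"].on_link_failure("A", "B")   # A detects the adjacent failure
```

After convergence, A forwards traffic for B via C and B reaches A via C, so no forwarding table references the failed A-B link; the dedup check in `on_link_failure` is what makes the flooding terminate.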
The main problem with this approach is that it is extremely slow. In a typical large network, it takes approximately 30 seconds for the topology information of the entire network to converge. During this time, traffic continues to be routed to the failed component and dropped. In 30 seconds, a vast amount of traffic can be lost, and if any of that traffic is time-critical (such as streaming video or audio) or unrecoverable, the consequences can be grave. In short, this approach is too slow to be practicable in many, if not most, implementations. As a result, an improved mechanism for recovering from a network failure is needed.