In an Internet Protocol (IP) based computer network, data routing algorithms such as Open Shortest Path First (OSPF), Intermediate System-Intermediate System (IS-IS), and Routing Information Protocol (RIP) are used to determine the path that data packets travel through the network. When a link between two network routers fails, the routing algorithms are used to advertise the failure throughout the network.
Most routers can detect a local link failure relatively quickly, but it takes the network as a whole a much longer time to converge. This convergence time is typically on the order of 10-60 seconds depending on the routing algorithm and the size of the network. Eventually, all of the involved routers learn of the link failure and compute new routes for data packets to affected destinations. Once all the routers converge on a new set of routes, data packet forwarding proceeds normally.
Routing algorithms such as OSPF depend on the topology of the network. From that topology, each node computes the “next hop” routing segment for a packet having a particular source-destination pair. The combined next hop computations of the various nodes in the network define an end-to-end route through multiple nodes for each source-destination pair. However, routing algorithms such as OSPF do not take traffic conditions within the network into account. Thus, although only a small number of hops may exist between a particular source node and a particular destination node, the travel time of a packet emitted by the source node will depend strongly on the extent to which the resources of the intermediate links are busy processing traffic.
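The next hop computation described above can be illustrated with a minimal sketch. It assumes a shortest-path (Dijkstra) calculation over static link costs, in the spirit of OSPF; the function name `next_hops`, the four-node topology and the cost values are hypothetical and chosen only for illustration. Nothing in the calculation refers to how busy a link currently is.

```python
import heapq

def next_hops(topology, source):
    """Return {destination: next_hop} for shortest paths from source.

    topology maps node -> {neighbor: link_cost}. The costs are static
    values (an assumption of this sketch); current traffic load plays
    no part in the result.
    """
    dist = {source: 0}
    first_hop = {}
    heap = [(0, source, None)]
    while heap:
        d, node, hop = heapq.heappop(heap)
        if d > dist.get(node, float("inf")):
            continue  # stale heap entry, a shorter path was already found
        for nbr, cost in topology[node].items():
            nd = d + cost
            if nd < dist.get(nbr, float("inf")):
                dist[nbr] = nd
                # Leaving the source, the first hop is the neighbor itself;
                # further out, it is inherited from the node we came through.
                first_hop[nbr] = nbr if node == source else hop
                heapq.heappush(heap, (nd, nbr, first_hop[nbr]))
    return first_hop

# Hypothetical topology: A reaches D via B (cost 2) or via C (cost 3).
topo = {
    "A": {"B": 1, "C": 1},
    "B": {"A": 1, "D": 1},
    "C": {"A": 1, "D": 2},
    "D": {"B": 1, "C": 2},
}
print(next_hops(topo, "A"))
```

In this sketch node A always selects B as the next hop toward D, even if the links through B happen to be heavily loaded, because only the configured costs enter the computation.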
As a result, packets may experience a long, variable and unpredictable delay as they travel from source to destination. This property is inherent to the dynamic routing characteristics of OSPF and other routing algorithms and is known as “best effort” traffic delivery. The variability and unpredictability of the delay experienced by a packet are even worse following the occurrence of a link failure at some point along the route defined by the next hop information in each intermediate node. In order to recover from the failure, the nodes at either end of the failed link must detect the failure and update their next hop information in order to bypass the failed link.
Typically, some intermediate nodes not located on the original route from source to destination will suddenly become next hops in the alternate route intended to bypass the failed link. This not only forces such new intermediate nodes to spend time computing a set of next hops but also increases the amount of traffic passing through the new intermediate nodes.
The time taken by a node to detect a failure is known as the “detection time” and the time taken by all nodes to converge to an alternate route is known as the “hold-down time”. These times vary according to the routing algorithm used. In the case of the OSPF routing algorithm, the detection time is at least 0.05 seconds and the hold-down time is at least 2 seconds. In general, therefore, recovery from the failure of a link cannot complete before at least 0.05 + 2 = 2.05 seconds have elapsed. This minimum overall delay does not even take into consideration the additional delay due to congestion at the nodes or links encountered in the alternate path. Thus, the resulting delay will be on the order of seconds, which is intolerable as far as voice, video, medical or other mission-critical communications are concerned.
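To give a sense of scale for the minimum recovery time stated above, the following sketch adds the two OSPF figures and counts how many packets of a single flow would be emitted during that interval. The 20 ms packet spacing is an assumption made for illustration (a common packetization interval for voice); it does not come from the figures above, and the function names are hypothetical.

```python
def min_recovery_ms(detection_ms, hold_down_ms):
    """Lower bound on recovery time: detection plus hold-down."""
    return detection_ms + hold_down_ms

def packets_emitted_during(outage_ms, packet_interval_ms):
    """Whole packets a constant-rate source emits during the outage."""
    return outage_ms // packet_interval_ms

# OSPF minimums from the text, expressed in milliseconds.
outage = min_recovery_ms(50, 2000)

# Assumed flow: one packet every 20 ms (illustrative only).
affected = packets_emitted_during(outage, 20)
print(outage, affected)
```

Under these assumptions roughly one hundred consecutive packets of the flow are delayed or lost before the alternate route is even in place, and congestion on the alternate path would add to that.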
Furthermore, the choice of an alternate route may affect the reliability, speed and availability of virtual private networks (VPNs) already established by an Internet service provider (ISP) and paid for by its customers. To maintain customer satisfaction, the ISP may have to provide higher capacity equipment in order to handle any potential increase in traffic in the event of a failure. Due to the mesh architecture of the Internet, the ISP cannot pinpoint where a traffic increase is liable to occur and thus it may have to upgrade all the equipment in the region it serves. Clearly, this requires an added investment by the ISP in terms of high-capacity routers and transport equipment.
Moreover, while the network is converging after a link fails, transient loops can occur which consume valuable bandwidth. Loop prevention algorithms have been proposed to eliminate such transient loops. When using these algorithms, routes are pinned until the network has converged and the new routes have been proven to be loop-free. Although loop prevention algorithms have the advantage that data packets flowing on unaffected routes are not disrupted while transient loops are eliminated, their main drawback is that data packets directed out of a failed link get lost, or “black holed,” during the convergence process. Loop prevention algorithms also extend the convergence time somewhat while new routes are being verified to be loop-free.
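A transient loop of the kind described above can be reproduced with a small sketch. It assumes hypothetical per-node next hop tables for a single destination D, taken at a moment when node B has already reacted to the failure of its link to D but node A has not yet recomputed its route. The node names, tables and the `trace` function are illustrative only.

```python
def trace(tables, source, destination):
    """Follow next hop entries from source toward destination.

    Returns (path, looped). looped is True when the packet revisits a
    node, meaning the current tables contain a forwarding loop.
    """
    path = [source]
    node = source
    while node != destination:
        node = tables[node][destination]
        if node in path:
            return path + [node], True
        path.append(node)
    return path, False

# During convergence: B has switched to A, but A still points at B.
during = {
    "A": {"D": "B"},  # stale entry, computed before the failure
    "B": {"D": "A"},  # updated entry, B detected the failure locally
    "C": {"D": "D"},
}

# After convergence: A has recomputed and now forwards via C.
after = {
    "A": {"D": "C"},
    "B": {"D": "A"},
    "C": {"D": "D"},
}

print(trace(during, "A", "D"))
print(trace(after, "A", "D"))
```

With the stale table the packet bounces between A and B and consumes bandwidth on that link without making progress; once A has converged, the same packet reaches D through C.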
Clearly, the industry is in need of a protection switching mechanism that is sufficiently fast to prevent the loss of high-priority traffic ordinarily travelling through one or more failed links, without unpredictably overloading the remaining operational links during a protection mode.