Network resiliency, defined as the ability of an IP network to recover quickly and smoothly from one or a series of failures or disruptions, is becoming increasingly important in the operation of modern IP networks. The recent large-scale deployment of delay- and loss-sensitive services such as VPN and IPTV imposes stringent requirements on the tolerable duration and level of disruptions to IP traffic. In a recent survey of major network carriers including AT&T, BT, and NTT, Telemark concludes that “The 3 elements which carriers are most concerned about when deploying communication services are network reliability, network usability and network fault processing capabilities” (See Telemark, “Telemark survey,” http://www.telemarkservices.com/, 2006). All three relate to network resiliency.
Unfortunately, the current techniques for fault processing to achieve resiliency are still far from ideal. Consider fast-rerouting (FRR) (See M. Shand and S. Bryant, “IP fast reroute framework,” IETF Internet-Draft, draft-ietf-rtgwg-ipfrr-framework-06.txt, 2007), the major currently deployed technique to handle network failures. As a major tier-1 ISP pointed out at Multi-Protocol Label Switching (MPLS) World Congress 2007, there are major practical challenges when using FRR in its business core network (See N. So and H. Huang, “Building a highly adaptive, resilient, and scalable MPLS backbone,” http://www.wandl.com/html/support/papersNerizonBusiness WANDL MPLS2007.pdf, 2007):
(a) Complexity: “the existing FRR bandwidth and preemption design quickly becomes too complicated when multiple FRR paths are set up to account for multiple failures;”
(b) Congestion: “multiple network element failure can cause domino effect on FRR reroute due to preemption which magnifies the problem and causes network instability;”
(c) No performance predictability: “service provider loses performance predictability due to the massive amount of combinations and permutations of the reroute scenarios.”
The importance of network resiliency has attracted major attention in the research community. Many mechanisms have been proposed to quickly detour around failed network devices (See P. Francois, C. Filsfils, J. Evans, and O. Bonaventure, “Achieving sub-second IGP convergence in large IP networks,” ACM Computer Communication Review, 35(3):35-44, 2005 (Francois et al. 2005), G. Iannaccone, C. Chuah, S. Bhattacharyya, and C. Diot, “Feasibility of IP restoration in a tier-1 backbone,” IEEE Network Magazine, 18(2):13-19, 2004 (Iannaccone et al. 2004), M. Motiwala, M. Elmore, N. Feamster, and S. Vempala, “Path splicing,” Proc. ACM SIGCOMM, 2008 (Motiwala et al. 2008), J. P. Vasseur, M. Pickavet, and P. Demeester, “Network Recovery: Protection and Restoration of Optical, SONET-SDH, and MPLS,” Morgan Kaufmann, 2004 (Vasseur et al. 2004)). The focus of these studies, however, was mainly on reachability (i.e., minimizing the duration in which routes are not available to a set of destinations). Hence, they do not address the aforementioned practical challenges, in particular congestion and performance predictability.
It is crucial to consider congestion and performance predictability when recovering from failures. Since the overall network capacity is reduced after failures, if the remaining network resources are not efficiently utilized, serious congestion may occur. As observed in a measurement study on a major IP backbone (See S. Iyer, S. Bhattacharyya, N. Taft, and C. Diot, “An approach to alleviate link overload as observed on an IP backbone,” Proc. IEEE INFOCOM, April 2003 (Iyer et al. 2003)), network congestion is mostly caused by traffic that has been rerouted due to link failures. Meanwhile, it has been shown that focusing only on reachability may lead to long periods of serious congestion and thus violation of service level agreements (SLAs).
However, it may be challenging to derive a routing protection scheme that offers performance predictability and avoids congestion. The main difficulty may lie in the vast number of failure scenarios, which grows exponentially with the number of links to be considered. Consider a tier-1 ISP network with 500 links, and assume that the network operator would like to find a routing protection plan that protects against 3 simultaneous link failures. The number of such scenarios exceeds 20 million! Despite much progress on intra-domain traffic engineering, optimizing the routing simultaneously for just a few hundred network topologies is already beyond the means of any existing technique. As a result, existing routing protection schemes have to either focus exclusively on reachability (hoping that congestion does not occur), or consider only a single link failure (which is insufficient as SLAs become ever more demanding).
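The scenario count in the 500-link example can be checked with a short calculation. The following is a minimal sketch, assuming that a failure scenario is simply an unordered set of distinct failed links; the function name and its parameters are hypothetical and introduced only for illustration.

```python
from math import comb


def count_failure_scenarios(num_links: int, num_failures: int,
                            exactly: bool = True) -> int:
    """Count distinct sets of simultaneously failed links.

    Assumption: a scenario is an unordered set of distinct links.
    With exactly=True, count sets of size num_failures only;
    otherwise count all sets of size 1 through num_failures.
    """
    if exactly:
        return comb(num_links, num_failures)
    return sum(comb(num_links, k) for k in range(1, num_failures + 1))


# The 500-link, 3-failure example from the text.
exactly_three = count_failure_scenarios(500, 3)
up_to_three = count_failure_scenarios(500, 3, exactly=False)

print(exactly_three)  # 20708500
print(up_to_three)    # 20833750
```

Under this assumption, both the exactly-3 count and the up-to-3 count exceed 20 million, so the figure holds under either reading of the example.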
Therefore, there is a need for a method and system for deriving a routing protection scheme to provide predictable performance and avoid congestion under one or a series of failures in an IP network.