§1.1 Field of the Invention
The present invention concerns Internet Protocol (“IP”) networks. In particular, the present invention concerns recovery from shared risk link group failures using rerouting schemes that determine a node, within an IP network, used for rerouting, wherein the exit address of the determined node is used for IP-in-IP encapsulation.
§1.2 Background Information
With the Internet providing services to more critical applications, achieving high survivability under various types of network failures has become increasingly important. In particular, it is highly desirable that services interrupted by network failures resume within a very short period of time to minimize potential loss. (See, e.g., S. Rai, B. Mukherjee, and O. Deshpande, “IP resilience within an autonomous system: current approaches, challenges, and future directions,” IEEE Commun. Mag., Vol. 43, No. 10, pp. 142-149, October 2005.) Fast failure recovery is critical to applications such as remote medical services, real-time media delivery, stock-trading systems, and online gaming, where a long disruption could cause a tremendous loss.
Failures are common in today's networks, whether because of maintenance mistakes or accidents (e.g., fiber cuts, interface malfunctions, software bugs, misconfiguration, etc.). Despite continuous technological advances, such failures have not been completely avoided. Indeed, statistics show that failures occur quite frequently, even in well-maintained backbones. (See, e.g., A. Markopoulou, G. Iannaccone, S. Bhattacharyya, C.-N. Chuah, and C. Diot, “Characterization of failures in an IP backbone,” in IEEE INFOCOM, March 2004.) It is widely believed that failures will remain unavoidable in the Internet in the foreseeable future, which makes the demand for high-performance failure recovery solutions even more urgent.
In today's IP networks, failures can be recovered from by advertising the failures throughout the network, performing route recalculation, and updating forwarding tables at each affected router. This scheme, while theoretically sound, could cause long service disruptions. (See, e.g., M. Shand and S. Bryant, “IP fast reroute framework,” Internet-Draft (work in progress), February 2008. [Online]. Available: http://tools.ietf.org/html/draft-ietf-rtgwg-ipfrr-framework-08, C. Labovitz, A. Ahuja, A. Bose, and F. Jahanian, “Delayed internet routing convergence,” in SIGCOMM, 2000, pp. 175-187, and “Delayed internet routing convergence,” IEEE/ACM Trans. Netw., Vol. 9, No. 3, pp. 293-306, June 2001.) To achieve fast failure recovery, most IP networks rely on lower layer protection such as label switched path (“LSP”) protection in multiprotocol label switching (“MPLS”) networks, automatic protection switching (“APS”) in a synchronous optical network (“SONET”), and lightpath protection in IP over wavelength division multiplexing (“WDM”) networks. (See, e.g., V. Sharma and F. Hellstrand, “Framework for Multi-Protocol Label Switching (MPLS)-based Recovery,” RFC 3469 (Informational), February 2003. [Online]. Available: http://www.ietf.org/rfc/rfc3469.txt, T.-H. Wu and R. C. Lau, “A class of self-healing ring architectures for SONET network applications,” IEEE Trans. Commun., vol. 40, no. 11, pp. 1746-1756, November 1992, K. Kompella and Y. Rekhter, “OSPF Extensions in Support of Generalized Multi-Protocol Label Switching (GMPLS),” RFC 4203 (Proposed Standard), October 2005. [Online]. Available: http://www.ietf.org/rfc/rfc4203.txt, W. Lai and D. McDysan, “Network Hierarchy and Multilayer Survivability,” RFC 3386 (Informational), November 2002. [Online]. Available: http://www.ietf.org/rfc/rfc3386.txt, D. Papadimitriou and E. Mannie, “Analysis of Generalized Multi-Protocol Label Switching (GMPLS)-based Recovery Mechanisms (including Protection and Restoration),” RFC 4428 (Informational), March 2006. [Online]. Available: http://www.ietf.org/rfc/rfc4428.txt, L. Sahasrabuddhe, S. Ramamurthy, and B. Mukherjee, “Fault management in IP-over-WDM networks: WDM protection versus IP restoration,” IEEE J. Sel. Areas Commun., Vol. 20, No. 1, pp. 21-33, January 2002, D. Zhou and S. Subramaniam, “Survivability in optical networks,” IEEE Netw., Vol. 14, No. 6, pp. 16-23, November/December 2000, and S. Ramamurthy and B. Mukherjee, “Survivable WDM Mesh Networks part I-protection,” in Proc. IEEE INFOCOM, Vol. 2, 1999, pp. 744-751.) In such schemes, for each working path, a link (or node) disjoint backup path is established. When a failure occurs on a working path, the traffic is immediately switched to the corresponding backup path to resume the service. In 1+1 protection, each protection path reserves dedicated bandwidth. Unfortunately, this incurs high costs because the bandwidth on the protection paths is not used in normal operation. To improve resource utilization, multiple protection paths can be designed to share bandwidth as long as they will not be in use simultaneously (i.e., the corresponding working paths will not fail at the same time), which is called shared path protection. (See, e.g., Y. Xiong, D. Xu, and C. Qiao, “Achieving Fast and Bandwidth-Efficient shared-path protection,” J. Lightw. Technol., vol. 21, no. 2, pp. 365-371, 2003 and D. Xu, C. Qiao, and Y. Xiong, “Ultrafast Potential-Backup-Cost (PBC)-based shared path protection schemes,” J. Lightw. Technol., vol. 25, no. 8, pp. 2251-2259, 2007.) Although path protection is effective, it has the disadvantage of low resource utilization and introduces extra complexity in network design and maintenance. More importantly, using lower layer protection means that the IP layer cannot achieve survivability independently.
In an IP over wavelength-division multiplexing (“WDM”) architecture, the logical IP topology is built on top of the physical network, where routers are interconnected through wavelength channels, as shown in FIG. 1. Since each fiber carries multiple wavelength channels, a failure on a fiber results in multiple simultaneous logical link failures in the IP network. Logical links that fail together in this way are called a shared risk link group (“SRLG”). (See, e.g., L. Shen, X. Yang, and B. Ramamurthy, “Shared risk link group (SRLG)-diverse path provisioning under hybrid service level agreements in wavelength-routed optical mesh networks,” IEEE/ACM Trans. Netw., Vol. 13, No. 4, pp. 918-931, August 2005, and D. Xu, Y. Xiong, C. Qiao, and G. Li, “Failure protection in layered networks with shared risk link groups,” IEEE Netw., Vol. 18, No. 3, pp. 36-41, May 2004.) In FIG. 1, when a fiber cut (depicted by an “X”) occurs, it causes three (3) logical link failures: R1-R3; R2-R3; and R2-R4. The traditional solutions for SRLG failure recovery are to set up a protection wavelength for each logical link, or to establish a backup fiber to protect each fiber. Such protection requires considerable redundant bandwidth and introduces design complexity.
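The relationship between a fiber and its SRLG can be sketched as a simple lookup. The following is only an illustration under stated assumptions: the fiber label `fiber_x` and the helper `failed_logical_links` are hypothetical names, and only the three logical links named for FIG. 1 are taken from the text.

```python
# Hypothetical mapping from a physical fiber to the logical IP links
# (wavelength channels) it carries. Only the three links named for
# FIG. 1 come from the text; the fiber label is an assumption.
SRLG_BY_FIBER = {
    "fiber_x": {("R1", "R3"), ("R2", "R3"), ("R2", "R4")},
}


def failed_logical_links(cut_fibers, srlg_by_fiber=SRLG_BY_FIBER):
    """Return every logical link that fails when the given fibers are cut."""
    failed = set()
    for fiber in cut_fibers:
        # A fiber with no entry carries no known logical links.
        failed |= srlg_by_fiber.get(fiber, set())
    return failed


links = failed_logical_links(["fiber_x"])
print(sorted(links))
```

A single physical event therefore maps to a set of IP-layer failures, which is why a recovery scheme designed only for single-link failures may not be sufficient here.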
§1.2.1 IP Fast Reroute and Related Work
Recently, a scheme called IP Fast Reroute (“IPFRR”) was proposed to achieve ultra-fast failure recovery in the IP layer without specific requirements on the lower layers. (See, e.g., M. Shand and S. Bryant, “IP fast reroute framework,” Internet-Draft (work in progress), February 2008. [Online]. Available: http://tools.ietf.org/html/draft-ietf-rtgwg-ipfrr-framework-08, M. Shand, S. Bryant, and S. Previdi, “IP fast reroute using not-via addresses,” Internet-Draft (work in progress), February 2008. [Online]. Available: http://www.ietf.org/internet-drafts/draftbryant-shand-ipfrr-notvia-addresses-02.txt, A. Atlas and A. Zinin, “Basic specification for IP fast-reroute: loop-free alternates,” Internet-Draft (work in progress), February 2008. [Online]. Available: http://www.ietf.org/internet-drafts/draft-ietf-rtgwg-ipfrr-specbase-11.txt, C. Perkins, “IP Encapsulation within IP,” RFC 2003 (Proposed Standard), October 1996. [Online]. Available: http://www.ietf.org/rfc/rfc2003.txt, S. Lee, Y. Yu, S. Nelakuditi, Z. Zhang, and C.-N. Chuah, “Proactive vs reactive approaches to failure resilient routing,” in IEEE INFOCOM, March 2004, Z. Zhong, S. Nelakuditi, Y. Yu, S. Lee, J. Wang, and C.-N. Chuah, “Failure inferencing based fast rerouting for handling transient link and node failures,” in IEEE Global Internet, March 2005 and A. Kvalbein et al., “On failure detection algorithms in overlay networks,” in IEEE INFOCOM, April 2006.) The basic idea is to let each router proactively find an alternate port for each destination (that is, a port different from its primary forwarding port). In normal operation, the alternate port is not used. After a failure is detected on the primary port, the alternate port is immediately used for packet forwarding. FIG. 2 shows an example of IPFRR in which node g sets g→h as the alternate port to node a. In normal operation, packets going to node a are forwarded through {g,b,a}. When link (or port) g→b fails, the alternate port is immediately used to forward packets through {g,h,e,c,a}.
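The alternate-port idea can be sketched as a forwarding table that stores two next hops per destination. This is a minimal illustration, not a description of any router implementation: the table layout, the name `FIB_G`, and the function `next_hop` are assumptions, while the primary (b) and alternate (h) ports for destination a follow the FIG. 2 example.

```python
# Hypothetical forwarding entry at node g for destination a (FIG. 2):
# the primary next hop is b, and the precomputed alternate is h.
FIB_G = {"a": {"primary": "b", "alternate": "h"}}


def next_hop(fib, dst, failed_ports):
    """Use the primary port unless it is known to have failed."""
    entry = fib[dst]
    if entry["primary"] not in failed_ports:
        return entry["primary"]
    if entry["alternate"] not in failed_ports:
        return entry["alternate"]   # switch over with no recalculation
    return None                     # no usable port for this destination


print(next_hop(FIB_G, "a", failed_ports=set()))   # normal operation
print(next_hop(FIB_G, "a", failed_ports={"b"}))   # after g→b fails
```

Because the alternate is looked up rather than computed at failure time, the switch-over cost in this sketch is a single table access.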
Since such alternate ports are calculated and configured in advance, IPFRR can achieve ultra-fast failure recovery. A comparison between traditional route recalculation and IPFRR is illustrated by FIGS. 3A and 3B. As shown in FIG. 3A, with route recalculation, the service disruption lasts until the failure advertising, route recalculation, and forwarding table updates are completed. In contrast, as shown in FIG. 3B, the service disruption using IPFRR is greatly shortened by resuming packet forwarding immediately after the failure is detected. In parallel with IPFRR, traditional failure advertising, route recalculation, convergence, load balancing, routing table updates, etc., can be performed. Because service has already been restored while these recalculation activities occur, the network can tolerate the longer time they need.
There are two main challenges when designing IPFRR schemes. The first challenge is ensuring loop-free rerouting. That is, when a node sends packets through its alternate port, the packets must not return to the same node. The second challenge is guaranteeing 100% failure recovery (that is, ensuring recovery from every potential failure).
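The loop-free requirement can be checked by tracing a packet hop by hop and verifying that it never revisits a node. The sketch below is only an illustration: the function `trace_route` and both next-hop tables are hypothetical. The first table follows the path {g,h,e,c,a} from the FIG. 2 example; the second is an invented counterexample in which node h would send the packet back to g.

```python
def trace_route(next_hop, src, dst):
    """Follow per-node next hops toward dst.

    Returns (path, ok), where ok is False if the packet hits a node
    with no forwarding entry or returns to a node already visited.
    """
    path, seen = [src], {src}
    node = src
    while node != dst:
        node = next_hop.get(node)
        if node is None:
            return path, False      # dead end: no forwarding entry
        path.append(node)
        if node in seen:
            return path, False      # loop: packet came back
        seen.add(node)
    return path, True


# Next hops toward destination a after g switches to its alternate port h.
loop_free = {"g": "h", "h": "e", "e": "c", "c": "a"}
# Invented counterexample: h forwards traffic for a back through g.
looping = {"g": "h", "h": "g"}

print(trace_route(loop_free, "g", "a"))
print(trace_route(looping, "g", "a"))
```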
Existing research on IPFRR focuses mainly on single-link and single-node failures in the IP layer. Examples include failure insensitive routing (“FIR”) (See, e.g., S. Lee, Y. Yu, S. Nelakuditi, Z. Zhang, and C.-N. Chuah, “Proactive vs reactive approaches to failure resilient routing,” in IEEE INFOCOM, March 2004, Z. Zhong, S. Nelakuditi, Y. Yu, S. Lee, J. Wang, and C.-N. Chuah, “Failure inferencing based fast rerouting for handling transient link and node failures,” in IEEE Global Internet, March 2005.), multiple routing configuration (“MRC”) (See, e.g., “Fast IP network recovery using multiple routing configurations,” in IEEE INFOCOM, April 2006.), routing with path diversity (See, e.g., X. Yang and D. Wetherall, “Source selectable path diversity via routing deflections,” in ACM Sigcomm, 2006.), and efficient scan for alternate paths (“ESCAP”) (See, e.g., K. Xi and H. J. Chao, “IP fast rerouting for single-link/node failure recovery,” in IEEE BroadNets, 2007, “ESCAP: Efficient scan for alternate paths to achieve IP fast rerouting,” in IEEE Globecom, 2007.). One scheme that handles SRLG failures is called NotVia. (See, e.g., M. Shand, S. Bryant, and S. Previdi, “IP fast reroute using not-via addresses,” Internet-Draft (work in progress), February 2008. [Online]. Available: http://www.ietf.org/internet-drafts/draftbryant-shand-ipfrr-notviaaddresses-02.txt.) Its principle can be explained using node g in FIG. 2:
1) For potential failure b-g, NotVia removes link b-g, gives node b a special IP address bg, and calculates a path from g to bg, which is {g,h,e,c,a,b};
2) The calculated path is installed in nodes g,h,e,c and a so that they know how to forward packets whose destination addresses are bg;
3) When node g detects a failure on link b-g and then receives a packet {src=x,dst=a} {payload}, it encapsulates the packet as {{src=g,dst=bg} {src=x,dst=a} {payload}}. Since the new packet uses bg as the destination address, it will reach node b through {g,h,e,c,a,b}. This is called IP-in-IP tunneling. (See, e.g., W. Simpson, “IP in IP Tunneling,” RFC 1853 (Informational), October 1995. [Online]. Available: http://www.ietf.org/rfc/rfc1853.txt.)
4) Receiving the encapsulated packet, node b performs decapsulation by removing the outer IP header. The inner part is the original IP packet and is forwarded to node a.
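Steps 2) through 4) above can be sketched as follows. This is a simplified illustration rather than an implementation of the NotVia draft: packets are modeled as plain dictionaries, and the names `NOTVIA_NEXT_HOP`, `encapsulate`, `decapsulate`, and `tunnel` are hypothetical. The installed next hops follow the path {g,h,e,c,a,b} given in step 1).

```python
# Step 2: next hops for the not-via address "bg", installed along the
# path {g,h,e,c,a,b} given in the text.
NOTVIA_NEXT_HOP = {"g": "h", "h": "e", "e": "c", "c": "a", "a": "b"}


def encapsulate(packet, src, dst):
    """Step 3: wrap the original packet in an outer IP header."""
    return {"src": src, "dst": dst, "inner": packet}


def decapsulate(outer):
    """Step 4: strip the outer header to recover the original packet."""
    return outer["inner"]


def tunnel(packet, entry, exit_node, notvia_addr):
    """Carry a packet from entry to exit_node along the not-via path."""
    outer = encapsulate(packet, entry, notvia_addr)
    hops, node = [entry], entry
    while node != exit_node:
        # Intermediate nodes look only at the outer destination address.
        node = NOTVIA_NEXT_HOP[node]
        hops.append(node)
    return hops, decapsulate(outer)


original = {"src": "x", "dst": "a", "payload": "data"}
hops, delivered = tunnel(original, "g", "b", "bg")
print(hops)
print(delivered == original)
```

After decapsulation at node b, the inner packet still carries dst=a, so b forwards it to a by ordinary routing.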
This example shows that NotVia is similar to the link-based protection in MPLS, where the special address bg works like a label to control the forwarding at each node such that the protection path does not traverse the failure. This method can be easily extended to cover SRLG failures. The only modification is to remove all the SRLG links in the first step when calculating the protection paths. As with MPLS link-based protection, NotVia may produce long paths in certain situations. In the example, the shortest path from g to a that avoids the failed link is {g,h,e,c,a}, while the actual path is {g,h,e,c,a,b,a}. As this example illustrates, NotVia produces two unnecessary hops: a→b and b→a.
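The path-length penalty can be reproduced with a small shortest-path computation. The sketch below rests on several assumptions: the link list is reconstructed only from the paths named in the text for FIG. 2 (the actual figure may contain more links), all links have unit cost, and `shortest_path` is a hypothetical helper. Its `removed` argument accepts any number of links, so an SRLG would be handled by passing every link in the group.

```python
from collections import deque

# Links assumed from the paths {g,b,a} and {g,h,e,c,a,b} in the text.
LINKS = [("g", "b"), ("b", "a"), ("g", "h"),
         ("h", "e"), ("e", "c"), ("c", "a")]


def shortest_path(links, src, dst, removed=()):
    """Breadth-first search (unit link costs) that skips removed links."""
    skip = {frozenset(link) for link in removed}
    adj = {}
    for u, v in links:
        if frozenset((u, v)) in skip:
            continue
        adj.setdefault(u, []).append(v)
        adj.setdefault(v, []).append(u)
    prev, queue = {src: None}, deque([src])
    while queue:
        node = queue.popleft()
        if node == dst:
            path = []
            while node is not None:
                path.append(node)
                node = prev[node]
            return path[::-1]
        for nbr in adj.get(node, []):
            if nbr not in prev:
                prev[nbr] = node
                queue.append(nbr)
    return None  # dst unreachable once the links are removed


failed = [("b", "g")]
# NotVia: tunnel from g to b avoiding b-g, then b forwards to a.
to_b = shortest_path(LINKS, "g", "b", removed=failed)
actual = to_b + shortest_path(LINKS, "b", "a")[1:]
# Best possible path from g to a once b-g is gone.
direct = shortest_path(LINKS, "g", "a", removed=failed)

print(actual)
print(direct)
print(len(actual) - len(direct))   # number of extra hops
```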