As our national security, economy and even day-to-day life rely more and more on computer and telecommunication networks, avoiding prolonged disruptions to information exchange due to unexpected failures, such as a broken fiber link, becomes increasingly important. Hence, it is critical for a network to be survivable (or fault-tolerant).
Two known survivability schemes are protection and restoration. The major difference between the two is that in protection, recovery from a failure (e.g., the detour set-up and spare capacity allocation) is done at the time of connection setup or network design (i.e., prior to the failure), whereas in restoration, it is dynamically determined after the failure occurs. In general, protection schemes can recover more quickly as the detour is already determined (as long as the detour is not broken), but are less bandwidth efficient than restoration schemes. On the other hand, restoration schemes can survive one or multiple failures (as long as the destination is still reachable, and there is sufficient bandwidth), but they cannot guarantee the recovery time (including the time to find a detour), and/or the amount of information loss for real-time applications, making them unsuitable for mission-critical applications.
In designing a survivable network, the major challenges are how to allocate a minimal amount of spare resources (e.g., bandwidth), and thus achieve maximal efficiency, using scalable (e.g., fast polynomial-time) algorithms, and how to recover quickly when a failure occurs (i.e., by re-routing affected traffic over the spare resources). These issues are challenging because the objectives of optimizing bandwidth usage, algorithm complexity and recovery time often conflict with each other.
For instance, a common fault-recovery approach is (failure-independent) path protection, whereby for every mission-critical active path (AP) to be established, a link (or node) disjoint backup path (BP) is also established. One way to reduce the amount of spare bandwidth needed is to use shared path protection, which allows the new BP to share the bandwidth allocated to some existing BPs. In order to guarantee the recovery of all critical traffic after a link (or node) failure, two BPs can share bandwidth only if their corresponding APs are link (or node) disjoint (see FIG. 1a for an example). However, finding an optimal pair of link (or node) disjoint paths that minimizes the total bandwidth consumed by the request, given that bandwidth sharing is possible, is an NP-hard problem. In addition, the optimal pair (if found, e.g., by using branch-and-bound techniques) often includes a long BP (consisting of many “zero or super low cost” links on which the bandwidth allocated to existing BPs can be shared), and this results in a long recovery time. Existing efforts to achieve maximal bandwidth efficiency often resort to integer linear programming (ILP), which is not tractable for large-scale networks. Other existing heuristic approaches sacrifice bandwidth efficiency, or trade recovery time for bandwidth efficiency.
Others have attempted solutions in the past, and we examine these efforts briefly in terms of bandwidth efficiency, algorithm or implementation complexity, and recovery time. Usually, protection schemes can be classified into two types: those used for ring networks, and those used for mesh networks.
Protection schemes for rings are known. The concept of Self Healing Rings or SHR has been applied at the SDH/SONET, Tsong-Ho Wu, Fiber Network Service Survivability, Artech House, 1992, Tsong-Ho Wu, “Emerging technologies for fiber network survivability,” in IEEE Communications Magazine, Vol. 33, February 1995, pp. 58-59, 62-74, ATM, R. Kawamura, “Architectures for ATM network survivability,” in IEEE Communications Surveys, 1998, pp. 2-11, R. Kawamura and H. Ohta, “Architectures for ATM network survivability and their field deployment,” in IEEE Communications Magazine, Vol. 37, No. 8, August 1999, pp. 88-94, as well as WDM layers, P. Demeester et al., “Resilience in multilayer networks,” in IEEE Communications Magazine, Vol. 37, No. 8, 1999, pp. 70-75.
They can recover quickly (e.g., in 50 ms in SONET), as recovery is either based on 1+1 protection as in Unidirectional Path-Switched Rings or UPSRs, where the receiver (destination) selects the better signal among those arriving along two diverse routes, or based on loopback as in Bidirectional Line-Switched Rings or BLSRs (also called Shared Protection Rings or SPRINGs), which use a mechanism called Automatic Protection Switching (APS).
Depending on where a detour originates, mesh protection schemes can be classified into link protection (which re-routes from the immediate upstream node of a failed link), path protection (which re-routes at the source of a connection) or their variations (such as ring-based and non-ring based protection, etc.).
In link protection, for every link carrying traffic under normal (working) situations, called an active link, a backup segment or BS (here, the term “segment” loosely refers to one or more consecutive links), from one end of the link to the other end, is used as the detour. This is illustrated in FIG. 1b, where the two active links from node 1 to node 2, and from node 2 to node 3, respectively, are shown in bold, and their corresponding backup segments, denoted by BS1 and BS2, respectively, are shown in dashed lines.
In path protection, for every active path or AP from source S to destination D, a BP from S to D is used as the detour. Path protection can be either failure-dependent or failure-independent. A failure-independent approach means the BP has to be link (node) disjoint with the AP, in order to protect against any single link (respectively, node) failure. FIGS. 1a and 1c show two examples where BPs are node-disjoint and link-disjoint with their corresponding APs, respectively. Failure-independent path protection is more common than its failure-dependent counterpart, as the former can usually achieve a much faster recovery at little extra cost (in terms of bandwidth consumption), especially if the bandwidth along the non-broken part of the AP is released after the traffic is re-routed onto the BP.
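The disjointness requirement on an AP and its BP can be checked mechanically. The following is a minimal sketch, assuming each path is given as a sequence of node identifiers and links are undirected; the function names and the example topology are hypothetical and are not taken from the figures.

```python
def links_of(path):
    """Return the set of undirected links traversed by a node sequence."""
    return {frozenset(pair) for pair in zip(path, path[1:])}


def is_link_disjoint(ap, bp):
    """True if the active and backup paths have no link in common."""
    return links_of(ap).isdisjoint(links_of(bp))


def is_node_disjoint(ap, bp):
    """True if the paths share no link and no intermediate node.

    The end nodes S and D are necessarily common to both paths,
    so only the intermediate nodes are compared.
    """
    return is_link_disjoint(ap, bp) and set(ap[1:-1]).isdisjoint(bp[1:-1])


# Hypothetical example: one AP and two candidate BPs.
ap = ["S", "a", "D"]
bp_link = ["S", "b", "a", "c", "D"]   # re-uses node "a" but no AP link
bp_node = ["S", "b", "c", "D"]        # avoids every intermediate AP node

print(is_link_disjoint(ap, bp_link), is_node_disjoint(ap, bp_link))
print(is_link_disjoint(ap, bp_node), is_node_disjoint(ap, bp_node))
```

In this sketch, `bp_link` protects against any single link failure on the AP but not against the failure of node `a`, whereas `bp_node` protects against both.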
The major difference between link and path protection (even though FIGS. 1b and 1c look similar) is that in link protection, when only the (bold) link from nodes 2 to 3 fails, for example, traffic from nodes 1 to 3 will use the (bold) link from nodes 1 to 2, and then be re-routed to BS2; while in path protection, if the (bold) link from nodes 2 to 3 fails, the traffic from S will be re-routed to BS1 and BS2.
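The difference in where the detour originates can be illustrated with a short sketch. It assumes directed links written as `(upstream, downstream)` node pairs and paths written as node lists; the intermediate node names `c`, `d`, `x` and `y`, and both function names, are hypothetical and do not reproduce the topology of FIGS. 1b and 1c.

```python
def link_protection_route(active_path, backup_segments, failed_link):
    """Re-route locally: splice the failed link's backup segment into the AP."""
    for i in range(len(active_path) - 1):
        if (active_path[i], active_path[i + 1]) == failed_link:
            detour = list(backup_segments[failed_link])
            # Keep the AP up to the upstream node, then follow the detour,
            # then continue on the AP after the downstream node.
            return list(active_path[:i]) + detour + list(active_path[i + 2:])
    return list(active_path)  # the failure does not affect this path


def path_protection_route(active_path, backup_path, failed_link):
    """Re-route at the source: abandon the whole AP for the BP."""
    hops = list(zip(active_path, active_path[1:]))
    return list(backup_path) if failed_link in hops else list(active_path)


# Hypothetical example: AP 1 -> 2 -> 3 with the link 2 -> 3 failing.
ap = [1, 2, 3]
backup_segments = {(1, 2): [1, "a", "b", 2], (2, 3): [2, "c", "d", 3]}
bp = [1, "x", "y", 3]

link_route = link_protection_route(ap, backup_segments, (2, 3))
path_route = path_protection_route(ap, bp, (2, 3))
print(link_route)  # still uses the working link 1 -> 2
print(path_route)  # leaves the AP at the source
```

With link protection the traffic still traverses the unaffected active link before entering the detour; with path protection no part of the AP is used after the failure.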
In addition, in link or path protection, backup bandwidth can be shared or non-shared. Usually shared schemes are much more bandwidth efficient (and cost-effective) than non-shared schemes. An example of shared path protection is shown in FIG. 1a. Since a single link (or node) failure will not affect both AP1 and AP2 at the same time, whose bandwidth requirements are assumed to be w1 and w2 (units), respectively, their corresponding BPs can share bandwidth on the common link e. More specifically, the amount of backup bandwidth that needs to be reserved on link e is max{w1, w2} (instead of w1+w2).
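The max{w1, w2} rule generalizes to any number of connections under the single-link-failure assumption: on each link, reserve the largest total backup demand that any one failure can trigger. The following is a minimal sketch of that bookkeeping, assuming undirected links and paths given as node lists; the function name, node names and bandwidth values are hypothetical and do not reproduce FIG. 1a.

```python
from collections import defaultdict


def links_of(path):
    """Return the set of undirected links traversed by a node sequence."""
    return {frozenset(pair) for pair in zip(path, path[1:])}


def shared_backup_reservation(connections):
    """Backup bandwidth to reserve per link so any single link failure is survivable.

    connections: iterable of (active_path, backup_path, bandwidth) tuples.
    """
    # demand[e][f] = backup bandwidth needed on link e when link f fails
    demand = defaultdict(lambda: defaultdict(int))
    for ap, bp, w in connections:
        for f in links_of(ap):
            for e in links_of(bp):
                demand[e][f] += w
    # Only one link fails at a time, so reserve the worst case over failures.
    return {e: max(per_failure.values()) for e, per_failure in demand.items()}


# Hypothetical example: two link-disjoint APs whose BPs meet on link u-v.
w1, w2 = 3, 5
conn1 = (["S1", "a", "D1"], ["S1", "u", "v", "D1"], w1)
conn2 = (["S2", "b", "D2"], ["S2", "u", "v", "D2"], w2)
e = frozenset(("u", "v"))

reserved = shared_backup_reservation([conn1, conn2])
print(reserved[e])  # max(w1, w2), not w1 + w2
```

If a third connection re-used an active link of the first (so that the two APs were not link disjoint), its demand on the common backup link would add to w1 rather than share with it, which is exactly the sharing condition stated above.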
Ring-Based Protection is a variation of link or path protection, which uses the links in a mesh network to form a set of rings, and in general relies on loopback or APS for recovery as is similarly done in BLSRs. Ring-based approaches include node cover, O. J. Wasem, “Optimal topologies for survivable fiber optic networks using SONET self-healing rings,” in GLOBECOM'91, Vol. 3, 1991, pp. 57.5.1-57.5.7, O. J. Wasem, “An algorithm for designing rings for survivable fiber networks,” in IEEE Trans. on Reliability, Vol. 40, October 1991, pp. 428-432, and ring cover, G. Ellinas et al., “Protection cycle covers in optical networks with arbitrary mesh topologies,” in OFC'00, March 2000, G. Ellinas and T. E. Stern, “Automatic protection switching for link failures in optical networks with bi-directional links,” in GLOBECOM'96, Vol. 1, 1996, pp. 152-6, G. Ellinas, A. G. Hailemariam, and T. E. Stern, “Protection cycles in mesh WDM networks,” in IEEE Journal on Selected Areas in Communications, Vol. 18, No. 10, October 2000, pp. 1924-1937, W. D. Grover and D. Stamatelakis, “Cycle-oriented distributed preconfiguration: Ring-like speed with mesh-like capacity for self-planning network reconfiguration,” in IEEE International Conference on Communications (ICC'98), Vol. 1, 1998, pp. 537-43. The former chooses a set of rings that can cover all the nodes in a mesh network, but the traffic carried on any uncovered links in the mesh network cannot be protected against failure. The latter also chooses a set of rings, which may cover all the links as in the so-called Cycle Double Cover (CDC) approach, Ellinas, et al., supra, or only some of them as in the so-called Pre-configured protection cycle (P-Cycle) approach, Grover and Stamatelakis, supra, but in either case, every link failure can be recovered from.
The CDC approach covers each link in a mesh network with exactly two cycles of opposite directions. Though it improved upon the cycle cover methodology, L. M. Gardner, M. Heydari, et al., “Techniques for finding ring covers in survivable networks,” in GLOBECOM'94, San Francisco, Calif., November 1994, pp. 1862-1866, it has limited applicability because if the network is (or becomes) non-planar, it is only conjectured that a CDC exists, F. Jaeger, “A survey of the double cycle cover conjecture,” in Cycles in Graphs, North-Holland, Ed Annals of Discrete Mathematics 115, 1985, pp. 1-12. Even for a planar graph, it is difficult, if possible at all, to have small protection cycles (so that recovery can take place along shorter detours).
The P-Cycle approach provides a way to protect both covered (or on-cycle) and uncovered (off-cycle) links, resulting in better bandwidth efficiency, Grover and Stamatelakis, supra, D. Stamatelakis and W. D. Grover, “IP layer restoration and network planning based on virtual protection cycles,” in IEEE Journal on Selected Areas in Communications, Vol. 18, No. 10, October 2000, pp. 1938-1949. However, detours can also be long, and in addition, the number of p-cycles needed can be large, which requires complicated co-ordination amongst these p-cycles for the purpose of recovery. Also, obtaining optimal solutions is an NP-hard problem, D. Stamatelakis and W. D. Grover, “Theoretical underpinnings for the efficiency of restorable networks using preconfigured cycles (“p-cycles”),” in IEEE Transactions on Communications, Vol. 48, No. 8, August 2000, pp. 1262-1265, and different algorithms to select the p-cycles are needed for link and node failures (unlike in path protection where a simple transformation exists).
More recently, heuristic algorithms to route APs in wavelength-division multiplexed (WDM) mesh networks already “covered” with a set of rings were proposed in F. Poppe, H. D. Neve, and G. H. Petit, “Constrained shortest path first algorithm for lambda-switched mesh optical networks with logical overlay OCh/SP rings,” in IEEE Workshop on High Performance Switching and Routing, 2001, pp. 150-154. Heuristics to protect LSPs in MPLS networks by constructing rings from spanning trees rooted at every possible egress node were also studied in Radim Bartos and Mythilikanth Raman, “A heuristic approach to service restoration in MPLS networks,” in IEEE International Conference on Communications (ICC'01), Helsinki, Finland, June 2001, pp. 117-121. Though it was shown that the approach improved over the so-called Fast Rerouting, Dimitry Haskin and Ram Krishnan, “A method for setting an alternative label switched paths to handle fast reroute,” in Draft-haskin-mpls-fast-reroute-05, November 2000, and RSVP backup tunnels, D. O. Awduche, L. Berger, et al., “RSVP-TE: Extensions to RSVP for LSP tunnels,” in Draft-ietf-mpls-rsvp-lsp-tunnel-07, August 2000, Der-Hwa Gan, Ping Pan, et al., “A method for MPLS LSP fast-reroute using RSVP detours,” in Draft-gan-fast-reroute-00, April 2001, it has limited flexibility (as do other ring-based approaches). More specifically, because it requires that the protection paths for all APs that terminate at a given egress router be determined simultaneously, one cannot take advantage of the bandwidth available somewhere else to support efficient dynamic establishment of connections.
There are several approaches that do not require a ring cover (although BLSR-like loopback may still be used for recovery). In M. Medard, S. G. Finn, R. A. Barry, and R. G. Gallager, “Redundant trees for preplanned recovery in arbitrary vertex-redundant or edge-redundant graphs,” in IEEE/ACM Trans. on Networking, Vol. 7 No. 5, 1999, pp. 641-652, S. G. Finn, M. Medard, and R. A. Barry, “A novel approach to automatic protection switching using trees,” in ICC'97, 1997, pp. 272-276, redundant trees are constructed in such a way that for any link or node failure, every node remains connected to at least one of the trees. In S. G. Finn, M. Medard, and R. A. Barry, “A new algorithm for bi-directional link self-healing for arbitrary redundant networks,” in OFC'98, 1998, p. ThJ4, M. Medard, S. G. Finn, and R. A. Barry, “WDM loop-back recovery in mesh networks,” in INFOCOM'99, 1999, pp. 752-759, M. Medard, S. S. Lumetta, and Y. C. Tseng, “Capacity-efficient restoration for optical networks,” in OFC'00, 2000, pp. 207-9, ThO2, the Generalized Loopback approach, which constructs a primary digraph and the conjugate secondary digraph in a two/four-fiber mesh network, was proposed. When a link failure occurs, recovery starts from one end of a failed link, and follows the secondary digraph in a manner similar to loopback in BLSRs. In S. S. Lumetta, M. Medard, and Y. Tseng, “Capacity versus robustness: A tradeoff for link restoration in mesh networks,” in IEEE Journal of Lightwave Technology, Vol. 18, No. 12, December 2000, pp. 1765-1775, an extension of this approach was proposed, which logically removes some non-critical links in the secondary digraph (so they can carry non mission-critical traffic). This improves the bandwidth efficiency significantly, but results in longer detours and associated degradation of signal transmission quality.
Also related are the two approaches in Ching-Fong Su and Xun Su, “Protection path routing on WDM networks,” in Proceedings of OFC, 2001, pp. TuO2-1, Ching-Fong Su and Xun Su, “An online distributed protection algorithm in WDM networks,” in ICC'01, 2001, and Murali Kodialam and T. V. Lakshman, “Dynamic routing of locally restorable bandwidth guaranteed tunnels using aggregated link usage information,” in INFOCOM'01, 2001, pp. 376-385, respectively, where a detour for each link starts from its upstream node but either ends at the node next to the downstream node of the link, or can end at any downstream node (up to the destination of the connection). While they represent an interesting deviation from link/path protection and their ring-based variations, only Integer Linear Programming (ILP) formulations and/or ad hoc heuristics have been proposed. In addition, the bandwidth efficiency of both approaches, especially the first, can be low due to the need to find a detour for each link, and neither approach, especially the second, makes any attempt to limit the length of the detour.
Finally, in Pin-Han Ho and H. T. Mouftah, “A framework of a survivable optical internet using short leap shared protection (SLSP),” in 2001 IEEE Workshop on High Performance Switching and Routing, 2001, pp. 21-25, it is suggested that an AP be divided into several segments, each of which is protected using BLSRs. Again, only rudimentary exhaustive search algorithms (with backtracking) and heuristics were suggested, and no performance results in terms of bandwidth efficiency and recovery time (or the length of the detours) were provided. There have also been many IETF drafts on MPLS protection/restoration schemes (including, e.g., Haskin et al., Awduche et al., and Gan et al., supra). But none of them contains (or is supposed to contain) any implementation details such as algorithms or performance results, and it is clear that much work needs to be done in exploring the advantages of these protection schemes.
The protection schemes for ring networks have only 50% (or lower) bandwidth efficiency (i.e., the spare bandwidth used for protection is no less than that required to carry the working traffic). The bandwidth inefficiency problem is further exacerbated by the need to upgrade the bandwidth on all the links in an SDH/SONET ring (called the “fork-lift” requirement). In addition, detours (looped-back routes) can be very long, which not only wastes bandwidth, but also degrades signal transmission performance measures such as signal-to-noise ratio (SNR) and bit-error-rate (BER), making all-optical data communications difficult.
For protection schemes used in mesh networks, link protection uses “local” recovery (re-routing), which is why, in general, it can be faster than path protection which recovers at the source node only (with a few exceptions including the case of 1+1 link/path protection). In general, assuming that some intermediate nodes are capable of failure detection and re-routing, recovery time is proportional to the length of the backup segment that protects against a failure (and possibly the active segment which is affected by the failure as well).
On the other hand, link protection is less bandwidth efficient than path protection, B. Doshi et al., “Optical network design and restoration,” Bell Labs Technical Journal, pp. 58-84, January-March 1999, S. Ramamurthy and B. Mukherjee, “Survivable WDM mesh networks, part I - protection,” in INFOCOM'99, New York, USA, March 1999, pp. 21-25, S. Ramamurthy and B. Mukherjee, “Survivable WDM mesh networks, part II: restoration,” in ICC'99, Vol. 3, 1999, pp. 2023-30, Yijun Xiong and Lorne G. Mason, “Restoration strategies and spare capacity requirements in self-healing ATM networks,” in IEEE/ACM Trans. on Networking, Vol. 7, No. 1, 1999, pp. 98-110, S. Kuroyanagi and T. Nishi, “Optical path restoration schemes and cross-connect architectures,” in GLOBECOM'98, November 1998, pp. 2282-88. For example, as shown in FIG. 1, in link protection, a BS uses 3 links for every active link, thus the backup to active bandwidth ratio is 3 (when there is no backup bandwidth sharing), whereas in path protection, this ratio is 1.5 (also without backup bandwidth sharing).
As a variation of link or path protection, ring-based approaches generally are not bandwidth efficient. In addition, they do not adapt well to changes in the network topology, because such changes require major reconstruction of the desired set of rings. It is also found in S. S. Lumetta and M. Medard, “Towards a deeper understanding of link restoration algorithms for mesh networks,” in INFOCOM'01, Vol. 1, 2001, pp. 367-375, that they perform poorly in terms of the ability to recover from subsequent failures.
As for existing non-ring based protection schemes, none achieves better bandwidth efficiency than shared path protection while also having a much shorter backup segment and a scalable algorithm.
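The two ratios follow from simple link counting when every link carries the same bandwidth. As a sketch, let n be the number of active links, each protected by a 3-link backup segment as stated above. For path protection the text gives only the resulting ratio, so the hop counts |AP| = 2 and |BP| = 3 used here are assumed purely for illustration.

```latex
\rho_{\mathrm{link}} = \frac{\sum_{i=1}^{n} |BS_i|}{n} = \frac{3n}{n} = 3,
\qquad
\rho_{\mathrm{path}} = \frac{|BP|}{|AP|} = \frac{3}{2} = 1.5
```

Here each ratio is the total number of backup links divided by the total number of active links, without any backup bandwidth sharing.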