MPLS Technology
MPLS is a relatively new technology for fast delivery of packet-based traffic along pre-established logical paths called label switched paths (LSPs, a.k.a. tunnels), which can be provisioned over virtually any packet transport technology. MPLS supports traffic engineering (TE) to optimize the usage of network resources, and is designed to offer reliable traffic delivery with predictable quality of service (QoS) and capacity (a.k.a., bandwidth) guarantees. MPLS uses a notion called a label to identify, classify and forward data over LSPs.
A point-to-point (P2P) LSP delivers traffic from a source (a.k.a., ingress) node (a.k.a., label switching router, LSR) downstream to a destination (a.k.a., egress) LSR. The LSP may traverse intermediate (a.k.a., transit) LSRs.
FIG. 1 illustrates a P2P LSP that originates at ingress LSR1, traverses transit LSR2 (from port “A” to port “B”) and transit LSR3, and then terminates at egress LSR4. The LSP path may be summarized as 1-2-3-4.
A point-to-multipoint (P2MP) LSP delivers traffic from an ingress LSR (a.k.a., root) downstream to one or more egress LSRs (a.k.a., leaves). It is a tree-and-branch structure, where traffic is replicated at transit branch points and sent to the leaves. This scheme is efficient in terms of link capacity utilization, because only one copy of the traffic is ever sent over each link of the tree.
FIG. 2 illustrates a P2MP LSP. Traffic is sent from ingress LSR1 to LSR2, where it is replicated towards leaf LSR3 and leaf LSR4. As illustrated by dashed gray lines, there are 2 sub-LSPs, with paths 1-2-3 and 1-2-4.
Note that LSR1 sends only one packet copy to LSR2, even though the link to LSR2 carries multiple sub-LSPs.
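By way of a non-limiting illustration, the capacity saving may be sketched in Python as follows. The sketch counts how many copies of a single packet each link carries for the two sub-LSPs of FIG. 2, once with replication at the branch point and once, for comparison, with one separate copy per sub-LSP. The function names and the representation of a path as a list of LSR numbers are assumptions made for this illustration only.

```python
def links_of(path):
    """Return the directed links (hops) of a path given as a list of LSR numbers."""
    return list(zip(path, path[1:]))


def copies_per_link(sub_lsps, replicate_at_branch=True):
    """Count the copies of one root packet carried on each link.

    With replication at branch points (P2MP), a link shared by several
    sub-LSPs carries a single copy. Without it, every sub-LSP sends its
    own copy over every link it uses.
    """
    counts = {}
    for path in sub_lsps:
        for link in links_of(path):
            if replicate_at_branch:
                counts[link] = 1
            else:
                counts[link] = counts.get(link, 0) + 1
    return counts


# Sub-LSPs of FIG. 2: 1-2-3 and 1-2-4.
sub_lsps = [[1, 2, 3], [1, 2, 4]]
p2mp = copies_per_link(sub_lsps)
separate = copies_per_link(sub_lsps, replicate_at_branch=False)
print(p2mp)      # {(1, 2): 1, (2, 3): 1, (2, 4): 1}
print(separate)  # {(1, 2): 2, (2, 3): 1, (2, 4): 1}
```

In this sketch the shared link 1-2 carries one copy in the P2MP case and two copies when each sub-LSP is served separately.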
Fast Reroute (FRR)
A major MPLS feature is the support of fast reroute (FRR). FRR is a mechanism for rapid traffic restoration upon a link or node failure occurring along an LSP path. With FRR, an interrupted traffic stream can be rerouted around a failed node/link within a time period of under 50 milliseconds, thereby minimizing impact on the traffic.
The LSR located upstream of the failure (a.k.a., point of local repair, PLR) redirects the traffic of the so-called Working LSP onto a pre-established (P2P) backup LSP (a.k.a., bypass LSP), which bypasses the failure. The backup LSP is used to convey the traffic from the PLR to an LSR located downstream of the failure (a.k.a., merge point, MP), after which the traffic returns to the Working LSP. The MP is also the egress LSR of the backup LSP.
For the sake of simplicity, it will be assumed hereinafter that the MP is the closest LSR downstream of the failure. Accordingly, with FRR link protection, the MP is the next-hop (NH) LSR, i.e., the LSR at the far end of the protected link; with FRR node protection, the MP is the next-next-hop (NNH) LSR, i.e., the LSR that follows the NH along the Working LSP path. The backup LSP may be shared, i.e., provide protection to multiple Working LSPs, in which case it is known as a Facility backup LSP.
It will also be assumed that the failure of the protected link triggers the switchover to the backup LSP. This procedure provides fast detection, because it is based on rapid physical layer indications. Examples of such indications are loss of signal, signal quality degradation, and alarm indication.
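The choice of MP under this assumption may be sketched as follows. The function name `merge_point` and the strings "link" and "node" are hypothetical and are used for illustration only; the Working LSP path is represented as a list of LSR numbers.

```python
def merge_point(working_path, plr, protection):
    """Return the MP for a given PLR, assuming the MP is the closest
    LSR downstream of the failure.

    protection: "link" selects the NH, "node" selects the NNH.
    """
    hops = {"link": 1, "node": 2}[protection]
    i = working_path.index(plr)
    if i + hops >= len(working_path):
        raise ValueError("no LSR that far downstream of the PLR")
    return working_path[i + hops]


# Working LSP path 1-2-3.
working = [1, 2, 3]
nh_mp = merge_point(working, plr=2, protection="link")   # NH of LSR2
nnh_mp = merge_point(working, plr=1, protection="node")  # NNH of LSR1
print(nh_mp, nnh_mp)  # 3 3
```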
FIG. 3 illustrates FRR for a P2P Working LSP whose path includes 1-2-3. Backup LSP B1 protects against the failure of the link from LSR2 to LSR3. It originates at PLR LSR2 and terminates at MP LSR3 (the NH). Backup LSP B2 protects against the failure of LSR2, as well as the failure of the link 1-2. It originates at PLR LSR1 and terminates at MP LSR3 (the NNH). The backup LSP path also includes transit LSR4.
(i) Link Protection Scenario: when the link 2-3 goes down, the PLR LSR2 redirects the traffic to B1, which forwards it to MP LSR3, after which the traffic returns to the Working LSP.
(ii) Node Protection Scenario: when LSR2 or the link 1-2 goes down, both detected by PLR LSR1 via the failure of the link to LSR2, the PLR redirects the traffic to B2, which in turn forwards it to MP LSR3, after which the traffic returns to the Working LSP.
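The two scenarios may be sketched as a single switchover rule at the PLR: when the link towards the NH fails, the traffic follows the Working LSP up to the PLR, then the backup LSP to the MP, then the remainder of the Working LSP. The sketch below is illustrative only. The route 1-4-3 for B2 follows FIG. 3; the route of B1 is not stated in the text, so the transit LSR5 used for it here is a hypothetical placeholder, as are the function and variable names.

```python
def reroute(working_path, backups, failed_link):
    """Return the path followed by the traffic after switchover,
    or None if the failed link has no backup LSP.

    backups maps a protected link (plr, nh) to the backup LSP path,
    which starts at the PLR and ends at the MP.
    """
    if failed_link not in zip(working_path, working_path[1:]):
        return list(working_path)  # the failure is not on the Working LSP
    backup = backups.get(failed_link)
    if backup is None:
        return None  # unprotected link
    plr, mp = backup[0], backup[-1]
    head = working_path[:working_path.index(plr)]    # ingress up to the PLR
    tail = working_path[working_path.index(mp) + 1:]  # after the MP to egress
    return head + backup + tail


working = [1, 2, 3]
backups = {
    (2, 3): [2, 5, 3],  # B1, link protection; LSR5 is a hypothetical transit
    (1, 2): [1, 4, 3],  # B2, node protection via transit LSR4
}
link_case = reroute(working, backups, failed_link=(2, 3))
node_case = reroute(working, backups, failed_link=(1, 2))
print(link_case)  # [1, 2, 5, 3]
print(node_case)  # [1, 4, 3]
```

Note that the PLR in this sketch reacts only to the state of its own outgoing link, so a failure of LSR2 and a failure of the link 1-2 are handled identically by PLR LSR1.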
Multi-Failure Problem
The problem with the FRR scheme described above is that it does not protect against concurrent failures along both the Working and the backup LSP. As may be seen in FIG. 3, when both link 1-2 along the Working LSP path and link 1-4 and/or 4-3 along the backup LSP path fail, all traffic goes down.
A real-life application for multi-failure protection is illustrated in FIG. 4. A network is comprised of two topological rings 1 and 2. Ring 1 is formed by LSRs 1-2-3-7-6, while ring 2 is formed by LSRs 3-4-5-8-7. The rings are interconnected via LSR3 and LSR7. The links are usually realized with optical fibers. A Working LSP 1-2-3-4-5 is protected against the failure of LSR3 and of link 2-3 via backup LSP 2-1-6-7-8-5-4.
(i) Node Protection Scenario: side A marks the failure of node 3. When link 2-3 fails, LSR2 (PLR) infers that LSR3 (NH) is down and redirects the traffic to the backup LSP, along which it is conveyed to MP LSR4 (NNH). The successfully recovered traffic continues over the Working LSP towards LSR5.
(ii) Dual Link Failure Scenario: side B marks two link failures (“fiber cuts”), link 2-3 and link 8-5. When link 2-3 fails, LSR2 (PLR) redirects the traffic to the backup LSP, intended to bring it to MP LSR4 (NNH), after which the traffic would continue over the Working LSP towards LSR5. However, since link 8-5 failed too, the traffic reaches a dead end at LSR8 and cannot be recovered.
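Both scenarios of FIG. 4 may be traced with the following illustrative sketch, which walks the rerouted path hop by hop and reports the last LSR that the traffic reaches. Links are treated as bidirectional, so that a fiber cut takes down both directions. The function and variable names are assumptions made for this illustration.

```python
def last_lsr_reached(path, failed_links=(), failed_nodes=()):
    """Walk the path hop by hop and return the last LSR the traffic reaches.

    A hop is unusable if its link is cut (in either direction) or if the
    LSR at its far end is down.
    """
    down = {frozenset(link) for link in failed_links}
    for a, b in zip(path, path[1:]):
        if frozenset((a, b)) in down or b in failed_nodes:
            return a
    return path[-1]


working = [1, 2, 3, 4, 5]
backup = [2, 1, 6, 7, 8, 5, 4]  # from PLR LSR2 to MP LSR4

# Path after switchover at the PLR: Working LSP up to the PLR,
# the backup LSP, then the Working LSP from the MP onwards.
rerouted = (working[:working.index(backup[0])]
            + backup
            + working[working.index(backup[-1]) + 1:])
print(rerouted)  # [1, 2, 1, 6, 7, 8, 5, 4, 5]

# Side A: node 3 fails. The rerouted path avoids LSR3 and reaches egress LSR5.
side_a = last_lsr_reached(rerouted, failed_nodes={3})
# Side B: links 2-3 and 8-5 are both cut. The traffic stops at LSR8.
side_b = last_lsr_reached(rerouted, failed_links=[(2, 3), (8, 5)])
print(side_a, side_b)  # 5 8
```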
While a concurrent node failure (LSR3) and fiber cut (link 8-5) are usually rare, a double fiber cut (e.g., link 2-3 along the Working LSP path and link 8-5 along the backup LSP path) is common in some networks, in which case their service providers may require a solution to this problem. There are a number of solutions in the prior art which try to resolve similar problems. Namely:
draft-ietf-mpls-p2mp-te-bypass-02.txt proposes P2MP bypass LSPs for protecting P2MP Working LSPs. This requires the PLR to be capable of detecting the exact failed elements (e.g., whether a link or rather a node failed), after which it can activate the appropriate protection.
“The PLR needs to localize the failed elements in order to activate the P2MP Bypass Tunnel(s) protecting this element. Mechanisms through which this location is retrieved are out of the scope of this document.”
“The PLR may be directly upstream to the protected link or node or may also be two or more hops upstream. In case the PLR is not directly upstream to the failure, rerouting within the Bypass Tunnel(s) may be triggered by the following events: Failure of a BFD session between the PLR and the protected Element; A Path that indicates the location of the failed Element.”
The main drawback of this method is that it requires the PLR to distinguish between node and link failures. This often requires exchange of signaling, which complicates the solution and slows down the recovery time.
draft-vasseur-mpls-linknode-failure-00.txt (also described in US 2003233595) uses a specific method for distinguishing between a link failure and a node failure. It uses a “Hello” message exchange over an alternative path between the PLR and the NH for detecting when the NH cannot be reached:
[section 5] “ . . . the PLR uses the RSVP hello to determine whether its neighbor is reachable via another path than the failed link. If this is the case, the PLR can conclude of a link failure. If not, the failure is a node failure.”
[section 7] “Once the link failure has been detected by the PLR . . . there is a period of time during which the PLR does not know whether the failure is a link or a node failure . . . ”
As with the previous method, the main drawback of this method is that it requires the PLR to distinguish between node and link failures. As explained above, this often requires an exchange of signaling, which complicates the solution and slows down the recovery time.
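The decision rule quoted from section 5, together with the period of uncertainty quoted from section 7, may be sketched as follows. This is a simplified illustration and not an implementation of RSVP; the type and function names are hypothetical, and the outcome of the Hello exchange over the alternative path is reduced to a single value that is None while no outcome is known yet.

```python
from enum import Enum
from typing import Optional


class Failure(Enum):
    UNKNOWN = "unknown"  # Hello exchange still pending
    LINK = "link"
    NODE = "node"


def classify_failure(nh_reachable: Optional[bool]) -> Failure:
    """Classify a failure detected on the link towards the NH.

    nh_reachable: True if the NH answered the Hello over the alternative
    path, False if the exchange failed, None while it is still pending.
    """
    if nh_reachable is None:
        return Failure.UNKNOWN
    return Failure.LINK if nh_reachable else Failure.NODE


pending = classify_failure(None)
link = classify_failure(True)
node = classify_failure(False)
print(pending.value, link.value, node.value)  # unknown link node
```

While the result is UNKNOWN, the PLR in this sketch cannot yet select between link and node protection, which corresponds to the delay identified above as the drawback.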
US 2011/0110224 discloses a dual FRR method, which provides both link and node protection, where the NH applies blocking rules to avoid traffic duplication:
“to use backup LSP(s) to provide both link and node protection concurrently (thus initiating so-called Dual or concurrent FRR), while configuring a suitable blocking rule at the link protection merge point (the NH), to avoid traffic duplication that would otherwise occur with standard FRR.”
The main drawback of the disclosed method is the need to replicate traffic at the NNH (called NNHOP), which consumes extra (double) resources at the NNHOP, where internal capacity resources are often limited. This is especially undesirable when protecting a P2P Working path, where there is no reason to replicate packets.
“Traffic arriving on B3 to NNHOP LSR3 returns to the working LSP towards LSR4. Since NNHOP LSR3 is a transit & leaf LSR for B3, traffic also continues on B3 towards LSR2”