The present disclosure relates to networking technologies, and more particularly to techniques for reducing the latency in performing a failover from a protected connection to its backup connection when a network event is detected affecting the protected connection.
Connection-oriented protocols such as Multi-Protocol Label Switching (MPLS) are widely used to transport packets across computer networks. MPLS is an example of a label or tag switching protocol that provides a data-carrying mechanism which emulates some properties of a circuit-switched protocol over a packet-switched network. In MPLS, a connection referred to as a label switched path (LSP) is established between two end points and packets are transported along the LSP using label switching. An LSP is unidirectional and represents a tunnel through which data is forwarded across an MPLS network using label/tag switching. Various signaling protocols may be used to set up and manage LSPs. Examples include Resource Reservation Protocol (RSVP) and its various extensions such as RSVP-TE for traffic engineering, and others. RSVP-TE provides a mechanism for reserving resources for LSPs.
Routers that are capable of performing label-based switching according to the MPLS protocol are referred to as label switch routers (LSRs). The entry and exit points of an MPLS network are called label edge routers (LERs). The entry router is referred to as an ingress LER and the exit router as an egress LER. Routers in between the ingress and egress routers are referred to as transit LSRs. LSPs are unidirectional tunnels that enable a packet to be label switched through the MPLS network from an ingress LER to an egress LER.
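The roles of the ingress LER, transit LSR, and egress LER can be illustrated with a minimal sketch. The router names, the label values 100 and 200, and the table layout are hypothetical and chosen only for illustration; real label tables are populated by a signaling protocol such as RSVP-TE.

```python
# Hypothetical per-router label tables: incoming label -> (outgoing label, next hop).
# An incoming label of None means the packet arrived unlabeled.
# An outgoing label of None means the label is popped.
TABLES = {
    "ingress": {None: (100, "transit")},  # ingress LER pushes the first label
    "transit": {100: (200, "egress")},    # transit LSR swaps label 100 for 200
    "egress": {200: (None, None)},        # egress LER pops the label; the LSP ends
}


def forward(start="ingress"):
    """Trace one packet along the LSP and return the (router, incoming label) hops."""
    hops = []
    router, label = start, None
    while router is not None:
        hops.append((router, label))
        # Look up the incoming label to find the outgoing label and next hop.
        label, router = TABLES[router][label]
    return hops


print(forward())
```

Each router consults only its own table and the incoming label, which is what allows a packet to be label switched along the LSP without a full route lookup at every hop.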
The flow of packets along an LSP may be disrupted by various network events. Examples include failure of an interface or link along a path traversed by an LSP; failure of a node (e.g., a router) in the LSP path; reduction in bandwidth associated with a link traversed by the LSP; and priority-related events, such as a new high priority LSP coming up and causing bandwidth contention, or a change in priority of an existing LSP, either of which may cause lower priority LSPs to be preempted. To protect against potential data losses caused by such disruptions, a backup LSP may be provisioned for an LSP (referred to as the primary LSP to differentiate it from the backup LSP). The backup LSP provides an alternative path for forwarding packets around a failure point in the primary LSP. Since the primary LSP is “protected” by its corresponding backup LSP, the primary LSP is referred to as a protected LSP.
The Fast ReRoute (FRR) extension to RSVP-TE provides a mechanism for establishing backup LSPs for protecting primary LSPs. The protected LSP is also referred to as an FRR-enabled LSP. When a network event occurs that affects a protected LSP, the packet traffic is locally redirected along the backup LSP in a manner that circumvents the failure point in the protected LSP. When a router starts redirecting data along a backup LSP for a protected LSP, the protected LSP is referred to as being failed over to the backup LSP. FRR enables RSVP to set up a backup LSP to protect an LSP so that in the event of a network failure (e.g., link/node failure), the data traffic from the protected LSP can be switched to the backup LSP to minimize any traffic loss.
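The local redirection described above can be sketched as follows. This is a simplified model, not an RSVP-TE implementation: the LspEntry type, its field names, and the interface names used with it are hypothetical, and the backup interface is assumed to have been set up before the failure.

```python
from dataclasses import dataclass


@dataclass
class LspEntry:
    """Hypothetical forwarding state kept by a router for one protected LSP."""
    name: str
    primary_if: str      # outgoing interface of the protected LSP
    backup_if: str       # outgoing interface of the pre-established backup LSP
    active_if: str = ""  # interface currently carrying the traffic

    def __post_init__(self):
        # Traffic starts on the primary path unless stated otherwise.
        if not self.active_if:
            self.active_if = self.primary_if


def fail_over(entries, failed_if):
    """Redirect each LSP still using the failed primary interface to its backup."""
    moved = []
    for entry in entries:  # LSPs are handled one after another
        if entry.primary_if == failed_if and entry.active_if == failed_if:
            entry.active_if = entry.backup_if
            moved.append(entry.name)
    return moved
```

For example, given one LSP whose primary interface is "eth1" and another whose primary interface is "eth3", calling fail_over(entries, "eth1") redirects only the first LSP and leaves the second on its primary path.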
A network event may adversely affect multiple LSPs and trigger a failover of these LSPs. The failover operation should be performed in a timely manner to minimize packet loss so that the services provided by the affected LSPs are not adversely impacted. This is especially important for real-time applications such as voice services. In many instances, the failover needs to be performed in under 50 msec to make the failover undetectable to the end user. Accordingly, in several instances, it is essential that the data traffic over an LSP not suffer a loss of more than 50 msec in the event of an FRR triggering event.
A network event can trigger the failover of multiple LSPs. This may occur, for example, when an interface used by multiple LSPs fails. When such an event occurs, the affected LSPs are failed over in a sequential manner. The time needed for failing over multiple LSPs is thus directly proportional to the number of LSPs being failed over. If the number of LSPs being failed over is small, a failover time and data traffic recovery of under 50 msec is generally achievable. However, if the number of LSPs that have to be failed over in response to the same event is large, the total recovery time needed for the failover operation can exceed the 50 msec mark. For example, suppose 10K protected LSPs use a common link. When that common link fails, all 10K LSPs need to be failed over to their backup LSPs. In this case, perhaps the first 5K protected LSPs can be switched over to their backup LSPs within 50 msec, while the rest take longer than 50 msec. Accordingly, depending upon the number of protected LSPs that need to be failed over, the 50 msec limit may be exceeded.
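The proportionality between failover time and LSP count can be made concrete with a short calculation. The per-LSP cost of 10 microseconds is an assumption derived from the illustrative figure above (roughly 5K LSPs in 50 msec); it is not a measured value, and the function name is hypothetical.

```python
BUDGET_US = 50_000  # 50 msec recovery target from the text, in microseconds
PER_LSP_US = 10     # assumed per-LSP cost: 50 msec / 5K LSPs (illustrative only)


def failover_summary(num_lsps, per_lsp_us=PER_LSP_US, budget_us=BUDGET_US):
    """Return (LSPs recovered within budget, LSPs recovered late, total time in usec)."""
    # With sequential processing, the i-th LSP completes at i * per_lsp_us,
    # so only the first budget_us // per_lsp_us LSPs finish within the budget.
    within = min(num_lsps, budget_us // per_lsp_us)
    late = num_lsps - within
    total_us = num_lsps * per_lsp_us
    return within, late, total_us


print(failover_summary(10_000))  # prints (5000, 5000, 100000)
```

Under this assumption, 10K LSPs take 100 msec in total, and half of them are restored only after the 50 msec target has passed.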
In order to reduce the time needed for failovers, in RSVP FRR the backup LSPs are computed and signaled in advance of a failover triggering event. As a result, the time needed to signal and create a backup LSP after the occurrence of a failover triggering event is eliminated. However, in spite of this, the manner in which failovers are conventionally handled by routers adds significant latency to the failover operation. For example, when a transit LSR along the path of a protected LSP detects a downstream interface or node failure (or another failure event), a control processor of the transit LSR has to perform extensive RSVP state processing to facilitate failover of the traffic to the backup LSPs. Further, the amount of messaging and signaling that takes place among the various components of a router performing the failover also adds to the overall failover processing time. Together, these factors increase failover latency.