Communication networks require mechanisms for automatic recovery from network failures. These mechanisms may be different for different types of failures, for example for control-level and data-level failures, and may depend on network type. Legacy networks are often based on SONET/SDH systems, wherein network failures typically imply simultaneous control-level and data-level failures because control messages and user information are transmitted together in frames.
MPLS (Multi-Protocol Label Switching) represents an evolution in the routing architecture of IP packet-based networks, wherein data is forwarded using labels that are attached to each data packet. These labels must be distributed between the nodes that comprise the network. MPLS does not replace IP routing, but works alongside existing routing technologies to set up label-switched paths (LSPs) between ingress and egress nodes, and to provide very high-speed data forwarding at Label-Switched Routers (LSRs) together with reservation of bandwidth for traffic flows along each LSP with differing Quality of Service (QoS) requirements.
Benefits of using an MPLS-based network architecture include, e.g., better price/performance in routers, scalability, better integration with circuit-switched technologies such as Frame Relay and ATM, the ability to implement layer 2 and layer 3 virtual private networks, and improved control of traffic characteristics.
GMPLS (Generalized Multi-Protocol Label Switching) is an extension of the MPLS protocols to circuit-switched, e.g. optical, networks. GMPLS extends the well-known MPLS mechanisms for new interfaces such as wavelength or fiber, introducing many extensions to existing protocols.
According to the MPLS and GMPLS specifications, their respective network models contain the following three functional planes:
a) a transport plane, also referred to as data plane, responsible for traffic transport and switching;
b) a control plane, responsible for connection and resource management, defined as an IP-based plane, which can be either integrated with or separated from the managed transport network;
c) a management plane, responsible for supervision and management of the whole system, including transport and control planes.
To ensure network resilience, appropriate failure recovery mechanisms have to be implemented at all three planes of the network. Protection and restoration of the data plane have been extensively addressed, and techniques for data-plane protection and restoration are well known in the art. In a GMPLS network, the integrity of the control and data planes is more or less independent when they are physically separate.
The control plane is responsible for the transfer of signaling and routing messages as well as the management of connections and resources, and therefore has to be reliable to ensure reliability of the whole network. Moreover, the majority of the protection and restoration mechanisms in the transport plane require an efficient signaling network, which is supported by the control plane. A failure in the control plane can have a fundamental impact not only on new but also on existing connections. A reliable and survivable control plane can be achieved by implementing appropriate protection mechanisms and by providing effective recovery procedures, which allow maintenance of the supported services in spite of failures in the control plane. Therefore, it may be beneficial to focus on minimizing service interruptions due to a control plane failure or during its maintenance.
A review of several prior-art methods for control plane recovery in MPLS and GMPLS networks is provided in an article entitled “Recovery of the Control Plane after Failures in ASON/GMPLS Networks” by Andrzej Jajszczyk and Pawel Rozycki, published in IEEE Network Magazine, January/February 2006, which is incorporated herein by reference.
An essential part of a control plane of many MPLS networks is the Label Distribution Protocol (LDP). The LDP protocol is a signalling protocol, which is used to set up, maintain and tear down connections in an MPLS network. The Constraint-based Routing Label Distribution Protocol (CR-LDP) is an extension of the LDP, and is used as a signalling protocol for GMPLS-controlled circuit-switched networks. Between two adjacent control nodes, an LDP session is used to exchange LDP messages and control the corresponding data plane links. A failed LDP session results in the loss of LDP state information, which cannot be automatically recovered in a new restarting LDP session unless a specific recovery mechanism is implemented.
In contrast to the fault tolerance of the resource reservation protocol (RSVP), which uses periodic state refreshes, the LDP is vulnerable to hardware and software failures. Routing protocols such as the Open Shortest Path First (OSPF) or the Intermediate System to Intermediate System (IS-IS) are fairly fault tolerant. They exchange information through periodic link state advertisements. If a control plane failure happens, they can still recover after the fault is fixed and the link state advertisement resumes. The LDP's difficulty in failure recovery is inherent to hard-state protocols, e.g., the Border Gateway Protocol (BGP) and the Private Network to Network Interface (PNNI), because their status information is not automatically refreshed.
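The difference between periodically refreshed (soft) state and hard state can be illustrated with a minimal sketch. The class names, the session keys, and the integer time values below are hypothetical and chosen only for illustration; they are not taken from the RSVP or LDP specifications.

```python
class SoftStateTable:
    """Entries expire unless a neighbor refreshes them (RSVP-like behaviour)."""

    def __init__(self, lifetime):
        self.lifetime = lifetime
        self.expiry = {}  # key -> time at which the entry expires

    def refresh(self, key, now):
        # Each periodic refresh (re)installs the entry and extends its lifetime.
        self.expiry[key] = now + self.lifetime

    def live(self, now):
        return {key for key, t in self.expiry.items() if t > now}


class HardStateTable:
    """Entries are installed once and are never re-sent (LDP-like behaviour)."""

    def __init__(self):
        self.entries = set()

    def install(self, key):
        self.entries.add(key)


def simulate_restart():
    keys = ("lsp-1", "lsp-2")  # hypothetical session identifiers

    # Before the failure both tables hold the same state.
    soft, hard = SoftStateTable(lifetime=3), HardStateTable()
    for key in keys:
        soft.refresh(key, now=0)
        hard.install(key)

    # The control-plane restart wipes both tables.
    soft, hard = SoftStateTable(lifetime=3), HardStateTable()

    # Neighbors keep sending periodic refreshes, so the soft state reappears;
    # nothing re-sends the hard state, so it stays lost.
    for key in keys:
        soft.refresh(key, now=5)
    return soft.live(now=6), hard.entries
```

In this sketch, `simulate_restart()` returns the full set of sessions for the soft-state table and an empty set for the hard-state table, which is why a hard-state protocol needs an explicit recovery mechanism.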
The importance of handling control plane failures and recovery for a signalling protocol was identified in the prior art. It was suggested that any control plane failure must not result in releasing established calls and connections. Upon recovery from a control plane failure, the recovered node must have the ability to recover the status of the calls and connections established before the failure. Calls and connections in the process of being established (i.e. pending call/connection set-up requests) should be released or continued with set-up.
Known generic failure recovery techniques for distributed systems or control systems may be applied to the LDP failure recovery. In addition, several techniques have been proposed specifically for the LDP failure recovery. These prior-art techniques typically focus on one of two possible kinds of control plane failures: failure of a signaling channel, or failure of a control plane component, which may be either hardware or software related. These techniques have different assumptions and objectives, resulting in different recovery capability, recovery accuracy and speed, and different implementation overhead and cost:
1. Redundant control node hardware or LDP signaling software. A standby backup control node or LDP signaling module may replace a failed one in real time.
2. Persistent storage of relevant information. After a reboot, such a control node may maintain the LDP state information, configuration information, and control plane neighbor information. This technique relies on the information stored in the failed node itself, resulting in limited recovery capability.
3. Backup signaling channels, when the LDP messages are re-routed over the backup signaling channels if the primary signaling channel fails; this approach is described, for example, in J. Lang (Ed.) Link management protocol (LMP), IETF draft draft-ietf-ccamp-lmp-10.txt, October 2003, and E. Mannie (Ed.) Generalized Multi-protocol label switching architecture, IETF RFC 3945, October 2004.
4. Message logging, when all LDP messages are securely stored and replayed if a failure occurs. This technique relies on the information stored in the failed node itself, which limits the recovery capability from control node failures. In addition, this technique may be harder to scale to a large network.
5. Graceful restart mechanism for the LDP, wherein a downstream node provides to its upstream neighbor label mapping information that the downstream node preserves through a restart. This technique, however, may not be applicable to downstream control node failures.
6. Control plane queries the data plane about the channel status. Depending on the data plane capability, the channel status, e.g., in-use or idle, may be extracted to recover a control node's lost status information.
7. Query-and-reply based LDP state information recovery disclosed in “Distributed call and connection management: signaling mechanism using GMPLS CR-LDP”, ITU-T recommendation G.7713.3/Y.1704.3, March 2003. This method can recover detailed LDP state information and is not limited to recovering from the backup state information at direct neighbours; however, it is relatively slow and may result in a considerable delay before the node is operational and a new connection can be established.
8. Management system centralized recovery. The network management system may conduct complicated coordination and information transfers, but in a less real-time manner.
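As an illustration of the message logging technique (item 4 above), the following is a minimal sketch in which a node logs every received message before applying it and rebuilds its state after a restart by replaying the log. The message tuple layout, the class name `LoggedLdpNode`, and the use of an in-memory list as a stand-in for secure persistent storage are assumptions made for the example, not details of any cited technique.

```python
class LoggedLdpNode:
    """Hypothetical control node that logs each message before applying it."""

    def __init__(self, log=None):
        # A plain list stands in for secure persistent storage (assumption).
        self.log = log if log is not None else []
        self.labels = {}  # destination key -> label

    def _apply(self, msg):
        kind, dest, label = msg
        if kind == "mapping":
            self.labels[dest] = label
        elif kind == "release":
            self.labels.pop(dest, None)

    def receive(self, msg):
        self.log.append(msg)  # log first, so a crash cannot lose the message
        self._apply(msg)

    @classmethod
    def restart_from(cls, log):
        # After a restart, replay the stored messages in their original order.
        node = cls(log)
        for msg in log:
            node._apply(msg)
        return node


node = LoggedLdpNode()
node.receive(("mapping", "10.0.0.0/8", 17))
node.receive(("mapping", "192.168.0.0/16", 18))
node.receive(("release", "10.0.0.0/8", 17))

recovered = LoggedLdpNode.restart_from(node.log)
```

Here the recovered label table contains one entry while the log holds three messages: the log grows with every message rather than with the amount of live state, which is one way to see why this technique may be harder to scale.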
An alternative solution to recovery has been proposed by the inventors of the present invention in an article entitled “Recovery from Control Plane Failures in the CR-LDP Signalling Protocol,” published in IEEE ICC 2003, vol. 26, no. 1, 2003, pp. 1309-13. This article describes a distributed system of control-plane recovery, where each of the upstream nodes maintains a copy, called a Label Information Mirror (LIM), of the Label Information Database (LID) from a respective downstream node. The LIM is created by using Label Mapping and Label Release messages received from the downstream node. In the event of a control-plane failure, the LID is synchronized with the LIM using new LID TLV and LIM TLV objects.
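The mirroring idea can be sketched as follows. This is a simplified illustration only: the class and method names, the message tuples, and the dictionary representation of the LID and LIM are hypothetical, and the sketch does not model the LID TLV and LIM TLV encoding used in the article.

```python
class DownstreamNode:
    def __init__(self):
        self.lid = {}  # Label Information Database: LSP id -> label

    def map_label(self, lsp_id, label):
        self.lid[lsp_id] = label
        return ("label_mapping", lsp_id, label)

    def release_label(self, lsp_id):
        label = self.lid.pop(lsp_id)
        return ("label_release", lsp_id, label)

    def fail(self):
        # A control-plane failure loses the LDP state held in the LID.
        self.lid = {}

    def synchronize(self, lim):
        # Recovery: rebuild the LID from the upstream neighbor's mirror.
        self.lid = dict(lim)


class UpstreamNode:
    def __init__(self):
        self.lim = {}  # Label Information Mirror of the downstream LID

    def receive(self, msg):
        # The mirror is maintained from Label Mapping / Label Release messages.
        kind, lsp_id, label = msg
        if kind == "label_mapping":
            self.lim[lsp_id] = label
        elif kind == "label_release":
            self.lim.pop(lsp_id, None)


down, up = DownstreamNode(), UpstreamNode()
up.receive(down.map_label("lsp-1", 100))
up.receive(down.map_label("lsp-2", 101))
up.receive(down.release_label("lsp-1"))

down.fail()
down.synchronize(up.lim)
```

After `synchronize`, the downstream LID again holds the single remaining mapping for `lsp-2`. Note that the whole mirror is copied in one step, so in this sketch the recovery cost grows with the number of entries in the LID.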
Advantageously, this method provides a unified distributed solution that is equally applicable to both kinds of control-plane failures: those related to signaling channels and those related to control plane components of the nodes themselves. However, the amount of information within one LID, and accordingly within one LIM, can be significant, and transmitting it from one node to another to accomplish a complete recovery of all LDP state information may take considerable time. It would be advantageous to provide a method for recovery from a control plane failure that is scalable, does not rely on additional hardware and/or additional requirements imposed on the data plane equipment, and enables a fast restoration of at least basic operation capability of a failed node.
Accordingly, an object of this invention is to provide a scalable method of operating a control plane of a communication network that enables a fast return of the control plane to operation after a control plane failure or other interruption, including control plane maintenance.
Another object of the present invention is to provide a communication network node controller that is capable of a fast recovery after a control plane failure.
Another object of this invention is to provide a system for facilitating a fast recovery of a control plane from a failure that does not rely on additional hardware or on specific recovery support features of data plane equipment.