The invention relates generally to network communications. More specifically, the invention relates to a method and system for assuring component circuits transported in aggregated circuits restore correctly after an aggregated circuit fault.
With the advent of Ultra Long Haul (ULH) networks and planned availability of very high speed Optical Carrier OC-768 links, switched circuit-based networks will evolve from the current flat topology to a hierarchical network. The ULH Dense Wavelength Division Multiplexing (DWDM) transport optical network supports OC-768 core capacity with transmission speeds up to 40 Gbit/sec, carrying Internet Protocol (IP), Multiprotocol Label Switching (MPLS), and Synchronous Optical Network/Synchronous Digital Hierarchy (SONET/SDH) services.
Switched circuit-based networks typically include a number of switches connected by copper or optical communication lines. Switches are computer networking devices that encompass routers and bridges, as well as devices that may distribute traffic based on load or application content and may operate at one or more OSI layers, including physical, data link, network, or transport (end-to-end). There may be multiple lines between a given pair of switches, and not every pair of switches needs to be connected to each other. Communication lines may be of various capacities that are generally expressed in bandwidth units such as OC-N, where N=48, 192, 768 . . . . Lines are often grouped, or aggregated, into links and certain information is associated with a link.
FIG. 1 shows a non-hierarchical network 101 that includes a plurality of switches 103, 105, 107, 109, 111 that define links 113, 115, 117, 119, 121, 123, 125 and two end systems 127, 129. A circuit 131 between the two end systems 127, 129 is provisioned going from switch 103 to switch 107. In non-hierarchical networks, circuits are provisioned between pairs of switches and several classes of services may be carried on these circuits.
A circuit 131 in a non-hierarchical network has two end points, a source switch 103 and a destination switch 107, and can span one or more intermediate switches 105. The source switch 103 is responsible for setting up the circuit and for restoring the circuit if a fault or failure in the network 101 route takes the circuit down. Switches in the circuit route adjacent to a failure detect the failure, identify that the circuit is affected by the failure, and send release messages to the source 103 and destination 107 switches. The release messages travel along the circuit route and release all resources held by the circuit. The source switch 103 tries to re-establish (restore) the circuit on an alternate route to the destination switch that avoids the point of failure. This is referred to as end-to-end restoration.
The end systems 127, 129 are connected to the network 101 but are not considered part of it. The circuit 131 between the end systems 127, 129 is established by routing it between the two switches connected to the end systems and network, and can span multiple links. The sequence of links spanned by the circuit is referred to as its service route.
If there is a failure in the network affecting one or more of the links or switches within the service route of the circuit, the circuit fails. In this case, the circuit may be re-routed on a new (restoration) route that avoids the failed portions of the network. After the failure is repaired, the circuit may revert back to its original service route.
Circuit restoration speed is of paramount importance in such networks, and sub-second restoration is guaranteed for a majority of premium circuits even in large failure scenarios. Typically, a single processor controls all restoration activities in each switch, and the restoration speed deteriorates with the number of failed circuits.
Most networks use routing and signaling protocols to automate a variety of functions, such as self-discovery of network resources; construction and maintenance of a link-state database of routing information across all switches; automatic provisioning and restoration of circuits; determination of routes for provisioning and restoration of circuits; detection of network failure conditions; and flooding, to all switches, of information about any change in the state of the network, including failures of switches and links and changes in available bandwidth on a link, among others. The routing and signaling protocols include Open Shortest Path First (OSPF), MPLS, Private Network-to-Network Interface (PNNI), etc., and variants of these protocols that have been adapted to specific networks or applications.
These networks are characterized by the fact that intelligence is distributed in every switch and is not centralized in one or more central locations. Typically, all switches run the same set of protocols although the functions performed by the switches may vary based on how the switches are used. For example, border switches in an OSPF domain have greater functionality than other switches. Thus, switches employing the same or similar protocols operate independently of each other. Any co-ordination of activities between switches is performed by sending messages to each other in ways prescribed by the routing and signaling protocols.
For cost-efficient operation, multiple circuits are aggregated into a larger aggregated circuit referred to as a bundled circuit, tunnel, pipe, etc., establishing a higher level in a hierarchical architecture. Failures in the higher level cause the aggregated circuit to be released and restored as a single entity entirely within the higher level hierarchy. This allows for much faster restoration than if the individual “component” circuits making up the aggregated circuit were restored separately. Aggregated circuits can grow in size as individual component circuits provisioned in the network are added to the aggregated circuit and may shrink in size as individual circuits are de-provisioned. If there is a failure in the lower level of the hierarchy, each impacted component circuit is restored end-to-end even if it is part of an aggregated circuit somewhere along its route. A failed component circuit is de-provisioned from the aggregated circuit and then re-established in the network, for example, by joining another aggregated circuit along the restoration route.
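The way an aggregated circuit grows and shrinks with its component circuits can be illustrated with a minimal sketch. The class name, its methods, and the bandwidth figures (in OC-N units) are assumptions made for illustration only and are not taken from any particular switch implementation.

```python
from dataclasses import dataclass, field


@dataclass
class AggregatedCircuit:
    """Hypothetical model of an aggregated circuit (bundle, tunnel, pipe)."""
    name: str
    # component circuit name -> bandwidth in assumed OC-N units
    components: dict = field(default_factory=dict)

    def size(self):
        """Current size is the sum of the component circuit bandwidths."""
        return sum(self.components.values())

    def add(self, circuit, bandwidth):
        """Provision a component circuit inside the aggregate (grows it)."""
        self.components[circuit] = bandwidth

    def remove(self, circuit):
        """De-provision a component circuit, e.g. on its release (shrinks it)."""
        self.components.pop(circuit, None)


z = AggregatedCircuit("Z")
z.add("X", 48)          # assumed bandwidth for component circuit X
z.add("Y", 48)          # assumed bandwidth for component circuit Y
size_with_both = z.size()
z.remove("X")           # X failed in the lower level and is released
size_after_release = z.size()
```

In this sketch a failed component circuit simply disappears from the aggregate; re-establishing it on a restoration route would be a separate provisioning step.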
With circuit aggregation in a two-level hierarchical network, an end-to-end circuit route typically has three components. There is a middle segment in the higher level hierarchy that may be part of an aggregated circuit, and two tail segments, a source tail segment and a destination tail segment at each end of the circuit. FIGS. 2 and 3 show a component circuit X that is part of an aggregated circuit Z. The route of component circuit X includes source switch A, destination switch I, and intermediate switches B-C-D-E-F-G-H. Switches C, D, E, and F are SW+ switches (shown checkered) while the other switches are SW switches (shown solid). SW+ switches are capable of carrying higher speed circuits as well as multiplexing several lower speed circuits inside a higher speed circuit. In the example, the SW+ switches may mesh with OC-768 ULH links. Below this higher level may be a larger footprint with SW switches, meshed to each other and to the SW+ switches with OC-N links. The OC-768/SW+ part of the network is the higher level of the hierarchy and can support much larger circuit sizes. The SW part of the network is the lower level of the hierarchy and may have smaller circuit speeds. The most general circuit in the hierarchical network can begin and end in SW switches and may be provisioned over a sequence of SW and SW+ switches.
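The three-part structure of a route can be made concrete with a short sketch that splits the route of component circuit X at the end switches of aggregated circuit Z (switches C and F, as described for the figures). The function name and the list representation of a route are illustrative assumptions.

```python
def split_route(route, agg_source, agg_destination):
    """Split a route into (source tail, middle segment, destination tail).

    The end switches of the aggregated circuit appear in two segments each,
    because each one terminates a tail segment and the middle segment.
    """
    start = route.index(agg_source)
    end = route.index(agg_destination)
    if start > end:
        raise ValueError("aggregated circuit runs against the route direction")
    return route[:start + 1], route[start:end + 1], route[end:]


# Route of component circuit X: A-B-C-D-E-F-G-H-I, aggregated between C and F.
route_x = list("ABCDEFGHI")
source_tail, middle, destination_tail = split_route(route_x, "C", "F")
```

Here the source tail is A-B-C, the middle segment carried inside the aggregated circuit is C-D-E-F, and the destination tail is F-G-H-I.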
The aggregated circuit Z is defined between switches C and F. One of the switches acts as the source switch for the aggregated circuit, for example, switch C, and another switch, for example, switch F, acts as the destination switch of the aggregated circuit. FIG. 3 shows another component circuit Y that has been aggregated into aggregated circuit Z. Component circuit Y starts at switch B and ends at switch H.
A new circuit order is provisioned between a pair of switches. One switch is selected as the source switch of the circuit and the other becomes the destination switch. The source switch calculates a route for the circuit using information collected by the routing protocol. The information typically includes network topology, available network resources, etc. The route must have sufficient network resources to meet quality of service (QoS) requirements (bandwidth, delay, etc.) for the circuit. For example, in FIG. 1, the route calculated by source switch 103 for circuit 131 is via switch 105 to switch 107, and travels over links 113 and 115. The circuit route is specified as a sequence of links. For example, the route for circuit 131 is the sequence of links 113 and 115 and can be denoted as links {113, 115}. An alternative route between the source 103 and destination 107 switches using different links, for example a route comprising links {121, 123, 125}, would be distinct from the route of circuit 131.
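A route calculation of this kind can be sketched as a fewest-hop search over links that have enough available bandwidth. This is only an illustration: the endpoints of links 113 and 115 follow the description of FIG. 1, while the endpoints assigned to links 121, 123 and 125, every bandwidth figure, and the function name are assumptions. Real routing protocols use richer metrics than hop count.

```python
from collections import deque

# link id -> (switch, switch, available bandwidth in assumed OC-N units).
# Endpoints of 121, 123, 125 and all bandwidths are assumed for illustration.
LINKS = {
    113: (103, 105, 48),
    115: (105, 107, 48),
    121: (103, 111, 192),
    123: (111, 109, 192),
    125: (109, 107, 192),
}


def calculate_route(src, dst, bandwidth, excluded=frozenset()):
    """Return the fewest-hop route as a list of link ids, or None."""
    adjacency = {}
    for link_id, (a, b, available) in LINKS.items():
        if available >= bandwidth and link_id not in excluded:
            adjacency.setdefault(a, []).append((b, link_id))
            adjacency.setdefault(b, []).append((a, link_id))
    queue = deque([(src, [])])
    visited = {src}
    while queue:
        switch, route = queue.popleft()
        if switch == dst:
            return route
        for neighbor, link_id in adjacency.get(switch, []):
            if neighbor not in visited:
                visited.add(neighbor)
                queue.append((neighbor, route + [link_id]))
    return None


service_route = calculate_route(103, 107, bandwidth=48)
alternative_route = calculate_route(103, 107, bandwidth=48, excluded={115})
```

Under these assumptions the first call yields links {113, 115} and the second, with link 115 excluded, yields links {121, 123, 125}.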
The source switch sets up the circuit using a signaling protocol. A setup message is sent out along the calculated route of the circuit. Each switch in the route checks to see if the requested resources are available and then allocates the resources to the circuit. The setup message contains the selected route, so each switch in the route can forward the message to the next switch in the route. If all switches are able to allocate the resources, the setup succeeds. If not, it fails. A failed setup may result in a crankback message to the source switch that then tries to set the circuit up on a different route. Crankback is a mechanism originally used by Asynchronous Transfer Mode (ATM) networks. The new route must also have sufficient resources to meet the needs of the circuit.
A single optical fiber cut may cause multiple link failures in a network. Multiple failures in which a link fails in each level of the hierarchy simultaneously result in the failure of the aggregated circuit as well as a tail segment of one or more component circuits. A component circuit with a failed tail segment must be restored end-to-end.
Whenever a failure occurs, a number of component and aggregated circuits may be impacted. The switches adjacent to the failure first detect the failure condition, identify the circuits affected by it, and then initiate signaling messages releasing the allocated circuits. The release messages travel back to the source and destination switches of the provisioned circuit, releasing all resources held by the circuit along the way. The source switch of each failed circuit then calculates a new route and tries to establish the failed circuit on the new route. This is referred to as restoring the circuit.
The new route must have sufficient resources to meet the needs of the circuit. It must also avoid the failed part of the network. Information about the failed part is disseminated by the routing protocol but there may be a short delay in receiving this information. The release (crankback) message may also contain information regarding where the circuit (setup) failed. Generally, the procedure used to restore the circuit is identical to the method used to provision it in the first place.
FIG. 4 shows three different failure points. Failure points 1 and 3 affect the source and destination tail segments of component circuit X and will cause end-to-end restoration of component circuit X. Failure point 2 affects aggregated circuit Z and causes restoration of just aggregated circuit Z on a new route between switches C and F, the two end switches of aggregated circuit Z. Since component circuit X is a component circuit of aggregated circuit Z, the restoration of aggregated circuit Z results in the restoration of component circuit X as well.
Failure point 1 between switches A and B in the source tail segment of component circuit X is detected by switches A and B. Switch B determines that component circuit X has failed and sends a release message for circuit X towards component circuit X's destination switch I, along route B-C-D-E-F-G-H-I, releasing all resources held by circuit X along the route. Component circuit X will be de-provisioned from aggregated circuit Z by switches C, D, E and F as a result of the release message.
Source switch A also determines that component circuit X has failed and that it is the source switch of component circuit X. It therefore does not need to send any release message for component circuit X but has the responsibility to restore component circuit X on a route that avoids failure point 1.
Similarly, failure point 3 between switches F and G in the destination tail segment of component circuit X is detected by switches F and G. Switch F determines that component circuit X has failed and sends a release message for component circuit X towards the source switch A along the route F-E-D-C-B-A releasing all resources held by component circuit X along the way. The source switch A has the responsibility, as before, to restore the component circuit X along a route that avoids failure point 3. Switch F (and switches E, D, C as well) determines that component circuit X is part of aggregated circuit Z and, as part of the release process, de-allocates it from aggregated circuit Z. Switch G also determines that component circuit X has failed and sends a release message for component circuit X towards the destination switch I along the route G-H-I releasing all resources held by component circuit X along the way.
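The release behavior described for failure points 1 and 3 can be summarized in a short sketch that, given the route of a circuit and a failed link, lists the switches each release message visits. The function name and the list representation of a route are illustrative assumptions.

```python
def release_paths(route, failed_link):
    """Return (path toward the source, path toward the destination).

    failed_link is a pair (u, v) of adjacent switches on the route, with u
    on the source side. Each path starts at the switch that detected the
    failure. A path with a single switch means that switch is the end point
    itself and has no release message to send.
    """
    u, v = failed_link
    i = route.index(u)
    if i + 1 >= len(route) or route[i + 1] != v:
        raise ValueError("failed link is not on the route")
    toward_source = route[:i + 1][::-1]
    toward_destination = route[i + 1:]
    return toward_source, toward_destination


route_x = list("ABCDEFGHI")
failure_1 = release_paths(route_x, ("A", "B"))
failure_3 = release_paths(route_x, ("F", "G"))
```

For failure point 1 the source-side path is just switch A, which sends no release message, and the destination-side path is B-C-D-E-F-G-H-I. For failure point 3 the paths are F-E-D-C-B-A and G-H-I.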
Failure point 2 in the middle segment where component circuit X is part of aggregated circuit Z is detected by switches E and F. Both switches determine that aggregated circuit Z is affected.
Switch E sends a release message for aggregated circuit Z to its source switch C along route E-D-C. The release message will cause resources held by aggregated circuit Z to be returned, and cause switch C to restore this aggregated circuit on an alternate route (not shown) that avoids failure point 2. With that, the restoration of aggregated circuit Z is complete.
One required action for failure point 3, however, is for switch F to send an end-to-end release message for circuit X to switch A along the route F-E-D-C-B-A. In a double failure scenario such as at failure points 2 and 3, this release message never makes it to switch E, or to any of the other switches along the route of the circuit to switch A. Consequently, switch A does not know that it should restore the circuit end-to-end. A release message for circuit X sent by switch G does reach switch I, but switch I is the destination switch of the component circuit X and cannot trigger restoration. It simply de-allocates resources held by circuit X and does nothing more. Therefore, the destination tail segment of component circuit X is not restored and component circuit X remains down even though aggregated circuit Z is successfully restored.
Not only is component circuit X not restored in this scenario, the switches have no way of knowing that component circuit X has not been restored. The normal protocol function is to retransmit a release message each time a retransmit timer expires, until the message is acknowledged by its recipient. Switch F will attempt to retransmit the end-to-end release message to switch A repeatedly. The retransmitted messages will not go through until failure point 2 is repaired, which may take several hours or even days depending on the severity of the problem.
It is imperative that a network recovers quickly from failures. The typical time taken to restore circuits ranges from milliseconds if a few circuits are involved to several seconds if many circuits are involved. The entire process of detecting the failure, identifying the affected circuits, sending out release messages to the source and destination switches of each circuit, and the source switches then restoring their respective circuits, needs to be completed in a very short amount of time. Moreover, the entire process needs to work in the presence of failures.
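The double failure scenario can be reproduced with a small sketch that walks a release message hop by hop and stops at the first failed link. The function name and the representation of a failed link as an unordered pair of switches are illustrative assumptions. Failure point 2 is modeled as the link between switches E and F and failure point 3 as the link between switches F and G, as described above.

```python
def last_switch_reached(path, failed_links):
    """Walk a release message along path; return the last switch it reaches."""
    for here, following in zip(path, path[1:]):
        if frozenset((here, following)) in failed_links:
            return here
    return path[-1]


failure_2 = frozenset(("E", "F"))
failure_3 = frozenset(("F", "G"))
toward_source = list("FEDCBA")        # release sent by switch F toward A
toward_destination = list("GHI")      # release sent by switch G toward I

# Single failure at point 3: the release reaches source switch A.
single = last_switch_reached(toward_source, {failure_3})

# Double failure at points 2 and 3: the release never leaves switch F.
double = last_switch_reached(toward_source, {failure_2, failure_3})

# The release toward the destination still reaches switch I in both cases.
at_destination = last_switch_reached(toward_destination, {failure_2, failure_3})
```

Only the single-failure case ends at switch A. In the double-failure case the message is stopped at switch F, so the source switch is never told to restore component circuit X.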
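The cost of relying on retransmission can be illustrated with a simulated clock. The timer value and repair time below are invented purely for illustration, and the function is a hypothetical model rather than the behavior of any specific signaling protocol.

```python
def retransmit_until_acked(timer_s, repaired_at_s, give_up_after_s=None):
    """Simulate periodic retransmission of a release message.

    The message is assumed to be acknowledged only if it is sent at or after
    the time the failed link is repaired. Returns a tuple of the number of
    transmissions and the delivery time in seconds (None if never delivered).
    """
    now, sent = 0, 0
    while give_up_after_s is None or now <= give_up_after_s:
        sent += 1
        if now >= repaired_at_s:
            return sent, now
        now += timer_s
    return sent, None


# Assumed values: a 10 second retransmit timer and a repair after one hour.
transmissions, delivered_at = retransmit_until_acked(10, 3600)
```

With these assumed values the release message is sent 361 times and is delivered only after 3600 seconds, far outside any sub-second restoration target.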
The problem is that multiple failure scenarios prevent end-to-end circuit restoration from taking place because the component circuit release message cannot reach its source switch. The aggregated circuit will restore successfully. Since one of the tails of the component circuit in the lower level hierarchy has failed, that component circuit remains down.
What is desired is a system and method that ensures that whenever an aggregated circuit in a higher level of a network hierarchy is restored due to a failure in a link or network element, the component circuits making up the aggregation are restored as well.