Modern communication networks are designed as a layered structure in which lower layers provide communication services for upper layers. These upper layers each have a topology which, when configured logically, may be referred to as an overlay topology. For efficiency reasons, an overlay topology might not have a one-to-one correspondence with the underlying topology that supports it.
Each layer commonly has its own responses to changes in the network, including alarms in the case of failure. A typical example of a layered topology which illustrates the interplay of alarms and layers is the case of a SONET or SDH (Synchronous Digital Hierarchy) ring supporting IP (Internet Protocol) traffic that is routed by the Open Shortest Path First (OSPF) protocol. A network may detect a failure at the SONET layer and respond within 100 ms to switch traffic over using Automatic Protection Switching (APS). If it is unable to take this action then, on the scale of tens of seconds (typically 30 to 40 seconds), OSPF neighbours whose adjacency was supported by the failed equipment will notice the loss of adjacency and take actions to re-converge the network to support IP routing, e.g. by flooding Link State Advertisements and recomputing routing tables.
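The two response time scales described above can be put side by side in a minimal Python sketch. The function name `first_responder` and the constant names are hypothetical, introduced only for illustration; the figures are the ones quoted in the text and will vary between deployments.

```python
# Time scales taken from the text; exact values vary by deployment.
APS_SWITCHOVER_S = 0.1           # SONET APS: within about 100 ms
OSPF_DETECTION_S = (30.0, 40.0)  # OSPF loss of adjacency: typically 30 to 40 s


def first_responder(aps_succeeds):
    """Return (mechanism, worst-case seconds before that mechanism reacts)."""
    if aps_succeeds:
        return ("SONET APS switchover", APS_SWITCHOVER_S)
    # APS could not act, so recovery waits for OSPF neighbours to notice
    # the lost adjacency, flood LSAs and recompute routing tables.
    return ("OSPF re-convergence", OSPF_DETECTION_S[1])


for aps_ok in (True, False):
    mechanism, seconds = first_responder(aps_ok)
    print(f"{mechanism}: reacts within about {seconds} s")
```

The sketch models only which layer reacts first and roughly when; it does not model the flooding or route computation time that follows OSPF detection.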
Alarms at different network layers are capable of providing different amounts of information about the nature of a network fault. Lower layer alarms provide more detail than higher layer alarms. For example, SONET systems typically produce thousands to millions of alarms in response to a single fibre break. SONET networks themselves are designed as layered systems and each layer propagates specific alarms up to the management plane. These alarms can be correlated to determine the precise nature of a fault. Loss of an IP adjacency supported by a failed fibre, however, indicates only that IP layer messages are no longer passing between peers, whether because of a failure at some SONET layer or because of a break in the fibre itself.
Not all networks which carry IP traffic have SONET protection beneath them. To provide fast restoration at the IP layer for hop-by-hop routed traffic or for Multi-Protocol Label Switching (MPLS) label switched paths, schemes have been proposed that act on loss of IP layer adjacency in order to route traffic around failed equipment to a point from which it can be forwarded normally again. The requirement for these schemes is to get traffic flowing again on the scale of tens of milliseconds to support services such as voice or video over IP.
A particular problem with current fault recovery systems is that of determining an effective recovery path. As the upper and lower layers of the network may have differing topologies, there is a risk that a minor ‘route around’ path change may not avoid the fault. There is the further problem of determining the precise nature of the fault. For example, it may not be easy to determine rapidly whether a network failure has resulted from a fault in a switch, a router or a fibre path.
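The risk that a ‘route around’ path fails to avoid the fault can be illustrated with a small Python sketch. It assumes a hypothetical table, `OVERLAY_TO_FIBRES`, recording which fibre segments carry each IP layer link; the node names, fibre names and helper functions are all invented for illustration and do not describe any particular recovery scheme.

```python
# Hypothetical mapping from each IP-layer (overlay) link to the fibre
# segments that carry it in the lower layer. All names are illustrative.
OVERLAY_TO_FIBRES = {
    ("A", "B"): {"f1"},
    ("A", "C"): {"f1", "f2"},   # shares fibre f1 with link A-B
    ("C", "B"): {"f3"},
    ("A", "D"): {"f4"},
    ("D", "B"): {"f5"},
}


def fibres_for(link):
    """Return the fibre segments under an overlay link, in either direction."""
    a, b = link
    return OVERLAY_TO_FIBRES.get((a, b)) or OVERLAY_TO_FIBRES.get((b, a)) or set()


def path_avoids_fault(path, failed_fibre):
    """True if no hop of the overlay path rides on the failed fibre."""
    hops = zip(path, path[1:])
    return all(failed_fibre not in fibres_for(hop) for hop in hops)


# Fibre f1 breaks, taking down overlay link A-B.
print(path_avoids_fault(["A", "C", "B"], "f1"))  # False
print(path_avoids_fault(["A", "D", "B"], "f1"))  # True
```

At the IP layer the detour A-C-B looks like a valid way around the failed link A-B, but because link A-C is carried on the same fibre it fails as well; only the detour through D is disjoint from the fault in the underlying topology.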