One common form of digital communication network is a network that employs a link-state routing protocol. Asynchronous transfer mode (“ATM”) networks and Internet protocol (“IP”) networks are well-known examples of networks that employ link-state routing protocols. While a large class of link-state routing protocols exists, common examples include Open Shortest Path First (“OSPF”), primarily for IP networks, and Private Network to Network Interface (“PNNI”), primarily for ATM networks.
IP and ATM networks are organized into one or more areas or peer groups. Each area or peer group defines an interconnected group of nodes, which are connected by trunks. End points or customer premises equipment are connected to the nodes. The network is used to provide connectivity to allow data to propagate from one end point to another end point. To this end, the data may pass through several nodes and several trunks, particularly if the two end points are connected to different nodes.
The web-like structure of nodes and trunks of the network defines large numbers of alternative data paths between a particular set of two end points. In order to route traffic between the two end points, one of the data paths through the network must be selected. The selection of the path requires availability and other status information pertaining to the trunks and nodes in the network. In the case of source routing, this information is needed at the source node, and in the case of hop-by-hop routing, this information is needed at the intermediate nodes as well. In either case, each node needs to maintain data records of the status of every node and trunk of the network. For trunks, the status may include availability (up or down), an administrative cost to reflect the desirability of routing over the trunk, and the amount of reserved bandwidth for one or more traffic classes. For nodes, the status may include availability (up or down), a list of neighbors to which the node is connected, and the identification of each trunk connecting the node to a neighbor.
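By way of illustration only, the per-node status records described above might be sketched as follows. The class names, field names, and the `usable_trunks` helper are hypothetical and are not drawn from PNNI or OSPF; the sketch merely shows one way of holding the trunk and node status fields listed above.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class TrunkStatus:
    up: bool                      # availability (up or down)
    admin_cost: int               # administrative cost of routing over the trunk
    # reserved bandwidth per traffic class (class name -> amount reserved)
    reserved_bw: Dict[str, float] = field(default_factory=dict)


@dataclass
class NodeStatus:
    up: bool                      # availability (up or down)
    # neighbor node id -> ids of the trunks connecting to that neighbor
    neighbors: Dict[str, List[str]] = field(default_factory=dict)


@dataclass
class TopologyDatabase:
    """Status records that one node keeps for every node and trunk."""
    nodes: Dict[str, NodeStatus] = field(default_factory=dict)
    trunks: Dict[str, TrunkStatus] = field(default_factory=dict)

    def usable_trunks(self, node_id: str) -> List[str]:
        """Trunks leaving node_id that are up and lead to a neighbor that is up."""
        result: List[str] = []
        for neighbor_id, trunk_ids in self.nodes[node_id].neighbors.items():
            neighbor = self.nodes.get(neighbor_id)
            if neighbor is None or not neighbor.up:
                continue
            for trunk_id in trunk_ids:
                trunk = self.trunks.get(trunk_id)
                if trunk is not None and trunk.up:
                    result.append(trunk_id)
        return sorted(result)
```

A path selection procedure would consult records of this kind, for example by considering only the trunks returned by `usable_trunks` and weighting them by `admin_cost`.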
In order to maintain such status records, each node on the network from time to time broadcasts status update messages (routing control messages) to all of the other nodes to apprise the other nodes of its status as well as the status of trunks connected to it. The broadcasting is effected using a technique known as “reliable flooding” in which the source node sends the control message to all its neighbors. Each neighbor in turn sends the message to each of its own neighbors except for the one from which it received the message. This process continues until every node has received the message; any duplicate message received at a node is discarded and not flooded any further. The reliable flooding ensures that the routing control message will reach all nodes of the network quickly, but it also results in many duplicate messages.
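The flooding procedure described above can be sketched, in simplified form, as the following simulation. The function name `flood` and the adjacency-map representation are assumptions made for illustration; acknowledgments, retransmissions, and sequence numbers, which an actual reliable flooding implementation would use, are omitted.

```python
from collections import deque


def flood(adjacency, source):
    """Simulate flooding one control message from `source`.

    adjacency maps each node id to a list of its neighbor ids.
    Returns (nodes_reached, messages_sent, duplicates_discarded).
    """
    seen = {source}
    # Each queue entry is one message in flight: (sender, receiver).
    in_flight = deque((source, nbr) for nbr in adjacency[source])
    sent = 0
    duplicates = 0
    while in_flight:
        sender, receiver = in_flight.popleft()
        sent += 1
        if receiver in seen:
            # Duplicate: discard and do not flood any further.
            duplicates += 1
            continue
        seen.add(receiver)
        # Forward to every neighbor except the one the message came from.
        for nbr in adjacency[receiver]:
            if nbr != sender:
                in_flight.append((receiver, nbr))
    return seen, sent, duplicates


triangle = {"A": ["B", "C"], "B": ["A", "C"], "C": ["A", "B"]}
reached, sent, duplicates = flood(triangle, "A")
# All three nodes are reached with 4 transmissions, 2 of them duplicates.
```

Even in this three-node example, half of the transmissions are duplicates, which illustrates why flooding in a densely connected network produces many redundant messages.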
The routing control or status update messages are refreshed periodically, and are also sent whenever there is a change in status. By way of example, the PNNI protocol employs control messages known as PNNI Topology State Elements (PTSE) to provide status update information throughout the network. One or more PTSEs may be packed in a single PNNI Topology State Packet (PTSP). Each node provides status update information via one or more PTSEs under two different circumstances: (1) as periodically scheduled updates (typically once every 30 minutes) or (2) in response to significant changes in status. Examples of significant changes in status include a trunk failure, a substantial change in the reserved bandwidth of a trunk, or the recovery of a node or trunk.
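The two circumstances described above can be sketched as a simple decision function. The 30-minute refresh interval comes from the text; the threshold at which a change in reserved bandwidth counts as "substantial" is not specified here, so the 25 percent value, the function name, and its parameters are hypothetical.

```python
REFRESH_INTERVAL_S = 30 * 60    # periodic refresh, typically every 30 minutes
BW_CHANGE_THRESHOLD = 0.25      # hypothetical: fraction of capacity deemed "substantial"


def update_reason(now_s, last_sent_s, old_up, new_up,
                  old_reserved, new_reserved, capacity):
    """Return why a status update should be originated, or None if none is due."""
    # Significant change: trunk failure or recovery.
    if old_up != new_up:
        return "availability"
    # Significant change: substantial shift in reserved bandwidth.
    if abs(new_reserved - old_reserved) / capacity >= BW_CHANGE_THRESHOLD:
        return "bandwidth"
    # Otherwise only the periodic refresh can trigger an update.
    if now_s - last_sent_s >= REFRESH_INTERVAL_S:
        return "periodic"
    return None
```

For example, under these assumptions a trunk of capacity 1000 whose reserved bandwidth moves from 100 to 110 would not trigger an update, whereas a move from 100 to 400 would.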
The PNNI protocol also employs signaling messages that are used to establish or tear down “calls” or virtual circuits between endpoints of the network. Over an established virtual circuit between end points of the network, user data may be transmitted in the form of voice, facsimile, electronic mail, or otherwise. There are different types of virtual circuits including Switched Permanent Virtual Circuits (SPVC) and Switched Virtual Circuits (SVC).
The routing control messages used in OSPF are similar to those used in PNNI. The status update messages in OSPF are known as Link State Advertisements (LSAs), and one or more LSAs may be packed in a single Link State Update (LSU) message. OSPF as used for hop-by-hop routing of data packets in IP networks does not use any trunk reserved bandwidth information or signaling. However, OSPF with the Traffic Engineering extension (OSPF-TE) does use trunk reserved bandwidth information, and signaling is used for establishing or tearing down Multiprotocol Label Switching (MPLS) Label Switched Paths (LSPs). Other link-state routing protocols are likewise configured. The PNNI terminology is used herein, but the description also applies to other link-state protocols in a generic sense.
From time to time, scheduled and/or unscheduled events alter the status of one or more network entities (nodes and/or trunks). Scheduled events may include bringing down a subset of nodes or trunks to perform software upgrades, testing, or the like and bringing them back up at a later time. Unscheduled events may include failure of a subset of nodes and/or trunks and bringing them back up at a later time. In either case, the change in status of the nodes and/or trunks triggers a flooding of control messages as discussed above. For example, if a trunk fails or recovers, then the nodes at its two endpoints would generate routing control messages. If a node fails, then each trunk connected to it would also fail, and routing control messages would be generated by the node at the other endpoint of each such trunk. In addition to the initial flooding of status update messages, as nodes or trunks fail, many SPVCs and SVCs passing through them need to be rerouted through other paths, thereby generating additional signaling control messages. Furthermore, the rerouting of SPVCs and SVCs may cause many trunks to experience significant changes in reserved bandwidth, which would also generate status update messages. As nodes and trunks recover, some existing SPVCs and SVCs may reroute to utilize a more optimal path, which in turn would generate more signaling and routing control messages.
Thus, node and/or trunk failures can cause the propagation of multiple control messages, thereby forming a “storm”. If large numbers of control messages are generated over a short period of time, then processors within the nodes that process the control data may begin to overload, the memory used to store the messages may become exhausted, and/or trunks may become too busy transporting all of the control messages. The overloading of the node processors and/or the trunks could delay routing control messages, and memory exhaustion may cause them to be dropped. This may result in many retransmissions of the dropped control messages, thereby worsening the storm. Moreover, if particular status maintenance messages, for example, the keep alive or Hello messages used to maintain trunk status between neighbors, are delayed excessively, then the trunk may be declared down, which may cause generation of many routing and signaling control messages. In addition, when the trunk recovers (or is declared up), more routing and signaling control messages would be generated.
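The effect of delayed Hello messages described above can be illustrated with the following minimal sketch, which assumes that a trunk is declared down when no Hello arrives within a fixed interval after the previous one. The function name and the `dead_interval_s` parameter are hypothetical, and the actual timers differ between protocols; only gaps between successive arrivals are examined.

```python
def trunk_declared_down_at(hello_arrivals_s, dead_interval_s):
    """Return the time at which the trunk would be declared down, or None.

    hello_arrivals_s: ascending arrival times (seconds) of Hello messages.
    dead_interval_s:  assumed maximum tolerated gap between successive Hellos.
    """
    if not hello_arrivals_s:
        return None
    last = hello_arrivals_s[0]
    for arrival in hello_arrivals_s[1:]:
        if arrival - last > dead_interval_s:
            # The timer expired before this (delayed) Hello arrived.
            return last + dead_interval_s
        last = arrival
    return None
```

With a 40-second interval, Hellos arriving at 0, 10, 20, and 30 seconds keep the trunk up, whereas Hellos arriving at 0, 10, 65, and 75 seconds would cause the trunk to be declared down at 50 seconds even though the trunk itself never failed.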
Thus, it has been determined that scheduled and unscheduled events can trigger a control message storm that can create positive feedback to cause additional events, thereby increasing the severity of the control message storm. Such a storm having positive feedback and potential for propagation from one congested node to others, referred to herein as a network congestion event, can create severe congestion and even failure of the network.
In response to network congestion events, attempts may be made to inhibit escalation of the event so that the network continues to operate with stability. However, despite attempts to inhibit escalation of network congestion events, some remedial measures may not be enough to preserve full connectivity in the network. Loss of connectivity within the network, combined with the additional control message storm resulting therefrom, can result in total network failure.
There is a need, therefore, for inhibiting total network failure even when other remedial measures cannot preserve full network connectivity.