Communication networks can include various types of protocols that route data through the network. One such type of protocol is referred to as a link-state protocol. Known link-state protocols include Open Shortest Path First (OSPF), which is used in Internet Protocol (IP) networks, and Private Network-Network Interface (PNNI), which is used in Asynchronous Transfer Mode (ATM) networks.
IP and ATM networks are generally organized into one or more areas, each of which includes a link-state database. Link-state routing protocols rely on the exchange of a relatively large number of control messages within each area as the network comes “up,” i.e., becomes operational. For example, the network nodes send and receive Link State Advertisement (LSA) messages in the OSPF protocol and PNNI Topology State Element (PTSE) messages in the PNNI protocol for enabling each node to determine the network topology. As the (OSPF) network comes up, OSPF LSA messages are flooded throughout a network area. A given node can receive more than one copy of the same LSA message, in which case the first LSA message is regarded as the original and the other LSA messages are regarded as duplicates. An original LSA message is acknowledged over the trunk from which it came and copies of the message are flooded over the other trunks. Duplicate messages are typically discarded after processing.
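The handling of original and duplicate LSA messages described above can be illustrated with a minimal sketch. The class name `Node`, the method `receive_lsa`, and the action labels are hypothetical names chosen for illustration only; they are not taken from the OSPF specification, and real implementations track considerably more state (sequence numbers, ages, neighbor states).

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical node model: only the trunk list and seen-LSA set are kept."""
    trunks: list                               # identifiers of attached trunks
    seen: set = field(default_factory=set)     # LSA ids already received

    def receive_lsa(self, lsa_id, arrival_trunk):
        """Return the (action, trunk) pairs triggered by one LSA arrival."""
        if lsa_id in self.seen:
            # Duplicate copy: it is processed and then discarded, nothing is flooded.
            return [("discard", arrival_trunk)]
        # Original copy: acknowledge on the arrival trunk, flood on all others.
        self.seen.add(lsa_id)
        actions = [("ack", arrival_trunk)]
        actions += [("flood", t) for t in self.trunks if t != arrival_trunk]
        return actions


node = Node(trunks=["a", "b", "c"])
first = node.receive_lsa("lsa-1", "a")    # original copy
second = node.receive_lsa("lsa-1", "b")   # duplicate of the same LSA
```

Note that even a duplicate consumes processing time at the node before it is discarded, which is why the number of trunks matters during a burst of LSA messages.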
Another type of OSPF control message is the HELLO message that is periodically exchanged over each trunk connecting neighboring nodes. The HELLO messages are used to determine the status of the trunks, i.e., whether a given trunk is up. There are also some timers which, if expired, result in the generation of control messages. Examples of timers include LSA retransmission timers, HELLO refresh timers and LSA refresh timers.
Generally, link-state routing protocols do not specify the order in which the various control messages are to be serviced when more than one message is outstanding at a network node processor. In accordance with conventional practices, the control messages are serviced in a First-Come-First-Served (FCFS) manner. In some instances, control messages triggered by the expiry of a timer are serviced at a higher priority than other messages without making any further distinctions between the message types.
One disadvantage with such link-state message processing schemes is that certain message types may not be timely processed due to network congestion whenever a relatively large number of LSA messages is generated within a relatively short time interval in the network. Such an event is referred to as an “LSA storm.” The network congestion can be the result of nodes/trunks going “down” or coming back up. An LSA storm can be generated due to the failure or recovery of a single trunk, group of trunks, single node, or group of nodes. The failure/recovery can result from a hardware failure or software upgrade, for example. The LSA storm can also be generated due to a near-synchronous refresh of large numbers of LSAs and to sudden bandwidth changes in virtual circuits in the network.
One problem associated with LSA storms is the loss of trunks due to excessively delayed processing of HELLO messages. As long as a trunk between neighboring nodes is considered up, HELLO messages are exchanged between the nodes over the trunk periodically with period T, which is typically between about 5 and 10 seconds. If one of the neighboring nodes does not receive a HELLO message for a predetermined number of consecutive times, e.g., four, the node declares the trunk to be down.
During an LSA storm, control messages can pile up at a node such that HELLO messages from neighboring nodes may not be processed in a timely manner. For example, HELLO messages are queued behind other control messages that arrived at the node before them. Furthermore, if timer-triggered messages are served at a higher priority, then the HELLO messages also have to wait behind control messages triggered by the expiration of a timer. If the total waiting time of a HELLO message is longer than a specified duration nT, where n is the number of consecutive missed HELLO messages that causes a trunk to be declared down, and which is typically between 15 and 40 seconds, then the trunk will be declared down even though it is up.
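As a rough illustration of this effect, the following sketch serves a queue in FCFS order and measures how long a HELLO message waits before it is reached. The function names, the message labels, and the example figures (25,000 queued messages, 1 ms service time, n = 4) are assumptions made for illustration; the sketch also ignores any timer-triggered messages served at higher priority, which would only lengthen the wait.

```python
from collections import deque


def fcfs_hello_wait(queue, service_time_ms):
    """Seconds until the first HELLO in a FCFS queue begins service, or None."""
    pending = deque(queue)
    elapsed_ms = 0
    while pending:
        kind = pending.popleft()
        if kind == "HELLO":
            return elapsed_ms / 1000
        elapsed_ms += service_time_ms   # every earlier message is served first
    return None                         # no HELLO was queued


def declared_down(wait_s, hello_period_s, missed_limit):
    """True if the wait exceeds nT, with n = missed_limit and T = hello_period_s."""
    return wait_s > missed_limit * hello_period_s


# Hypothetical backlog: 25,000 LSA-related messages ahead of one HELLO.
backlog = ["LSA"] * 25000 + ["HELLO"]
wait = fcfs_hello_wait(backlog, service_time_ms=1)
```

With these assumed figures the HELLO waits 25 seconds, which exceeds nT for T = 5 seconds and n = 4 (20 seconds) but not for T = 10 seconds and n = 4 (40 seconds).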
For example, a node having 50 trunks and a 1 millisecond processing time for receiving or transmitting a message over a trunk can experience a HELLO message queuing delay of about 15 seconds with an LSA storm of size 150, and a queuing delay of about 40 seconds with an LSA storm of size 400. The LSA storm size corresponds to the number of LSA messages in an LSA storm. If the processing time is doubled, i.e., to 2 milliseconds, then the same queuing delays would result from LSA storms half as large.
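The figures above are consistent with a simple back-of-the-envelope model, sketched below. The model assumes that each LSA in the storm costs the node roughly two message operations per trunk (for instance, one reception and one transmission); this factor of two is an assumption chosen because it reproduces the quoted delays, not a value stated in the text. The function and parameter names are likewise hypothetical.

```python
def hello_queuing_delay_s(storm_size, trunks, proc_time_ms, ops_per_trunk=2):
    """Approximate HELLO queuing delay in seconds.

    Assumes each LSA costs ops_per_trunk message operations on every trunk,
    each taking proc_time_ms milliseconds (ops_per_trunk=2 is an assumption).
    """
    total_ops = storm_size * trunks * ops_per_trunk
    return total_ops * proc_time_ms / 1000


delay_150 = hello_queuing_delay_s(150, trunks=50, proc_time_ms=1)
delay_400 = hello_queuing_delay_s(400, trunks=50, proc_time_ms=1)
delay_75_slow = hello_queuing_delay_s(75, trunks=50, proc_time_ms=2)
```

Under this model the delay scales linearly with storm size, trunk count, and per-message processing time, so doubling the processing time halves the storm size needed to reach a given delay.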
Declaring a trunk down while it is actually up is disadvantageous for several reasons. Declaring the trunk down triggers the flooding of LSA messages to the entire area (or areas) in which the trunk is located. In addition, all Virtual Circuits (VCs) over the trunk are released and rerouted. Once the waiting time of a HELLO packet is over and the message is processed, the node may declare the trunk up causing possible further VC rerouting. Declaring trunks down while they are up also results in wasted bandwidth and inefficient routing. Furthermore, erroneously declaring trunks down on a relatively large scale can cause the entire network (or area of the network) to enter an oscillatory state that can bring the network down. Thus, LSA storm effects are exacerbated by the very events of trunks going down and up.
A further disadvantage associated with conventional link state message processing is the occurrence of so-called LSA retransmission lockout in which the node processor enters a loop that processes only retransmissions and other timer-triggered messages. Thus, the node processor does not process HELLO, LSA and LSA acknowledgement messages arriving from other nodes while in the loop. LSA retransmission lockout can occur when timer-triggered messages are served at a higher priority than other messages and the timer-triggered messages are generated at a rate equal to or higher than the rate at which they can be processed by the node processor.
LSA retransmission lockout typically results from a combination of events. There are generally three main types of timers: HELLO refresh timers, LSA refresh timers, and LSA retransmission timers. The rates of message generation due to the expiry of the HELLO and the LSA refresh timers are fixed and independent of network conditions (typically one HELLO message per 5 to 10 seconds per trunk and one LSA refresh every 30 minutes per LSA originated by the node). Thus, these messages require only a relatively small fixed fraction of the node processing power.
The rate of message generation due to the expiry of LSA retransmission timers is typically one message every 5 seconds per unacknowledged LSA. This rate depends upon the level of network congestion. Under normal operating conditions the rate of message generation is close to zero since very few LSAs remain unacknowledged for more than 5 seconds. However, under heavy network congestion generated by an LSA storm, it is possible for many LSAs to remain unacknowledged for more than 5 seconds due to congestion either at the transmitting node or at the receiving node, such that LSA retransmission lockout can occur.
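The lockout condition described above (timer-triggered messages generated at a rate equal to or higher than the rate at which they can be processed) can be expressed as a short check. This is a simplified sketch: it considers only LSA retransmissions, ignores the small fixed load from HELLO and LSA refresh timers, and assumes one message operation per retransmission. The function names and the 1 millisecond processing time used in the example are illustrative assumptions.

```python
RETRANSMIT_INTERVAL_S = 5   # one retransmission every 5 seconds per unacknowledged LSA


def retransmission_rate(unacked_lsas, interval_s=RETRANSMIT_INTERVAL_S):
    """Timer-triggered retransmissions generated per second."""
    return unacked_lsas / interval_s


def service_rate(proc_time_ms):
    """Messages the node processor can handle per second."""
    return 1000 / proc_time_ms


def lockout_possible(unacked_lsas, proc_time_ms, interval_s=RETRANSMIT_INTERVAL_S):
    """True when retransmissions alone can saturate the node processor."""
    return retransmission_rate(unacked_lsas, interval_s) >= service_rate(proc_time_ms)


# Assumed 1 ms per message: the processor handles 1000 messages per second,
# so the threshold is 5 * 1000 = 5000 unacknowledged LSAs.
below = lockout_possible(4999, proc_time_ms=1)
at_threshold = lockout_possible(5000, proc_time_ms=1)
```

Under these assumptions the threshold number of unacknowledged LSAs is the retransmission interval divided by the per-message processing time, so a slower processor reaches lockout with proportionally fewer unacknowledged LSAs.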
Once a node processor enters a retransmission lockout state, it does not process any messages that are not triggered by a timer. This includes acknowledgements to earlier transmissions and retransmissions that would help the node processor to get out of the retransmission lockout state. Eventually the node processor can get out of the retransmission lockout since the LSAs being retransmitted age out. However, this happens after an unacceptably long time, e.g., one hour, before which the node typically goes down.
It would, therefore, be desirable to provide a link-state network protocol that enhances the ability of a network to handle LSA storms.