Communication networks utilize a wide variety of protocols to facilitate data transfers between nodes within the network. As is well known to one of ordinary skill in the art, a network can include a relatively large number of nodes. In such large networks, bringing up the network to an operational state can be a significant undertaking. In addition, changes to the network structure and error recovery can generate message traffic that can overwhelm the network.
One class of protocols that generate significant message traffic when recovering from failures, software upgrades, and the like includes link-state routing protocols such as OSPF (Open Shortest Path First), which is used typically in IP (Internet Protocol) networks, and PNNI (Private Network-Network Interface), which is used typically in Asynchronous Transfer Mode (ATM) networks.
IP and ATM networks are generally organized into one or more areas, each of which includes a link-state database. Link-state routing protocols rely on the exchange of a relatively large number of control messages within each area as the network comes “up,” i.e., becomes operational. For example, the network nodes send and receive Link State Advertisement (LSA) messages in the OSPF protocol and PNNI Topology State Element (PTSE) messages in the PNNI protocol to enable each node to determine the network topology. As an OSPF network comes up, OSPF LSA messages are flooded throughout a network area. A given node may receive more than one copy of the same LSA message, in which case the first one is regarded as the original and the rest are regarded as duplicates. An original LSA message is acknowledged in an LSA acknowledgment message over the trunk from which it came, and copies of it are flooded over the other trunks. Duplicate messages are typically discarded after processing.
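The handling of original and duplicate LSA messages described above can be illustrated with a minimal sketch. The `Node` class, its `receive_lsa` method, and the string trunk and LSA identifiers are hypothetical names introduced only for illustration; they are not part of OSPF or of any particular implementation, and a real node would also compare LSA sequence numbers and ages.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical node that tracks which LSAs it has already seen."""
    trunks: list
    seen: set = field(default_factory=set)

    def receive_lsa(self, lsa_id, in_trunk):
        """Return the (action, trunk) pairs triggered by one received LSA copy."""
        if lsa_id in self.seen:
            # A later copy of an LSA already received is a duplicate.
            return [("discard", in_trunk)]
        self.seen.add(lsa_id)
        # The first copy is the original: acknowledge it on the incoming
        # trunk and flood copies over every other trunk.
        actions = [("ack", in_trunk)]
        actions += [("flood", t) for t in self.trunks if t != in_trunk]
        return actions


node = Node(trunks=["t1", "t2", "t3"])
first = node.receive_lsa("lsa-42", "t1")
second = node.receive_lsa("lsa-42", "t2")
```

In this sketch the first copy arriving on trunk `t1` produces one acknowledgment and two flooded copies, while the second copy arriving on `t2` is only discarded.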
Another type of OSPF control message is the HELLO message that is periodically exchanged over each trunk connecting neighboring nodes. The HELLO messages are used to determine the status of the trunks, i.e., whether a given trunk is up. There are also some timers which, if expired, result in the generation of control messages. Examples of timers include LSA retransmission timers, HELLO refresh timers and LSA refresh timers.
As the network recovers, a node and its neighboring nodes, which are interconnected by trunks, exchange HELLO messages. The exchange of HELLO messages continues periodically as long as the trunk is up. Next, the nodes perform LSA database synchronization by exchanging all the LSA headers in their respective databases. Each node then identifies any new or more recent LSA messages in the neighboring node's database and requests copies of the identified LSA messages. Each neighbor sends only those LSA messages that have been requested by the other node. In the next step, each node floods a number of new LSA messages throughout the area (or areas) to which it belongs. These new LSA messages are either the ones obtained from the neighbor or ones generated due to a change in topology (e.g., addition of a trunk and/or a node). The flooding of an LSA message by a node results in one original message and usually several duplicate messages at other nodes. An original is acknowledged and flooded over all trunks except the trunk on which the original message was received, while duplicates are simply discarded after processing.
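The database synchronization step, in which each node requests only the LSA messages that are new or more recent in its neighbor's database, can be sketched as follows. The function name `lsas_to_request` and the representation of an LSA header as an identifier mapped to a single sequence number are assumptions made for illustration; an actual OSPF implementation compares additional header fields.

```python
def lsas_to_request(local_headers, neighbor_headers):
    """Return the LSA ids that are missing locally or newer at the neighbor.

    Both arguments map a hypothetical LSA id to a sequence number,
    where a larger sequence number is assumed to mean a more recent LSA.
    """
    return sorted(
        lsa_id
        for lsa_id, seq in neighbor_headers.items()
        if lsa_id not in local_headers or seq > local_headers[lsa_id]
    )


local = {"A": 3, "B": 5}
neighbor = {"A": 4, "B": 5, "C": 1}
wanted = lsas_to_request(local, neighbor)
```

Here the node would request `A` (more recent at the neighbor) and `C` (not present locally), but not `B`, which is identical in both databases.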
During the database synchronization and flooding procedures, the nodes in the network need to process a relatively large number of messages over a short period of time causing a temporary overload at the node processors. Particularly heavily loaded are the nodes that are recovering and/or nodes with a high degree of connectivity. Node processors can typically perform the required message processing over the long term. However, over the short term the overload can cause messages to queue up and even be lost once the allowed queue size is exceeded. While certain types of messages can withstand queuing and loss, the loss of other types of messages can have a negative impact on the network, including network failure.
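The difference between long-term capacity and short-term overload can be illustrated with a simple discrete-time queue. This is a sketch under stated assumptions: the function `simulate_queue`, the per-tick arrival counts, the service rate, and the queue limit are all hypothetical values chosen for illustration and do not come from any measured network.

```python
def simulate_queue(arrivals, service_per_tick, max_queue):
    """Return (messages still queued, messages dropped) after all ticks.

    In each tick the arriving messages join the queue, any excess over
    max_queue is dropped, and then up to service_per_tick are processed.
    """
    queued = 0
    dropped = 0
    for arriving in arrivals:
        queued += arriving
        if queued > max_queue:
            dropped += queued - max_queue
            queued = max_queue
        queued -= min(queued, service_per_tick)
    return queued, dropped


# A steady load below the service rate is handled without loss.
steady = simulate_queue([5] * 10, service_per_tick=10, max_queue=60)
# A short burst exceeds the queue limit and messages are lost.
burst = simulate_queue([50, 50, 0, 0], service_per_tick=10, max_queue=60)
```

With these illustrative numbers the steady load leaves an empty queue and no loss, whereas the burst causes 30 messages to be dropped and leaves 30 still waiting after four ticks.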
For example, trunks can be lost due to excessively delayed or lost HELLO messages. HELLO messages are exchanged periodically between neighboring nodes over each trunk connection. These messages indicate the status of the associated trunk. If a HELLO message fails to arrive a predetermined number of consecutive times (typically three to four times over a period of 15 to 40 seconds), whether due to excessive queuing delay or to loss from buffer overflow, then the trunk is declared down even though it is up. If the HELLO messages are eventually processed, then the trunk is declared up, causing another change in the trunk status. Each time the trunk status changes, LSAs are flooded throughout the area. In general, if all the trunks of a node are declared down due to missed HELLO messages, then the entire node is declared down.
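The trunk status behavior described above can be sketched with a counter of consecutive missed HELLO messages. The function `trunk_status_changes` and the `miss_limit` parameter are hypothetical; the default of four misses is taken from the typical range given above. The sketch simplifies actual protocol behavior, which is driven by timers and requires the adjacency to be re-established before a trunk is usable again.

```python
def trunk_status_changes(hello_received, miss_limit=4):
    """Return (trunk is up, number of status changes).

    hello_received is a sequence of booleans, one per HELLO interval,
    that is True when a HELLO message was processed in that interval.
    Each status change would trigger LSA flooding throughout the area.
    """
    up = True
    misses = 0
    changes = 0
    for received in hello_received:
        if received:
            misses = 0
            if not up:
                up = True       # trunk declared up again
                changes += 1
        else:
            misses += 1
            if up and misses >= miss_limit:
                up = False      # trunk declared down although physically up
                changes += 1
    return up, changes


flapping = trunk_status_changes([True, False, False, False, False, True])
stable = trunk_status_changes([True, False, False, True])
```

In the first sequence four consecutive misses take the trunk down and the next processed HELLO brings it back, giving two status changes; in the second sequence the misses stay below the limit and the status never changes.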
A further difficulty that can arise in known recovery schemes is retransmission and so-called “retransmission lockout” due to excessively delayed or lost LSA acknowledgment messages. If the acknowledgment to an LSA is not received within a certain time period (typically 5 seconds), then the original LSA is retransmitted. The retransmissions add extra messages to the network, and they are typically served at a higher priority than the original transmission. This can cause a slow-down in processing and an increase in queuing for HELLO, LSA and acknowledgment messages. In extreme cases, if enough acknowledgments are outstanding, then the node processor can enter a loop in which only retransmissions are processed, causing a retransmission lockout.
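The retransmission rule can be sketched as a check over the set of outstanding LSAs. The function `due_for_retransmission`, its arguments, and the time values are hypothetical; the 5-second default reflects the typical period mentioned above.

```python
def due_for_retransmission(sent_at, acked, now, rxmt_interval=5.0):
    """Return the LSA ids whose acknowledgment is overdue at time `now`.

    sent_at maps a hypothetical LSA id to the time (in seconds) at which
    it was last sent; acked is the set of ids already acknowledged.
    """
    return sorted(
        lsa_id
        for lsa_id, sent in sent_at.items()
        if lsa_id not in acked and now - sent >= rxmt_interval
    )


sent_at = {"A": 0.0, "B": 2.0, "C": 0.0}
acked = {"C"}
at_5s = due_for_retransmission(sent_at, acked, now=5.0)
at_7s = due_for_retransmission(sent_at, acked, now=7.0)
```

If acknowledgments are delayed in a queue rather than lost, the overdue list keeps growing even though the LSAs were in fact delivered, which is how retransmissions can come to dominate the processing load.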
In addition to the OSPF messages described above, there may be other critical messages that monitor various functions of the operating system. If such messages are not processed for an extended period of time, a watchdog timer can reset the node after which the node must recover.
It would, therefore, be desirable to provide a network protocol that overcomes the aforesaid and other disadvantages.