Topology state routing protocols are employed in communications networks in order to disseminate or advertise topology state information among nodes and node clusters within such networks. The advertised topology state information is in turn utilized to compute optimized paths for communications throughout a given network. As typically understood by those skilled in this art, topology state information signifies state information for the network domain as a whole. On the other hand, reference is typically made in the art to local state information when dealing with state information which is locally originated by a particular network node. Local link status information will reflect a given node's understanding of the status of communication with its peer nodes. In the present application, reference to state information will signify both topology state information and local state information.
The state information for a network topology is typically stored in a synchronization database, also called a topology database, which is associated with each network node of a routing domain. Typically, the synchronization database will be stored within the network nodes in question. Database synchronization is an existing topology state routing protocol mechanism which ensures that adjacent nodes within a network share a common view of the overall topology of the network. A network node may be a switch, a router, or other data processing system.
When multi-node network architectures that operate according to topology state routing protocols are initialized, for instance at protocol startup after a network outage, the topology databases of the various nodes of the network must each be synchronized with those of their respective neighboring nodes. As known to those of skill in this art, such synchronization of the topology databases as aforesaid is required in order for routing information to be shared so as to allow the data services of each network node to be used. When synchronization between two neighboring nodes is complete, the link between such neighboring nodes can be utilized for the provision of data services.
As understood by those skilled in the art, synchronization between two neighboring nodes is performed according to varying techniques. In the Private Network-Node Interface or Private Network-to-Network Interface (“PNNI”) protocol, by way of example, such synchronization is conducted in two stages. First, messages are dispatched by each of the network nodes according to the known “Hello” protocol in order to establish two-way communications with neighboring nodes. Second, topology database information is exchanged between such neighboring nodes until their topology state databases are synchronized. The PNNI protocol is specified in a document entitled “Private Network Interface Specification Version 1.1”, ATM Forum Document No. af-pnni-0055.002, dated April, 2002 (the “PNNI Specification”). The PNNI Specification is hereby incorporated by reference. The acronym “ATM”, of course, stands for “Asynchronous Transfer Mode”.
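By way of illustration only, the two-stage procedure described above may be sketched as follows. This is a simplified model and not the state machines defined in the PNNI Specification. The names Node, exchange_hellos, synchronize_databases and bring_up_link are hypothetical, and the representation of a topology database as a mapping from an element identifier to a sequence number is an assumption made for the purposes of the example.

```python
class Node:
    """Hypothetical network node holding a simplified topology database."""

    def __init__(self, name, database):
        self.name = name
        # Assumption: element identifier -> sequence number (higher is newer).
        self.database = dict(database)
        # Names of neighbors from which a Hello has been received.
        self.heard_from = set()


def exchange_hellos(a, b):
    """Stage 1: each node sends a Hello to the other.

    Two-way communication is considered established once each node
    has heard from the other.
    """
    a.heard_from.add(b.name)
    b.heard_from.add(a.name)
    return b.name in a.heard_from and a.name in b.heard_from


def synchronize_databases(a, b):
    """Stage 2: exchange database contents until both views agree.

    For every element known to either node, both nodes keep the
    newest version held by either of them.
    """
    for element in set(a.database) | set(b.database):
        newest = max(a.database.get(element, -1), b.database.get(element, -1))
        a.database[element] = newest
        b.database[element] = newest
    return a.database == b.database


def bring_up_link(a, b):
    """Return True once the link between a and b is usable for data services."""
    if not exchange_hellos(a, b):
        return False
    return synchronize_databases(a, b)
```

In this sketch, a call to bring_up_link for two nodes holding different databases leaves both nodes with the union of their elements, each at its newest version, which corresponds to the point at which the link may be utilized for the provision of data services.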
The synchronization described above must take place between and among all neighboring nodes that are affected by a network outage or other cause of network or nodal failure. The synchronization should be performed and completed as quickly as practicable in order to reduce network convergence time, thereby lessening the non-availability of network services during startup and minimizing the time during which the network is not operational on account of the disruption of its routing behaviour.
The discussion above has addressed the impact of a failure affecting a network. However, it will be appreciated by those versed in this art that a failure concerning one or more physical or logical nodes may also result in some measure of disruption to the routing capabilities of the neighboring nodes or devices that are serviced by the failed ordinary node. Although in some node architectures it may be possible to retain certain network functions such as packet forwarding or call processing in the event of a routing function failure, topology state protocols such as PNNI require each network node of a domain to synchronize a topology database with its neighbors before being admitted to the routing system. Such topology database synchronization must take place in these network protocols in order to recover from the failure of a node. The synchronization process may consume seconds or minutes or longer in the overall scheme of recovery, depending on the circumstances. This recovery time may be such as to be unacceptable to service providers or end users. One reason why the recovery time may be lengthy is that most implementations have limited database synchronization resources that must be shared. If all nodes trying to participate in a network failure recovery have a limited amount of resources, then there needs to be a controlled way to ensure that these nodes use these resources optimally so as to make the network recovery as efficient as possible; otherwise, recovery delays as mentioned above can be expected.
During the synchronization, network devices serviced by a failed node will be impacted and hence routing functionality may very well suffer disruption. While the discussion above has focussed on the challenges surrounding recovery from a network failure, those skilled in this art will understand that analogous problems arise stemming from other events which would require a node to undertake a synchronization of its topology database, for instance a failure at the network node level or a reset of the routing processor associated with a network node. By way of example, in some topology state protocols, certain nodes in a communications network may take on distinguished or additional responsibilities in order to make the routing function for the network operate properly. In the Open Shortest Path First (“OSPF”) IP routing protocol as described in J. Moy: “OSPF Version 2”, STD 54, RFC 2328, dated April, 1998, a node identified as the Designated Router (“DR”) would assume such responsibilities. Similarly, in the PNNI protocol, responsibilities of this nature are assumed by a node termed the Peer Group Leader (“PGL”). As compared to the failure of an ordinary network node, a failure affecting a physical node designated with the foregoing additional responsibilities may conceivably impact a relatively larger portion of the network involving network nodes that are dependent on the said failed node. If there is a delay in the database synchronization process of the dependent nodes (i.e., owing to the lack of a controlled way of synchronizing a larger number of nodes, each having limited database synchronization resources), then the PGL or DR function may also be delayed. Hence, a greater number of nodes may be impacted due to this delay.
As known to those skilled in this art, current procedures and methodologies in respect of synchronization require that all affected neighboring nodes be involved simultaneously in a network restart or recovery. However, as network sizes increase and the number of neighbors that each network node must synchronize with grows, synchronizing databases with multiple neighbors at once is becoming increasingly problematic because the process of synchronization is highly resource intensive. Many nodal resources are called upon to engage in synchronization activity with a neighboring node, be they those relating to memory, buffer management or processing capabilities. Synchronizing all affected neighboring nodes at the same time can therefore impede the performance of the synchronizing nodes such that all or a part of the synchronizing nodes may not ultimately achieve synchronization in all cases. Alternatively, the resource commitments as previously described according to known methods of synchronization may be such as to cause the synchronization process to stall and to be restarted. This potentially could negatively affect the startup time pertaining to the entirety of the network, and in certain cases the synchronizing network nodes may become so overwhelmed as to cause cascading failures in the node until the node is inoperable. Furthermore, in larger network architectures, the synchronization burden may be so great under current practices as to require network restart one network node at a time to ensure a proper re-establishment of the routing functionality of the network. This approach is also very cumbersome and manually intensive, and it increases the time it takes the failed network to recover.
More recent approaches to network synchronization have sought to limit the number of nodal neighbors that any given network node can attempt to synchronize with at the same time. For instance, those skilled in this art understand that this limit can be the engineered maximum number of simultaneous synching neighbors for the node. Typically, the maximum number of synchronization neighbors is much less than the maximum number of neighbors supported by the node for routing purposes. The latter number continues to grow as networking requirements worldwide continue to increase. As the latter number continues to grow, while the former does not grow as much, the probability that two neighboring nodes select each other for synchronization at the same time goes down, which increases the expected convergence and recovery time. If pair-wise re-synchronization is not controlled, and is left to random chance as is the case in present networks, the network may never recover at all, seriously impacting services.
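The effect described above may be illustrated by a simple calculation. The sketch below rests on assumptions that are not stated in the foregoing discussion: each of two adjacent nodes has the same number of neighbors, and each independently selects, uniformly at random, as many neighbors as its simultaneous synchronization limit permits. The function name mutual_selection_probability and its parameters are hypothetical. Under these assumptions, a given node selects a particular neighbor with probability k/n, where k is the synchronization limit and n is the number of neighbors, so that both nodes select each other in the same round with probability (k/n) squared.

```python
from fractions import Fraction


def mutual_selection_probability(num_neighbors, max_sync):
    """Probability that two adjacent nodes pick each other in the same round.

    Assumptions (illustrative only): both nodes have num_neighbors
    neighbors, and each independently picks max_sync of them uniformly
    at random.
    """
    if num_neighbors <= 0:
        raise ValueError("num_neighbors must be positive")
    # A node cannot select more neighbors than it has.
    k = min(max_sync, num_neighbors)
    # Chance that one node includes a particular neighbor in its selection.
    p_one_side = Fraction(k, num_neighbors)
    # Both sides must choose each other; the choices are assumed independent.
    return p_one_side * p_one_side


# A fixed limit of 5 simultaneous synchronizations with a growing neighbor count.
print(mutual_selection_probability(10, 5))   # 1/4
print(mutual_selection_probability(100, 5))  # 1/400
```

With the limit held at five, a tenfold increase in the number of neighbors reduces the chance of a mutual selection in any given round by a factor of one hundred in this model, which is consistent with the longer convergence and recovery times noted above.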
Accordingly, prior art topology state routing protocols present problems and challenges when faced with a situation of recovery from a nodal failure or with other situations which may require a node to synchronize its topology database once it has previously done so, and these problems and challenges arise whether or not the node immediately affected by the failure has distinguished responsibilities. First, known recovery mechanisms typically disrupt the routing functions of at least a part of a network and cause a service impact to certain of the devices utilizing the network. The portion of the network affected will vary in the circumstances. For instance, the impacted portion of the network can be expected to be more extensive for a node performing distinguished functions than is the case for a node that does not perform such functions. As well, the impacted portion can be expected to be more expansive for a failure concerning a PNNI Peer Group Leader than for one which influences an OSPF Designated Router. Second, the time required to recover from a node or link failure will vary, but may be in the order of up to several minutes or longer. The reasons for recovery delay include a large number of neighbors to synchronize with, limited database synchronization resources, and mismatched neighbors that can lead to a “haphazard” process for re-synchronizing and recovering the failed portion of the network. As mentioned above, this time frame may be unacceptable to certain service providers or end users. Third, since many nodes will have to be made aware of the failure and are therefore required to participate in the recovery process, network resources in the nature of bandwidth and processing time will be diverted. This will detract from other network activities in general and may decrease the performance and stability of the network routing system in particular for a longer than necessary period of time.
A need therefore exists for an improved method and system for node re-synchronization in communications networks. Accordingly, a solution that addresses, at least in part, the above and other shortcomings is desired.