Topology state routing protocols are employed in communications networks in order to disseminate or advertise topology state information among nodes and node clusters within such networks. The advertised topology state information is in turn utilized to compute optimized paths for communications throughout a given network. As used in the present application, reference to topology state information signifies state information for the network domain as a whole. In certain network protocols, topology state information includes both link state information and nodal state information. For instance, link state information will include such attributes as link characteristics, link operational status, port identifiers and remote neighbour information concerning adjacent neighbour nodes. Nodal state information will include such attributes as node identifiers, peer group identifiers, distinguished node election status, distinguished node leadership status and local reachable address information.
Whereas topology state information will refer to state information for a network domain as a whole, the present application will make reference to local state information when dealing with state information which is locally originated by a particular network node. Local link state information will reflect a given node's understanding of the status of communication with its peer nodes. Thus, local link state information, similarly to topology link state information, will also include such attributes as link characteristics, link operational status, port identifiers and remote neighbour information concerning adjacent neighbour nodes, but these will pertain to a given network node as opposed to a variety of nodes forming part of a network domain. Likewise, local nodal state information will comprise such attributes as node identifiers, peer group identifiers, distinguished node election status, distinguished node leadership status and local reachable address information. Again, these will pertain to a given node when reference is made to local nodal state information, instead of pertaining to the network domain as a whole when reference is made to topology nodal state information. In the present application, reference to state information will signify both topology state information and local state information.
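By way of illustration only, the distinction drawn above between local state information and topology state information may be sketched as a set of simple records. The class and field names below (LinkState, NodalState, LocalState, TopologyState and their members) are hypothetical and are not drawn from any protocol specification; the sketch merely assumes that the topology state of a domain can be viewed as the collection of the local state originated by each node.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class LinkState:
    """State of one link as seen by the node that originates it."""
    port_id: int
    operational: bool
    remote_neighbour_id: str
    characteristics: Dict[str, float] = field(default_factory=dict)


@dataclass
class NodalState:
    """State describing the originating node itself."""
    node_id: str
    peer_group_id: str
    election_status: str = "idle"   # hypothetical status label
    is_leader: bool = False
    reachable_addresses: List[str] = field(default_factory=list)


@dataclass
class LocalState:
    """State information locally originated by a single node."""
    nodal: NodalState
    links: List[LinkState] = field(default_factory=list)


@dataclass
class TopologyState:
    """State for the domain as a whole: one LocalState per known node."""
    by_node: Dict[str, LocalState] = field(default_factory=dict)

    def advertise(self, local: LocalState) -> None:
        # Record (or refresh) the state originated by one node.
        self.by_node[local.nodal.node_id] = local

    def local_view(self, node_id: str) -> LocalState:
        # The local state of a node is its own entry in the topology state.
        return self.by_node[node_id]
```

Under these assumptions, the same attributes appear at both scopes; only the set of nodes to which they pertain differs.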
In some known topology state protocols, certain nodes in a communications network may take on distinguished or additional responsibilities in order to make the routing function for the network operate properly. For instance, in the Open Shortest Path First (OSPF) IP routing protocol as described in J. Moy: “OSPF Version 2”, STD 54, RFC 2328, dated April 1998, a node identified as the Designated Router (DR) would assume such responsibilities. Similarly, in the Private Network-Node Interface or Private Network-to-Network Interface (PNNI) protocol, responsibilities of this nature are assumed by a node termed the Peer Group Leader (PGL). The PNNI protocol is specified in the documents entitled: (i) “Private Network-Network Interface Specification Version 1.0”, ATM Forum document no. af-pnni-0055.000 dated March 1996, (ii) “Private Network-Network Interface Specification Version 1.0 Addendum (Soft PVC MIB)”, ATM Forum document no. af-pnni-0066.000 dated September 1996 and (iii) “Addendum to PNNI V 1.0 for ABR parameter negotiation”, ATM Forum document no. af-pnni-0075.000 dated January 1997, together with amendments found in (iv) “PNNI V1.0 Errata and PICS”, ATM Forum document no. af-pnni-0081.000 dated May 1997 (hereafter all of the foregoing documents (i) through (iv), inclusively, are collectively referred to as the “PNNI Specification”). The PNNI Specification is hereby incorporated by reference.
A given physical node within a network space may acquire distinguished network responsibilities of the type mentioned above by a process known as distributed election. In a scheme of distributed election, all nodes at a particular level of a network hierarchy will communicate to select the node which is to assume additional tasks or responsibilities in relation to the topology state protocol. Those skilled in this art will understand that performing the process of distributed election will take varying amounts of time depending on the particular network environment. As well, if due to downtime the distinguishing position is not being filled by a given network node, the routing functions of a portion of the network or of the network domain as a whole may exhibit reduced capabilities or inefficiency during the downtime interval. Thus, it can be expected that in communications networks which utilize topology state protocols, a recovery interval must be tolerated by the network routing system subsequent to the failure of a network node. For instance, this may occur to varying degrees of severity whenever the failed node impacts the functions of an elected network node having the additional responsibilities referred to earlier.
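The scheme of distributed election described above may be illustrated by the following minimal sketch. It assumes, purely for the purpose of illustration, that each node advertises a numeric priority, that a priority of zero marks a node as ineligible, and that ties are broken in favour of the higher node identifier. The function name elect and the node identifiers are hypothetical, and the sketch omits the timers and message exchanges which an actual protocol would require.

```python
from typing import Dict, Optional


def elect(priorities: Dict[str, int]) -> Optional[str]:
    """Return the node selected from the advertised priorities.

    Every node that runs this function over the same set of
    advertisements reaches the same result, which is what allows the
    election to be distributed rather than centrally decided.
    """
    # Assumption: a priority of zero means the node is not a candidate.
    eligible = {node: prio for node, prio in priorities.items() if prio > 0}
    if not eligible:
        return None
    # Highest priority wins; ties go to the higher node identifier.
    return max(eligible, key=lambda node: (eligible[node], node))


# Advertisements as seen by every node at one level of the hierarchy.
advertised = {"A": 10, "B": 50, "C": 50, "D": 0}
leader = elect(advertised)

# If the elected node fails, its advertisement is withdrawn and the
# remaining nodes must run the election again.
remaining = {n: p for n, p in advertised.items() if n != leader}
new_leader = elect(remaining)
```

In this sketch the re-election is instantaneous; in a network, the remaining nodes must first detect the failure and exchange advertisements, which accounts for the recovery interval referred to above.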
Certain routing protocols specify a given level of node redundancy. This redundancy is intended to reduce the recovery time of the network routing system in the event of a failure that affects a node which performs distinguished protocol functions of the kind mentioned previously. For example, in the OSPF protocol, the use of a Backup Designated Router (BDR) is specified. The Backup Designated Router is mandated to detect a failure affecting the currently appointed Designated Router. Upon detecting such a failure, the Backup Designated Router will be called upon to take recovery action to declare itself the new Designated Router in the place of the failed former Designated Router. All other routers on the affected portion of the shared network will thereafter be notified of the existence of the new Designated Router node. Thus, although it is not necessary to re-execute a dynamic election process under the OSPF protocol following a failure which impacts a Designated Router node, a network routing outage of some duration will nevertheless be experienced by all routers and hosts on the shared network that were originally served by the failed Designated Router node. This is because the affected routers and hosts participate in recovering the functions of the network routing system following a failure which impacts their associated Designated Router node.
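The backup arrangement described above may be sketched as follows. The class SharedSegment, its fields and the router names are hypothetical and serve only to illustrate the sequence of detection, promotion and notification; the sketch assumes a single backup and leaves the backup position vacant after promotion, whereas an actual protocol would go on to fill it.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class SharedSegment:
    """Routers attached to one shared network, with a primary and a backup."""
    routers: List[str]
    dr: str
    bdr: Optional[str]
    notified: List[str] = field(default_factory=list)

    def fail(self, router: str) -> None:
        """Remove a failed router and, if it was the primary, promote the backup."""
        self.routers.remove(router)
        if router == self.dr and self.bdr is not None:
            # The backup declares itself the new primary without an election.
            self.dr, self.bdr = self.bdr, None
            # Every other surviving router must still learn of the change.
            self.notified = [r for r in self.routers if r != self.dr]


segment = SharedSegment(routers=["R1", "R2", "R3", "R4"], dr="R1", bdr="R2")
segment.fail("R1")
```

After the call, segment.dr is "R2" and segment.notified lists the remaining routers. The sketch shows that no election is re-run, and also that each router on the segment is nonetheless involved in the recovery.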
On the other hand, in the PNNI protocol, no provision is currently made for distinguished node redundancy. As such, the distributed election process and its associated protocol actions must be re-executed upon any failure affecting a distinguished network node. In the PNNI protocol, a physical node which performs the Peer Group Leader function at one level of the topology hierarchy may be performing this function at several other levels of the hierarchy. Thus, a failure affecting such a physical node may very well impact a large part of the aggregated network. Furthermore, there is no provision in the current PNNI protocol for a backup Peer Group Leader. Thus, a failure which affects a multilevel Peer Group Leader of the kind described above must be detected by all logical nodes which form part of the various Peer Groups that are represented by the multilevel Peer Group Leader. These logical nodes at different levels of the network hierarchy must thereafter elect a new Peer Group Leader. As with the example given previously in relation to the OSPF protocol, the failure of the Peer Group Leader may be known to many nodes and hence such nodes must generally all participate in recovering the affected functions of the routing system. Given this, the failure of a Peer Group Leader in a PNNI network may conceivably impact a large portion of the network and may in many circumstances cause disruption of the routing behaviour of the network for a period of time which may be unacceptable to service providers or end users.
The discussion above has addressed the impact of a failure affecting a network node which has distinguished responsibilities. However, it will be appreciated by those versed in this art that a failure concerning an ordinary physical or logical node which does not possess distinguished responsibilities will also result in some measure of disruption to the routing capabilities of the neighbouring nodes or devices that are serviced by the failed ordinary node. Although in some node architectures it may be possible to retain certain network functions such as packet forwarding or call processing in the event of a routing function failure, topology state protocols such as OSPF and PNNI require each network node of a domain to synchronize a topology database with its neighbours before being admitted to the routing system. Such topology database synchronization must take place in these network protocols in order to recover from the failure of a node. The synchronization process may consume seconds or minutes in the overall scheme of recovery, depending on the circumstances. During the synchronization, network devices serviced by the failed node will be impacted and hence routing functionality may very well suffer disruption. While the discussion above has focussed on the challenges surrounding recovery from a nodal failure, those skilled in this art will understand that analogous problems arise stemming from other events which would require a node to undertake a synchronization of its topology database, for instance a reset of the routing processor associated with a network node.
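The cost of topology database synchronization may be illustrated by a simplified sketch. It assumes, for illustration only, that each database maps an entry identifier to a sequence number and that a node must request every entry which its neighbour holds in a newer version. The function name entries_to_request and the entry identifiers are hypothetical, and the sketch does not reproduce the exchange procedure of any particular protocol.

```python
from typing import Dict, List


def entries_to_request(mine: Dict[str, int], neighbour: Dict[str, int]) -> List[str]:
    """List the entries a node must fetch to catch up with its neighbour."""
    return sorted(
        entry_id
        for entry_id, seq in neighbour.items()
        # Request entries that are missing locally or older than the neighbour's copy.
        if mine.get(entry_id, -1) < seq
    )


neighbour_db = {"node-A": 7, "node-B": 3, "node-C": 12}

# A node whose routing processor has been reset holds no entries at all
# and must therefore request the entire database.
after_reset = entries_to_request({}, neighbour_db)

# A node that missed a single update requests only that entry.
slightly_stale = entries_to_request({"node-A": 7, "node-B": 2, "node-C": 12}, neighbour_db)
```

Under these assumptions, a node which has lost its database must retrieve every entry before it can rejoin the routing system, whereas a node which has retained its database needs to retrieve only what has changed.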
Certain mechanisms have been developed in the prior art to ensure a switchover between distinct routers in a manner that is transparent to hosts which use a failed router. The Hot Standby Router Protocol described in T. Li, B. Cole, P. Morton and D. Li: “Cisco Hot Standby Router Protocol (HSRP)”, RFC 2281, dated March 1998, and the IP Standby Protocol according to P. Higginson and M. Shand: “Development of Router Clusters to Provide Fast Failover in IP Networks”, 9 Digital Technical Journal, No. 3, dated Winter 1997, are two examples of such transparent router switchover schemes. However, as will be explained in greater detail below, switchover mechanisms of this type do not generally ensure that the switchover will be universally transparent to the routers or nodes in the network beyond the particular hosts or nodes immediately adjacent to the failed node. In the prior art, recovery from the failure of a node is typically effected by means of a distinct and different node. It would therefore be advantageous to provide a mechanism that would allow the failure of a routing component of a node to be recovered by another routing component of the same node in a manner transparent to all nodes but its immediate neighbours.
Accordingly, prior art topology state routing protocols present problems and challenges when faced with a situation of recovery from a nodal failure or with other situations which may require a node to synchronize its topology database once it has previously done so, and these problems and challenges arise whether or not the node immediately affected by the failure has distinguished responsibilities. First, known recovery mechanisms typically disrupt the routing functions of at least a part of a network and cause a service impact to certain of the devices utilizing the network. The portion of the network affected will vary with the circumstances. For instance, the impacted portion of the network can be expected to be more extensive for a node performing distinguished functions than is the case for a node that does not perform such functions. As well, the impacted portion can be expected to be more extensive for a failure concerning a PNNI Peer Group Leader than for one which affects an OSPF Designated Router. Second, the time required to recover from a node or link failure will vary, but may be on the order of several minutes or longer. As mentioned above, this time frame may be unacceptable to certain service providers or end users. Third, since many nodes will have to be made aware of the failure and are therefore required to participate in the recovery process, network resources in the nature of bandwidth and processing time will be diverted. This will detract from other network activities in general and may decrease the performance and stability of the network routing system in particular.
It is therefore generally an object of the present invention to seek to provide a method and apparatus for database re-synchronization in a network having a topology state routing protocol, particularly well-suited to the context of redundancy recovery following a nodal failure associated with the routing entity of a network node, and pursuant to which some of the problems exhibited by alternative prior art techniques and devices may in some instances be alleviated or overcome.