In order to provide robust network services, network service providers often utilize redundant systems to ensure service to their customers. Redundant systems provide alternate equipment or components so that, in the event of a failure in an active system which is currently providing services, there is an alternate inactive system ready to become active and take over, to help ensure as little interruption as possible and to avoid loss of data.
A standard component of network node equipment is a control plane or card which manages or controls the activities of the equipment, and in particular a number of line cards which service the traffic flowing through the equipment. To provide redundancy protection, it is common practice to implement a second control plane to act as a backup in the event that the first control plane fails to operate. The control plane which is working and currently providing services is referred to as the active control plane, while the redundant backup control plane is referred to as the inactive control plane. In order to ensure proper operation of the network equipment upon a redundant switchover, the state of the inactive control plane should be synchronized with that of the active control plane. Most importantly, the active control plane database, which houses important information regarding connections that are being carried by the network node, including endpoint configuration information, should be synchronized with, in this case mirrored to, a redundant database of the inactive control plane. If the information in the inactive control plane database is not in synchronization when a redundant switchover occurs, the connections or endpoint configurations could be dropped.
According to current practice, which is depicted in FIG. 1, network node equipment participating in communications with a network 10 is coupled by a protection switch 110 to the network 10 so that network traffic 30 can be diverted from active equipment to inactive equipment if a switchover is required. In FIG. 1, an active control plane 120 is logically aware 105 of a state of the protection switch 110. When the protection switch 110 is in active mode, network traffic 30 is directed toward active line cards (not shown) under the control of the active control plane 120. An inactive control plane 140 is also logically aware 115 of the state of the protection switch 110, so that if or when a protection switchover occurs it will become active.
The active control plane 120 includes an admin process 126 which administrates various functions on the active control plane 120, including an active control plane synchronization module 124. The synchronization module 124 is responsible for synchronizing an active control plane database 122 (DB1), which stores important data elements or attributes, with an inactive control plane database 142 (DB2) of the inactive control plane 140. Synchronization takes place in cooperation with an inactive control plane synchronization module 144 over a synchronization connection 130. The synchronization connection 130 is typically established as an FTP connection upon a request from the admin process 126 of the active control plane 120 for reconciliation with the inactive control plane 140, although any type of connection which allows for the transport of data from the active control plane database 122 to the inactive control plane database 142 would suffice. The data elements or attributes are sent from the active control plane database 122 to the inactive control plane database 142 in the form of synchronization updates. Typically, these synchronization updates comprise only state information which has changed in the active control plane database 122 and which needs updating in the inactive control plane database 142. The inactive control plane 140 also has an admin process 146 which administrates the synchronization module 144. The admin process 146 checks the database synchronization updates for any errors before writing them to the inactive control plane database 142. The admin process 146 includes a process for initiating a hard reset on an error 148, which responds to any type of error raised during synchronization with a hard reset and a full attempt at re-synchronization with the active control plane 120.
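The synchronization flow described above can be illustrated with a minimal Python sketch. It is not taken from any actual implementation: the class and function names, the dictionary representation of DB1 and DB2, and the validation rule are all assumptions made for illustration, and the transport (for example the FTP connection) is omitted entirely. The sketch shows only the three behaviours named in the text: updates carry changed state only, the inactive side checks each update before writing it, and any error triggers a hard reset followed by a full re-synchronization.

```python
from dataclasses import dataclass, field


class SyncError(Exception):
    """Raised when a synchronization update fails the inactive-side check."""


def changed_entries(db1, last_synced):
    """Build a synchronization update holding only entries of DB1 that
    differ from what was last mirrored (deletions are ignored here)."""
    return {k: v for k, v in db1.items() if last_synced.get(k) != v}


@dataclass
class InactiveControlPlane:
    db2: dict = field(default_factory=dict)  # stands in for DB2
    hard_resets: int = 0

    def check(self, update):
        # Hypothetical validation rule, assumed for illustration only:
        # keys must be non-empty strings and values must not be None.
        for key, value in update.items():
            if not isinstance(key, str) or not key or value is None:
                raise SyncError(f"invalid update for {key!r}")

    def apply(self, update, db1_snapshot):
        """Check an update, then write it; on ANY error, hard reset."""
        try:
            self.check(update)
            self.db2.update(update)
        except SyncError:
            self.hard_reset(db1_snapshot)

    def hard_reset(self, db1_snapshot):
        # Discard the inactive database and mirror DB1 again in full.
        self.hard_resets += 1
        self.db2 = dict(db1_snapshot)


db1 = {"conn/1": "up", "conn/2": "down"}
standby = InactiveControlPlane()
standby.apply(changed_entries(db1, standby.db2), db1)  # initial mirror

db1["conn/2"] = "up"
delta = changed_entries(db1, standby.db2)  # only the changed entry
standby.apply(delta, db1)
```

In this sketch a single malformed update, however small, costs a full reset of the inactive side, which is the behaviour of the hard reset on error process 148 as described.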
Having a hard reset on error directive responsive to an error during synchronization is a preferable resolution in the case that the error is inconsistent or is caused by, for example, a transient hardware failure.
This solution, however, does not address the issue of a software error in an application being run on either control plane, which may cause the admin process 146 to detect that an error has occurred in the synchronization of DB1 122 and DB2 142. In such a situation, a hard reset would not constitute a remedy to the failure. In some cases where the error is consistent and cannot be resolved, the database synchronization process can fail or become trapped in a restart loop in which the inactive control plane 140 never becomes reconciled with the active control plane 120, rendering control redundancy ineffective and leaving the system susceptible to catastrophic failures leading to control complex outages and possible data service outages.
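The difference between a transient error and a consistent error under a reset-on-any-error policy can be shown with a short Python sketch. All names here are hypothetical, the validity check stands in for whatever error the admin process 146 might detect, and the attempt cap exists only so the example terminates; the text describes a loop with no such bound.

```python
def resync_with_hard_reset(db1, is_valid, max_attempts=5):
    """Attempt a full re-synchronization, hard resetting on any error.

    Returns (reconciled, attempts_used). The cap on attempts is an
    assumption added so that this illustration always terminates.
    """
    for attempt in range(1, max_attempts + 1):
        db2 = {}  # a hard reset discards the inactive database
        try:
            for key, value in db1.items():
                if not is_valid(key, value):
                    raise ValueError(key)
                db2[key] = value
            return True, attempt
        except ValueError:
            continue  # any error leads to another reset and retry
    return False, max_attempts


def make_transient_fault():
    """Validity check that rejects 'conn/2' once, then recovers."""
    state = {"failed": False}

    def is_valid(key, value):
        if key == "conn/2" and not state["failed"]:
            state["failed"] = True
            return False
        return True

    return is_valid


def consistent_fault(key, value):
    """Validity check that always rejects 'conn/2' (e.g. a software bug)."""
    return key != "conn/2"


db1 = {"conn/1": "up", "conn/2": "down"}
transient_result = resync_with_hard_reset(db1, make_transient_fault())
consistent_result = resync_with_hard_reset(db1, consistent_fault)
print(transient_result)   # (True, 2)
print(consistent_result)  # (False, 5)
```

With the transient fault, the second attempt reconciles the two databases. With the consistent fault, every attempt fails in the same way, so the inactive side is never reconciled regardless of how many resets occur.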
Existing solutions do not take into account that, in some cases, an error in synchronization is limited in scope or limited in impact within the node and upon the network in general, while at the same time the absence of any type of control plane redundancy would in fact have enormous consequences should the active control plane fail. Currently, any error which indicates a synchronization failure, no matter how minor, is treated as an intolerable error which causes an automatic hard reset.