To achieve high availability of network resources, it is often desirable to configure communication systems with redundant elements that run in synchronization with each other. These redundant network elements, also referred to as “mirrors”, receive the same inputs from outside the system, perform the same processing, and are capable of generating the same outputs. However, typically only one of the mirrors (i.e., “the primary”) has its outputs enabled, while the outputs of the other mirrors (i.e., “the backups”) are suppressed. In the event of a failure of the primary, the outputs of one of the backups may be enabled so that the functions of the primary can be taken over without interruption of network services. This is referred to as a failover.
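The mirror arrangement described above may be illustrated by the following minimal sketch. It is not drawn from any particular system: the names `Mirror`, `broadcast`, and `failover`, the running-total "processing", and the rule of promoting the first surviving backup are all assumptions made for illustration only.

```python
class Mirror:
    """Hypothetical redundant element; its 'processing' is a running total."""

    def __init__(self, name, output_enabled=False):
        self.name = name
        self.output_enabled = output_enabled  # True only for the primary
        self.total = 0                        # toy internal state

    def handle(self, value):
        # Every mirror performs the same processing on the same input.
        self.total += value
        # Only a mirror with outputs enabled emits a result; backups suppress it.
        return self.total if self.output_enabled else None


def broadcast(mirrors, value):
    """Deliver one external input to all mirrors; return the outputs emitted."""
    outputs = [m.handle(value) for m in mirrors]
    return [o for o in outputs if o is not None]


def failover(mirrors, failed):
    """Remove the failed mirror and enable outputs on the first survivor.

    Choosing the first survivor is an assumption of this sketch.
    """
    survivors = [m for m in mirrors if m is not failed]
    survivors[0].output_enabled = True
    return survivors


mirrors = [Mirror("A", output_enabled=True), Mirror("B"), Mirror("C")]
before = broadcast(mirrors, 5) + broadcast(mirrors, 2)
mirrors = failover(mirrors, mirrors[0])
after = broadcast(mirrors, 3)
```

Because the backup has processed every input while its outputs were suppressed, the output it emits after the failover continues from the same state the primary had reached.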
To ensure proper recovery from failures of the primary, it is critical for the backups to remain synchronized with the primary. Therefore, the redundant elements must periodically exchange state information (referred to as “checkpoints”) with the primary. Each network element may use this information to verify that it is still in synchronization with the other elements and, if necessary, to restore synchronization.
Unfortunately, it takes time to create a checkpoint at the primary, transmit it to another element, and process that checkpoint at the destination node. During that time, the receiving node may have continued to receive system inputs and thus its state may no longer match the state recorded in the checkpoint. Reconciling these two states can be a difficult problem, requiring complex and error-prone programming.
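The timing problem described above may be illustrated by the following sketch, which assumes a hypothetical `Node` class whose state is an input count and a running total, and a naive comparison function `naive_in_sync`. None of these names come from an actual system.

```python
class Node:
    """Hypothetical mirror whose state is an input count and a running total."""

    def __init__(self):
        self.inputs_seen = 0
        self.total = 0

    def apply(self, value):
        self.inputs_seen += 1
        self.total += value

    def snapshot(self):
        # A checkpoint is modeled here as a copy of the node's state.
        return {"inputs_seen": self.inputs_seen, "total": self.total}


def naive_in_sync(node, checkpoint):
    """Naive check: compare the receiver's current state to the checkpoint."""
    return node.snapshot() == checkpoint


primary, backup = Node(), Node()
for v in (1, 2, 3):
    primary.apply(v)
    backup.apply(v)

# The primary creates a checkpoint; at this instant the backup matches it.
checkpoint = primary.snapshot()
matches_at_creation = naive_in_sync(backup, checkpoint)

# While the checkpoint is in transit, both nodes keep receiving inputs.
for v in (4, 5):
    primary.apply(v)
    backup.apply(v)

# On arrival the checkpoint is stale, so the naive comparison fails even
# though the backup never actually diverged from the primary.
matches_on_arrival = naive_in_sync(backup, checkpoint)
still_synchronized = primary.snapshot() == backup.snapshot()
```

In this sketch the backup is correctly synchronized throughout, yet the comparison on arrival reports a mismatch, which is the reconciliation difficulty noted above.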
Currently, a number of solutions exist for checkpoint synchronization. One obvious approach is to stop the system from accepting new inputs from the time a checkpoint is generated until the time it has been processed by the other mirrors. Unfortunately, this solution degrades system responsiveness and exposes the system to new failure modes.
Another approach is to include in the checkpoint an index number that monotonically increases each time a mirror receives and processes new inputs. When a mirror receives a checkpoint, it compares the index number contained in the checkpoint with the current value of its local index number. If the mirror's local value is higher, it discards the checkpoint as obsolete. The main shortcoming of this approach is that, in the common case where system inputs arrive continuously, most if not all of the checkpoints will have to be discarded.
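The index-number approach and its shortcoming may be sketched as follows. The names `Checkpoint`, `IndexedMirror`, and `run` are hypothetical, and the handling of an accepted checkpoint (adopting its state and index) is an assumption of this sketch; the text above specifies only the rule for discarding.

```python
from dataclasses import dataclass


@dataclass
class Checkpoint:
    index: int  # sender's index number when the checkpoint was created
    state: int  # sender's state at that moment (a running total here)


class IndexedMirror:
    """Hypothetical mirror that increments an index for each processed input."""

    def __init__(self):
        self.index = 0
        self.state = 0
        self.accepted = 0
        self.discarded = 0

    def apply_input(self, value):
        self.state += value
        self.index += 1  # monotonically increasing index number

    def make_checkpoint(self):
        return Checkpoint(self.index, self.state)

    def receive_checkpoint(self, cp):
        # A higher local index means the checkpoint is obsolete: discard it.
        if self.index > cp.index:
            self.discarded += 1
            return False
        # Assumption of this sketch: an accepted checkpoint is adopted as-is.
        self.state, self.index = cp.state, cp.index
        self.accepted += 1
        return True


def run(rounds, inputs_in_transit):
    """Send one checkpoint per round while inputs arrive during transit."""
    primary, backup = IndexedMirror(), IndexedMirror()
    for _ in range(rounds):
        cp = primary.make_checkpoint()
        for _ in range(inputs_in_transit):
            primary.apply_input(1)
            backup.apply_input(1)
        backup.receive_checkpoint(cp)
    return backup.accepted, backup.discarded


busy = run(rounds=5, inputs_in_transit=1)  # inputs arrive continuously
idle = run(rounds=5, inputs_in_transit=0)  # no inputs during transit
```

With even a single input arriving during each transit interval, every checkpoint in this sketch is discarded; checkpoints are accepted only when the system is idle.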
One brute-force solution is to program the checkpoint comparison such that state changes occurring at the receiving node after the checkpoint was generated are factored out and ignored. However, this approach requires extremely careful design and implementation and is susceptible to subtle bugs that testing may not detect. In addition, it may be impossible to implement for certain payload applications.
In view of the foregoing, it would be desirable to provide a solution for checkpoint synchronization which overcomes the above-described inadequacies and shortcomings. More particularly, it would be desirable to provide a technique for synchronizing redundant network elements in an efficient and cost-effective manner.