With the development of cost-effective data communications network infrastructures, such as IP-based data networks, it is increasingly common for such infrastructures to support mission-critical data processing applications. So-called high availability computing systems, originally developed for applications such as military systems, aircraft navigation, and telephone central offices, are now required in new deployments of these data communications networks. High availability is commonly achieved with redundant components, such as redundant processors, where a failure results in a fail-over to a redundant component. High availability can also be achieved by rapidly recovering the failed components. Fast recovery of routing information in failure scenarios is important in network systems because regenerating this information in large and complex networks takes a relatively long time.
The most stringent requirements for high availability demand continuous service with absolutely no loss of application state. Systems built to meet these requirements attempt to maintain a log of all transactions and their history; they are considered the domain of so-called fault-tolerant computing. These computers often carry redundancy to an extreme level, with redundant power supplies, hardened storage subsystems, hardware subjected to stringent Mean Time Between Failure (MTBF) testing, and the like. Continuous availability, both during equipment failure and during the subsequent return to service of repaired equipment, comes with a significant price and performance penalty.
High availability computing as presently practiced attempts to utilize the resources of redundant architectures. This approach can provide the redundancy needed for system components such as a networking device (for example, a router, switch, or bridge) that is expected to serve a mission-critical role, such as assuring that the connections of many computers to the Internet are maintained. However, the class of errors typically detected in such systems is less comprehensive, and the time to recover from such errors is typically much longer, than in a true fault-tolerant machine architecture. As a result, these architectures, although they may recover from faults only after tens or hundreds of seconds, can often be deployed at much less than the cost of traditional fault-tolerant computing systems.
The most common configuration is a so-called dual redundant architecture in which two data processing systems are deployed in active/standby or master/non-master states. Fail-over processes, implemented in hardware and/or software, can be triggered by hardware or software detectors to cause an active or master process to be transferred to the redundant system without operator intervention. Such application program fail-over typically requires that applications be restarted from the beginning, however, with the loss of all processing state not already committed to a secondary storage device such as a disk.
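The fail-over behavior just described can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the `Node` and `RedundantPair` classes, the `healthy` flag standing in for a hardware or software detector, and the dictionary standing in for a disk are hypothetical and do not describe any particular product.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """One data processing system in a dual redundant pair (hypothetical model)."""
    name: str
    role: str  # "active", "standby", or "failed"
    healthy: bool = True
    memory_state: dict = field(default_factory=dict)  # volatile; lost on failure


class RedundantPair:
    def __init__(self):
        self.nodes = [Node("A", "active"), Node("B", "standby")]
        self.disk = {}  # stand-in for secondary storage that survives a failure

    @property
    def active(self):
        return next(n for n in self.nodes if n.role == "active")

    def update(self, key, value, commit=False):
        """Change application state; only committed updates reach the disk."""
        self.active.memory_state[key] = value
        if commit:
            self.disk[key] = value

    def detect_and_fail_over(self):
        """Stand-in detector: promote the standby if the active node is unhealthy."""
        failed = self.active
        if failed.healthy:
            return False
        standby = next(
            (n for n in self.nodes if n.role == "standby" and n.healthy), None
        )
        if standby is None:
            return False  # no redundant system left to take over
        failed.role = "failed"
        failed.memory_state = {}
        standby.role = "active"
        # The application restarts from the beginning on the new node and can
        # recover only what was already committed to secondary storage.
        standby.memory_state = dict(self.disk)
        return True


if __name__ == "__main__":
    pair = RedundantPair()
    pair.update("route-1", "via X", commit=True)  # committed to disk
    pair.update("route-2", "via Y")               # held in memory only
    pair.active.healthy = False                   # simulate a failure
    pair.detect_and_fail_over()
    print(pair.active.name, pair.active.memory_state)
```

In this sketch the uncommitted entry `route-2` is gone after the fail-over, which mirrors the loss of processing state noted above.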
In an application such as a networking device, restarting the application on a functional processing node typically involves reassigning, for example, the network addresses of the failed machine to the new processor, as well as rebuilding critical information such as routing tables. The transfer of network address and connection information can typically be handled quite easily and without complication.
As the size and complexity of data networks increase, a router located deep within a network may have received its state information and constructed its routing table over the course of time. If router table state information is lost, it can be cumbersome and time-consuming to restart a router process and rebuild a router table. The information can only be restored by sending a long series of query and advertisement commands through routing protocols, such as an Interior Gateway Protocol (IGP) or an Exterior Gateway Protocol such as BGP-4. Upon restart, it may thus take many seconds or even minutes for router protocols to completely rebuild such tables.
Even more severe situations can occur where the rebuilding of the router table is not completed before real-time topology changes occur in the surrounding network. In such instances, the protocols may continuously reset themselves, ultimately creating a race condition in which the process for rebuilding the router table never completes without some sort of manual intervention.
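The race between table rebuilding and topology changes can be illustrated with a toy simulation. This is a sketch under simplifying assumptions, not a model of any real routing protocol: time advances in discrete ticks, a fixed number of entries is relearned per tick, and every topology change discards all partial progress. The function name and its parameters are hypothetical.

```python
def rebuild_routing_table(num_routes, routes_per_tick, change_ticks, max_ticks):
    """Return the tick at which the rebuild completes, or None if it never does.

    num_routes      -- entries that must be relearned after a restart
    routes_per_tick -- entries learned from advertisements in one tick
    change_ticks    -- ticks at which a topology change forces a reset (assumed)
    max_ticks       -- how long to simulate before giving up
    """
    learned = 0
    for tick in range(1, max_ticks + 1):
        if tick in change_ticks:
            learned = 0  # protocol resets; the partial table is discarded
            continue
        learned = min(num_routes, learned + routes_per_tick)
        if learned == num_routes:
            return tick
    return None


if __name__ == "__main__":
    # Quiet network: 1000 routes at 100 per tick.
    print(rebuild_routing_table(1000, 100, set(), 1000))
    # A topology change every 8 ticks arrives before each 10-tick rebuild ends.
    print(rebuild_routing_table(1000, 100, set(range(8, 1000, 8)), 1000))
```

With these illustrative numbers, the first call finishes at tick 10, while the second returns `None` because each change arrives before the rebuild can finish, which is the non-terminating case described above.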
It is therefore desirable for such systems to adopt certain high availability architectures, such as dual or backup power supplies, dual and separate system processor cards, and live insertion or “hot swap” capabilities that support replacement of failed components without shutting down the entire system.