Fault tolerant systems support computer architectures that experience only a few minutes of downtime per year. One way to achieve extended computing uptime is to use a redundant computing system of two computers. One computer, the active computer, performs the calculations while the second computer, the standby computer, remains idle, ready to resume the calculations if the active computer fails. In these systems, information about the state of the active computer and its data memory must be saved periodically to the standby computer so that the standby computer can take over substantially at the point in the calculations where the active computer failed.
To synchronize the state of operations of the two computers, checkpointing is used. In checkpointing, the active computer halts, either periodically or in response to a specific event, and sends data about its current state to the standby computer. During the checkpointing operation, the active computer is halted and performs no useful calculations. The checkpointing interval must therefore be kept to a minimum while still providing sufficient time for the requisite checkpoint operations to take place. Because of the nature of checkpoint data, the data must be complete and in the correct order on the standby computer when it is acted upon or committed. In addition, various applications in the telecommunications industry impose high bandwidth requirements, which can further negatively impact checkpointing and certain computing models.
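The checkpointing cycle described above can be illustrated with a minimal Python sketch. It is an illustration under stated assumptions, not the method of any particular system: the class names `Active` and `Standby`, the function `run_with_failover`, the per-chunk sequence numbers, and the rule that an incomplete checkpoint is discarded are all hypothetical choices made for this example.

```python
from dataclasses import dataclass, field


@dataclass
class Standby:
    """Hypothetical standby computer: buffers checkpoint chunks, commits only complete sets."""
    committed_state: dict = field(default_factory=dict)
    _pending: dict = field(default_factory=dict)  # sequence number -> (key, value)

    def receive(self, seq: int, key: str, value: int) -> None:
        # Chunks may arrive in any order; hold them until commit time.
        self._pending[seq] = (key, value)

    def commit(self, expected_chunks: int) -> bool:
        # Commit only if every chunk 0..expected_chunks-1 has arrived.
        if set(self._pending) != set(range(expected_chunks)):
            self._pending.clear()  # assumption: discard an incomplete checkpoint
            return False
        for seq in sorted(self._pending):  # apply in the correct order
            key, value = self._pending[seq]
            self.committed_state[key] = value
        self._pending.clear()
        return True


@dataclass
class Active:
    """Hypothetical active computer running a toy calculation."""
    state: dict = field(default_factory=lambda: {"counter": 0, "total": 0})

    def step(self) -> None:
        # One unit of useful work.
        self.state["counter"] += 1
        self.state["total"] += self.state["counter"]

    def checkpoint(self, standby: Standby) -> bool:
        # The active computer is halted here: no step() runs during the transfer.
        chunks = list(enumerate(self.state.items()))
        for seq, (key, value) in reversed(chunks):  # simulate out-of-order delivery
            standby.receive(seq, key, value)
        return standby.commit(len(chunks))


def run_with_failover(total_steps: int, checkpoint_every: int, fail_at: int) -> Active:
    """Run the calculation; at step fail_at the active computer fails and the
    standby takes over from the last committed checkpoint."""
    active, standby = Active(), Standby()
    for i in range(1, total_steps + 1):
        if i == fail_at:
            # Work done since the last checkpoint is lost.
            return Active(state=dict(standby.committed_state))
        active.step()
        if i % checkpoint_every == 0:
            active.checkpoint(standby)
    return active
```

With `run_with_failover(10, 3, 8)`, checkpoints are committed after steps 3 and 6, so the computer that takes over resumes from the state saved at step 6 and the work of step 7 is lost. A shorter checkpoint interval loses less work on failure but halts the active computer more often, which is the trade-off the paragraph above describes.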
A telecommunications environment includes network devices that act as routers, firewalls, and other devices providing various network functionality. Typically, each of these network devices comprises expensive specialty hardware that provides one or more of the network functions. As additional functionality is required, additional pieces of hardware are installed in the network. Such a hardware-intensive system is expensive to construct and maintain. In general, implementing a fault tolerant system with checkpointing in a telecom environment has not been feasible. This is because the network gating associated with various traditional checkpointing solutions introduces significant network latency. Given the high rate of packet transfers in a telecommunications network, checkpointing in a fault tolerant system typically adds too much latency to be practical. Replicating network data in a telecommunications environment has likewise been unworkable because of its impact on performance: replicating network data as part of a checkpointing process requires significant network and computing resources.
The invention addresses this need and others relating to implementing fault tolerance solutions using checkpointing in various networking environments.