The present invention relates to computers and, more particularly, to high-availability computer clusters (i.e., networks of computers that collaborate to minimize interruptions due to component failures). A major objective of the invention is to reduce the downtime associated with migrating a RAM-intensive application from a failed computer to an adoptive computer in a high-availability cluster.
Modern society has been revolutionized by the increasing prevalence of computers. As computers have occupied increasingly central roles, their continued operation has become increasingly critical. For example, the cost of lengthy downtime for an on-line retailer during a peak shopping period can be unacceptable.
Fault-tolerant computer systems have been developed that can continue running without any loss of data despite certain failure scenarios. However, fault-tolerant systems can be quite expensive. High-availability computer clusters have been developed as a more affordable alternative to fault-tolerant systems for minimizing downtime of mission-critical applications. For example, in a two-computer cluster with an independent hard disk system, data generated by an application running on a first computer can be stored on the hard disk system so that it is accessible by both computers. If the first computer fails, a dormant copy of the application pre-installed on the second computer can be launched and can access the data on the hard disk system.
Unlike fault-tolerant systems, high-availability systems can suffer data loss; e.g., data that is held in volatile random-access memory or in processor registers and has not been saved to disk is typically lost when the host computer fails. However, if the initial data is stored on the hard disk system, the second instance of the application can recalculate the lost data. For many applications, the additional delay involved in recalculating lost data is small and acceptable.
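By way of illustration only, the restart-and-recalculate behavior described above can be sketched as follows. The sketch is a simplified model under stated assumptions, not a description of any particular cluster product: a temporary directory stands in for the shared hard disk system, and the names `run_application`, `failover_demo`, `initial.json`, and `fail_after` are hypothetical.

```python
import json
import tempfile
from pathlib import Path


def run_application(shared_disk, fail_after=None):
    """Compute a result from the initial data held on the shared disk.

    The running total is kept only in local memory, so it is lost if the
    (hypothetical) host fails partway through; fail_after simulates this.
    """
    initial = json.loads((shared_disk / "initial.json").read_text())
    total = 0  # volatile state: never written to the shared disk
    for step, value in enumerate(initial["values"]):
        if fail_after is not None and step == fail_after:
            raise RuntimeError("host failed; in-memory state lost")
        total += value * value
    return total


def failover_demo():
    """Run on a first 'computer', fail, then recalculate on a second."""
    with tempfile.TemporaryDirectory() as tmp:
        shared_disk = Path(tmp)  # assumption: stands in for the shared disk
        (shared_disk / "initial.json").write_text(
            json.dumps({"values": [1, 2, 3, 4]})
        )
        try:
            # First computer: fails just before processing the last value.
            return run_application(shared_disk, fail_after=3)
        except RuntimeError:
            # Second computer: launches its dormant copy and starts over
            # from the initial data, repeating all of the earlier work.
            return run_application(shared_disk)
```

In this model the second instance produces the correct result, but only by repeating every step the first instance had already completed; the cost of a failure therefore grows with the length of the calculation.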
However, some applications, such as supply-control-software (SCS), run complex calculations in RAM for days. If a computer fails near the end of such a calculation, days are lost while a second instance of the program recalculates from initial conditions. Such programs often have a “save” function that allows the state of the program to be saved. However, due to the large amount of memory involved, e.g., eight gigabytes, each save operation can delay calculations considerably, e.g., by an hour. A delay of this magnitude makes users reluctant to use the save function and thus exposes them to major delays in the event of a computer failure. What is needed is a cluster system that reduces the amount of recalculation required upon a computer failure, without unduly delaying execution of the application.
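The trade-off between save overhead and recalculation can be made concrete with a rough calculation. The following sketch is illustrative only and rests on assumptions not stated above: saves are evenly spaced in calculation time, calculation pauses for the full duration of each save, and the time to reload a saved state is ignored. The function name `save_tradeoff` and its parameters are hypothetical.

```python
def save_tradeoff(run_hours, save_hours, num_saves):
    """Return (save_overhead_hours, worst_case_recalc_hours).

    Assumptions: saves are evenly spaced in calculation time, the
    calculation pauses during each save, and reload time is ignored.
    """
    if num_saves < 0:
        raise ValueError("num_saves must be non-negative")
    # Total time added to the run by pausing for each save.
    overhead = save_hours * num_saves
    # n evenly spaced saves split the run into n + 1 segments; a failure
    # just before the next save (or the end) loses one full segment.
    worst_case_recalc = run_hours / (num_saves + 1)
    return overhead, worst_case_recalc
```

For example, taking a three-day (72-hour) calculation as an assumed figure together with the one-hour save mentioned above, `save_tradeoff(72, 1, 0)` gives no overhead but up to 72 hours of recalculation, whereas `save_tradeoff(72, 1, 5)` gives 5 hours of overhead and up to 12 hours of recalculation. Under these assumptions, every reduction in exposure is paid for in hours of stalled calculation, which is the dilemma described above.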