1. Field of the Invention
The present invention relates generally to an improved data processing system, and in particular, to improving failure tolerance in data processing systems. Still more particularly, the present invention relates to a system and computer usable program code for recovery in a shared memory environment.
2. Description of the Related Art
When a failure occurs in a data processing system, it is desirable to reinitiate the data processing system from a known time of operation in the past. As a part of reinitiating the data processing system, data, processes, application statuses, and other information are restored to the known time in the past, and the system operation is recovered from that point in time. The known time is called a checkpoint. In other words, a checkpoint is a view of the data, processes, application statuses, and information in a data processing system at some time in the past.
In order to be able to accomplish a recovery operation from a checkpoint, the data, states, and other information existing in the data processing system at the checkpoint are saved from a memory to a highly available data storage system that can withstand failures, herein called stable storage. Such data, states, and other information at a checkpoint are collectively called checkpoint data.
Typically, checkpoint data is collected and saved at a number of checkpoints as a data processing system continues to operate. In case of a data processing system failure, a user or the system restores the data processing system operation from the most recently saved checkpoint by repopulating the data processing system with the checkpoint data.
A user or the system may determine how often the checkpoints occur during a data processing system's operation. When a new checkpoint is successfully saved, previous checkpoints may be purged to reduce the space needed on stable storage.
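As an illustration only, the checkpoint cycle described above (saving checkpoint data to stable storage, purging previous checkpoints after a successful save, and restoring from the most recently saved checkpoint) might be sketched as follows. The class name CheckpointStore, the file naming scheme, the JSON encoding, and the use of a local directory standing in for stable storage are all assumptions made for this sketch and are not taken from any particular system.

```python
import json
import os
import tempfile


class CheckpointStore:
    """Hypothetical helper: a local directory stands in for stable storage."""

    def __init__(self, directory):
        self.directory = directory
        os.makedirs(directory, exist_ok=True)

    def _path(self, seq):
        return os.path.join(self.directory, f"checkpoint-{seq:08d}.json")

    def _sequence_numbers(self):
        # Collect the sequence numbers of all fully saved checkpoints.
        prefix, suffix = "checkpoint-", ".json"
        seqs = []
        for name in os.listdir(self.directory):
            if name.startswith(prefix) and name.endswith(suffix):
                seqs.append(int(name[len(prefix):-len(suffix)]))
        return sorted(seqs)

    def save(self, state):
        """Save checkpoint data, then purge previous checkpoints."""
        previous = self._sequence_numbers()
        seq = previous[-1] + 1 if previous else 1
        # Write to a temporary file and rename it, so that a failure during
        # the save never leaves a partially written checkpoint in place.
        fd, tmp_path = tempfile.mkstemp(dir=self.directory, suffix=".tmp")
        with os.fdopen(fd, "w") as f:
            json.dump(state, f)
            f.flush()
            os.fsync(f.fileno())
        os.replace(tmp_path, self._path(seq))
        # Purge only after the new checkpoint has been saved successfully.
        for old in previous:
            os.remove(self._path(old))
        return seq

    def restore_latest(self):
        """Return the most recently saved checkpoint data, or None."""
        seqs = self._sequence_numbers()
        if not seqs:
            return None
        with open(self._path(seqs[-1])) as f:
            return json.load(f)
```

In this sketch, a caller would invoke save() at each checkpoint with whatever state is to be preserved, and invoke restore_latest() after a failure to repopulate the system. Purging strictly after the rename is a design choice of the sketch, so that at least one complete checkpoint exists at all times.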
An inverse relationship exists between the frequency of taking checkpoints and the amount of rework a data processing system has to perform to compute again up to the point at which the failure occurred. The less frequently checkpoints are taken, the higher the likelihood that the most recent checkpoint is farther back in the past from the point of failure, and the more rework the data processing system has to perform to re-compute up to the time the failure occurred. The more frequently checkpoints are taken, the higher the likelihood that the most recent checkpoint is closer to the time of failure, and the less work and fewer resources have to be expended to restore operation and recover the data processing system to the time of failure.
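The relationship between checkpoint frequency and rework can be made concrete with a small calculation. The following sketch rests on simplifying assumptions that are not part of the description above: checkpoints are taken at a fixed interval starting at time 0, a failure is equally likely at each whole time unit within the observed horizon, and the cost of taking a checkpoint is ignored. The function names are hypothetical.

```python
def rework_after_failure(failure_time, checkpoint_interval):
    """Time units of work lost: the time elapsed between the most recent
    checkpoint and the failure, assuming checkpoints at 0, T, 2T, ..."""
    if checkpoint_interval <= 0:
        raise ValueError("checkpoint_interval must be positive")
    return failure_time % checkpoint_interval


def average_rework(checkpoint_interval, horizon):
    """Average rework over failures at every whole time unit in [0, horizon)."""
    total = sum(
        rework_after_failure(t, checkpoint_interval) for t in range(horizon)
    )
    return total / horizon


# Frequent checkpoints (every 10 time units) versus infrequent ones (every 50).
print(average_rework(10, 100))  # 4.5
print(average_rework(50, 100))  # 24.5
```

Under these assumptions, the average rework is roughly half the checkpoint interval, so checkpointing five times as often reduces the expected rework by about a factor of five.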