As organizations grow increasingly dependent on electronically stored data, it has become necessary to devise backup methods and recovery procedures for every situation in which data may be lost. One class of recovery procedures is the disaster recovery procedure, whose purpose is to perform a recovery after a total destruction of the computing facility. If no special measures are taken, the only recovery possible in this case is a reconstruction of the data at another facility from a backup which is kept at a remote location and which thus survives the disaster.
One possible disaster recovery strategy is known as mirroring. This involves the continuous maintenance of a mirror copy of the data at a remote computing site. When the local facility is destroyed, the remote site takes over using the mirrored data.
However, for the reasons explained below, the mirroring strategy is not appropriate for systems, such as database management systems, in which data sets are interdependent, so that a change in one data set requires a corresponding change or changes in other data sets to keep the data consistent.
A file system is said to be consistent if it represents a state of its data sets after a series of complete logical updates, or transactions, has been applied. When a system failure occurs, the file system is normally in an inconsistent state because some updates have not been completed. It is up to the recovery procedure to bring the file system back to a consistent state. A good recovery procedure will bring the file system to its closest consistent state, that is, a state which reflects all transactions except those which were disrupted at the time of failure. More generally, the closeness of a recovered file system to its copy before failure can be measured as the number of complete transactions required to bring it to its closest consistent state.
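The closeness measure just defined can be illustrated with a minimal sketch. The function name and the representation of transactions as string identifiers are assumptions made for illustration only; they do not come from any particular database management system.

```python
def distance_from_closest_state(completed_before_failure, reflected_after_recovery):
    """Count the complete transactions missing from the recovered file system.

    Both arguments are iterables of hypothetical transaction identifiers.
    A result of 0 means the recovered file system is in its closest
    consistent state.
    """
    reflected = set(reflected_after_recovery)
    return sum(1 for txn in completed_before_failure if txn not in reflected)


# Three transactions completed before the failure.
completed = ["t1", "t2", "t3"]

# Recovery that reflects all of them reaches the closest consistent state.
closest = distance_from_closest_state(completed, ["t1", "t2", "t3"])

# Recovery that reflects only t1 is two transactions more "distant".
distant = distance_from_closest_state(completed, ["t1"])
```

Transactions that were disrupted at the time of failure are deliberately left out of `completed`, since the closest consistent state does not reflect them.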
A data set is said to be insensitive to failures if, after any system failure apart from a crash of the device on which it resides, it remains consistent. This type of data set cannot corrupt the consistency of the file system to which it belongs. Most of the sequential files maintained by the operating systems TSO and CMS belong to this category.
A data set is said to be sensitive to failures if there is a possibility that, upon system failure, it will become inconsistent or will cause the file system to become inconsistent. Most database files belong to this category.
File systems generally consist of the following three types of data sets:
1. Database application data sets. These data sets are mainly used to hold the application information in a database environment. Data sets belonging to this class are sensitive to failures.
2. The database log data sets. These data sets hold data, generated by the database management system, which is intended to aid the recovery procedure in bringing the sensitive database data sets back to a consistent state. Data sets belonging to this class are insensitive to failures.
3. Simple data sets. These are the non-database files. These data sets are also insensitive to failures.
Most conventional database management systems use, in one way or another, a single insensitive file to assist in the recovery of sensitive data sets in the event of, say, a power failure which does not result in the destruction of the storage devices on which the data is stored.
On the face of it, the mirroring strategy guarantees that no data is lost. However, in practice, there are severe problems with implementing this strategy. The main problem is that, due to communication delays on the link between the two sites, updates at the remote site do not occur simultaneously with updates at the local site.
Thus, when a disaster occurs, the mirrored volumes will be in an unknown state, since the updates still in transit on the link will have been lost. Some complete transactions will be missing from the mirrored disks, and some transactions will be only partially completed, leaving the file system in an inconsistent state.
While the case where a small number of complete transactions are missing may sometimes be tolerated, being left with an inconsistent file system is totally unacceptable. Unless some very complicated measures, such as imposing some order on updates at the remote site, are taken, it is almost impossible to bring such a file system back to a consistent state after a disaster occurs.
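The two outcomes described above, a missing complete transaction and a partially completed one, can be made concrete with a small simulation. It is only a sketch under stated assumptions: the data set names, the transfer-style transactions, and the invariant that the two balances always sum to 200 are all hypothetical and chosen purely for illustration.

```python
INITIAL = {"data_set_a": 100, "data_set_b": 100}

# Two hypothetical transactions, each consisting of two interdependent writes.
# Every complete transaction preserves the invariant a + b == 200.
WRITES = [
    ("data_set_a", -10), ("data_set_b", +10),   # transaction 1
    ("data_set_a", -5), ("data_set_b", +5),     # transaction 2
]


def mirror_after_disaster(delivered):
    """State of the remote mirror if only the first `delivered` writes
    crossed the link before the local site was destroyed."""
    state = dict(INITIAL)
    for data_set, delta in WRITES[:delivered]:
        state[data_set] += delta
    return state


def is_consistent(state):
    """Check the assumed invariant linking the two data sets."""
    return state["data_set_a"] + state["data_set_b"] == 200


# Only transaction 1 arrived: consistent, but one complete transaction is missing.
missing_one = mirror_after_disaster(2)

# Transaction 2 arrived only in part: the mirror is inconsistent.
torn = mirror_after_disaster(3)
```

In this sketch `missing_one` corresponds to the tolerable case and `torn` to the unacceptable one. Nothing in the mirrored data itself reveals which writes belong together, which is why repairing the second case is so difficult.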
For these reasons, to ensure that the mirrored copy of a complex file system at the remote site is always recoverable, it is necessary to delay the confirmation of correct completion of every write operation at the local site until the data has been safely written to the remote site. Such a delay seriously impairs the response time to updates.
Another approach to disaster recovery, which can be used with log-based systems, is known as checkpointing. This involves the storage of some initial state of the database and the continuous updating of the log at the remote site. When a disaster occurs, the entire database may be reconstructed at the remote site from this initial state and the log.
Checkpointing does not require a delayed confirmation for the writes to the local log, because the database itself is not continuously updated at the remote site. The closeness of the recovered file system depends on the state of the remote log at the time of the crash. If the log is up to date, the recovered file system will be in its closest consistent state; otherwise it will be in some other, more "distant", consistent state. If confirmation of writes to the log at the local site is delayed until the remote site confirms that the remote log is correctly updated, then the recovery will always be to the closest consistent state. However, recovery procedures based on this strategy are very inefficient, since they normally take a long time to reconstruct a file system from its log.
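Reconstruction from an initial state and a log can be sketched as follows. The log record layout (`"write"` and `"commit"` tuples), the function name, and the sample data are assumptions made for illustration; real database logs differ in format and detail. The sketch applies only the writes of transactions whose commit record reached the remote log, so the result is always consistent, and its closeness depends on how much of the log arrived.

```python
def recover_from_checkpoint(checkpoint, remote_log):
    """Rebuild the database from a stored initial state and the remote log.

    Hypothetical record formats:
      ("write", txn_id, data_set, new_value)
      ("commit", txn_id)
    """
    committed = {rec[1] for rec in remote_log if rec[0] == "commit"}
    state = dict(checkpoint)
    for rec in remote_log:
        # Skip writes of transactions whose commit never reached the log.
        if rec[0] == "write" and rec[1] in committed:
            _, _txn_id, data_set, new_value = rec
            state[data_set] = new_value
    return state


checkpoint = {"data_set_a": 100, "data_set_b": 100}
local_log = [
    ("write", "t1", "data_set_a", 90), ("write", "t1", "data_set_b", 110),
    ("commit", "t1"),
    ("write", "t2", "data_set_a", 85), ("write", "t2", "data_set_b", 115),
    ("commit", "t2"),
]

# Remote log is up to date: recovery reaches the closest consistent state.
up_to_date = recover_from_checkpoint(checkpoint, local_log)

# Remote log lags by two records: recovery reaches a more distant
# consistent state that reflects t1 but not t2.
lagging = recover_from_checkpoint(checkpoint, local_log[:4])
```

The replay loop visits every log record since the checkpoint, which suggests why reconstruction time grows with the length of the log.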