In data processing systems, the failures that can occur include communication failures (in online systems), data set or database failures, application or system program failures, processor failures and power supply failures. All these problems are potentially more severe in an online system than in a system that performs only batch processing.
In batch systems, input data is usually prepared before processing begins, and jobs can be rerun, either from the start of the job or from some intermediate checkpoint. In online systems, input is usually created dynamically by terminal operators, and arrives in an unpredictable sequence from many different sources. If a failure occurs, it is generally not possible simply to rerun the application because the content and sequence of the input data are unknown. Even if they are known, it is usually impractical for operators to reenter a day's work.
Online applications therefore require a system with special mechanisms for recovery and restart that batch systems do not require. These mechanisms ensure that each data set (resource) associated with an interrupted online application returns to a known state so that processing can restart safely.
An online system requires mechanisms that, together with suitable operating procedures, provide automatic recovery from failures and allow the system to restart with the minimum of disruption.
The two main recovery requirements of an online system are to maintain the integrity of data and to minimize the effect of failures.
Maintaining the integrity of the data means that the data is in the form you expect and has not been corrupted. The object of recovery operations on data sets, databases, and similar data resources is to maintain, and restore, the integrity of the information. Ideally, it should be possible to restore the data to a consistent, known state following any type of failure, with minimal loss of previously valid updates.
One way of doing this is to keep a record, or log, of all the changes made to a resource while the system is executing normally. If a failure occurs, the logged information can help recover the data.
The information can be used in two ways:
1. It can be used to back out incomplete or invalid changes to one or more resources. This is called backward recovery, or backout. For backout, it is necessary to record the contents of a data element before it is changed. These records are called before images. In general, backout is applicable to processing failures that prevent one or more transactions (or a batch program) from completing.
2. It can be used to reconstruct changes to a resource, starting with a backup copy of the resource taken earlier. This is called forward recovery. For forward recovery, it is necessary to record the contents of a data element after it is changed. These records are called after images. In general, forward recovery is applicable to data set failures, or failures in similar data resources, that cause data to become unusable because it has been corrupted or because the physical storage medium has been damaged.
In many cases, a data set failure also causes a processing failure. Then, forward recovery must be followed by backward recovery.
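The two uses of the log described above can be illustrated with a minimal sketch. It assumes an in-memory dictionary stands in for the data set, and the names `LogRecord`, `apply_update`, `back_out` and `forward_recover` are hypothetical; it is not the implementation of any product named in this document.

```python
from dataclasses import dataclass
from typing import Any, Dict, List, Optional


@dataclass
class LogRecord:
    uow: str                # unit of work (transaction) that made the change
    key: str                # data element that was changed
    before: Optional[Any]   # before image (None: element did not exist)
    after: Optional[Any]    # after image (None: element was deleted)


def apply_update(data: Dict[str, Any], log: List[LogRecord],
                 uow: str, key: str, value: Optional[Any]) -> None:
    """Log the before and after images, then change the data element."""
    log.append(LogRecord(uow, key, data.get(key), value))
    if value is None:
        data.pop(key, None)
    else:
        data[key] = value


def back_out(data: Dict[str, Any], log: List[LogRecord], uow: str) -> None:
    """Backward recovery: restore before images of one unit of work, newest first."""
    for rec in reversed(log):
        if rec.uow != uow:
            continue
        if rec.before is None:
            data.pop(rec.key, None)
        else:
            data[rec.key] = rec.before


def forward_recover(backup: Dict[str, Any], log: List[LogRecord]) -> Dict[str, Any]:
    """Forward recovery: start from a backup copy and reapply after images in log order."""
    data = dict(backup)
    for rec in log:
        if rec.after is None:
            data.pop(rec.key, None)
        else:
            data[rec.key] = rec.after
    return data


# Hypothetical scenario: T1 completed, T2 was still in flight when the data set failed.
backup = {"A": 1}
data = dict(backup)
log: List[LogRecord] = []
apply_update(data, log, "T1", "A", 2)
apply_update(data, log, "T2", "B", 5)

restored = forward_recover(backup, log)   # rebuild the data set from the backup
back_out(restored, log, "T2")             # then back out the incomplete unit of work
```

The last two lines follow the order given above: forward recovery first, then backward recovery for the unit of work that did not complete. Using `None` to mean "absent" is a simplification of this sketch, so `None` cannot itself be stored as a value.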
In some environments, a data set might need to remain online and open for update for extended periods. Normally, a backup copy of the data set cannot be taken while the data set is open. Thus, if a failure occurs that requires forward recovery, all updates that have been made to the data set since it was opened must be recovered. This means that all forward recovery logs that have been produced since the data set was opened must be kept. For a heavily used data set that has been open for update for several days or weeks, a large amount of forward recovery could be needed.
Because of the above considerations, it was desirable to extend the methods for taking backups so as to allow a backup to be taken whilst a data set is open. This operation is known as backup while open (BWO). Any method which is used for taking a backup whilst the data set is open must be able to deal with the additional complications described in the following paragraphs.
European Patent Application EP 0516900 discloses a method that does deal with these complications, where there is only a single updater. Such a method is implemented by a combination of the CICS, VSAM, CICS VSAM Recovery MVS/ESA and DSS products from IBM Corporation (IBM is a registered trademark and CICS is a trademark of IBM Corp). The method disclosed calculates a recovery time by referring to a block of storage associated with each Unit of Work that stores the time of the first log entry associated with that Unit of Work. All of the blocks of storage are addressable by the single updater.
Data sets are updated by taking a copy of a part, such as a record, or all of the data set into a buffer in main memory. The copy in the buffer is then updated by the updater. When the updating is complete, the contents of the buffer are then copied back to replace the original data in the data set.
If a data set is being updated whilst a backup copy is being made, the backup copy thus obtained will require further processing before it can be used to recreate the original data set because:
Data residing in buffers at the start of the copy operation may not be reflected in the copied version of the data set; and
Updates made during the copy operation may not be reflected in the copied version of the data set.
These deficiencies may be remedied at restore time by using a forward recovery process, provided that the system performing the updates writes a forward recovery log.
If a time can be established which precedes the time of creation of the oldest data held in buffers at the start of the copy operation, the missing data may be recreated by forward recovering from this time. This time will be referred to as the `Basic Recovery Time`.
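As a rough illustration only, one way to obtain such a time for a single updater is to take the earliest first-log-entry time among the units of work that are still in flight when the copy starts. The sketch below assumes that only in-flight units of work can have data held in buffers, that timestamps are plain numbers, and that the function name `basic_recovery_time` and its arguments are hypothetical.

```python
from typing import Dict


def basic_recovery_time(first_log_time_by_uow: Dict[str, float],
                        copy_start: float) -> float:
    """Return a time no later than the oldest change that may exist only in buffers.

    first_log_time_by_uow maps each in-flight unit of work to the time of its
    first log entry. With no in-flight units of work, the copy start time is used.
    """
    return min([copy_start, *first_log_time_by_uow.values()])


# Hypothetical values: two units of work are in flight when the copy starts at 120.0.
start_from = basic_recovery_time({"U1": 100.0, "U2": 95.0}, copy_start=120.0)
```

Forward recovery would then reapply every logged after image written at or after `start_from`. Starting earlier than strictly necessary is safe because reapplying an after image that is already present leaves the data unchanged.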
An algorithm that allows a Basic Recovery Time to be calculated for use with a data set copy made whilst a single system is still updating the data set has been used for some time in a number of products from IBM Corporation, such as CICS, DSS, VSAM and CICS/VR. In a shared environment, however, resources may be updated by a number of systems concurrently. The algorithm functions only whilst the data set is open for update by a single updater. If there are multiple updaters, the Recovery Time that needs to be recorded is different for each of the updaters sharing the data set.
It would therefore be desirable to provide a method of taking backup copies of data sets when these data sets are open for update and are being updated by many systems in a shared environment. It would also be desirable to provide a method of recovering the original data sets from these backup copies and the forward recovery log or logs created by the updating systems.
The individual log records may contain tokens that can be mapped to the name of the data set being recovered. Tokens are used instead of the name of the data set in order to reduce the amount of data that has to be logged. The tokens are mapped to data set names by using additional log records called tie-up records (TURs). For non-BWO backups, the forward recovery utility uses these TURs to apply the log records to the correct data sets.
The forward recovery process needs to access a set of all the relevant tie-up records. These will have been written before the Basic Recovery Time mentioned above. The most recent time at which a full set of tie-up records has been written on the log is referred to as the `Recovery Time`.
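The role of the tie-up records can be sketched as follows. The record layouts, the function `route_updates` and the data set name `PAYROLL` are hypothetical and serve only to show why the log must be read from a point at or before a full set of tie-up records.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple, Union


@dataclass
class TieUpRecord:
    time: float
    token: int            # short token used in place of the data set name
    data_set_name: str


@dataclass
class UpdateRecord:
    time: float
    token: int            # identifies the data set via a tie-up record
    key: str
    after: str            # after image of the data element


LogEntry = Union[TieUpRecord, UpdateRecord]


def route_updates(log: List[LogEntry],
                  start_time: float) -> Dict[str, List[Tuple[str, str]]]:
    """Group after images by data set name, reading the log from start_time onward."""
    names: Dict[int, str] = {}
    routed: Dict[str, List[Tuple[str, str]]] = {}
    for rec in log:
        if rec.time < start_time:
            continue
        if isinstance(rec, TieUpRecord):
            names[rec.token] = rec.data_set_name
        elif rec.token not in names:
            # No tie-up record seen yet: the update cannot be matched to a data set.
            raise LookupError(f"no tie-up record for token {rec.token}")
        else:
            routed.setdefault(names[rec.token], []).append((rec.key, rec.after))
    return routed


log: List[LogEntry] = [
    TieUpRecord(time=10.0, token=1, data_set_name="PAYROLL"),
    UpdateRecord(time=20.0, token=1, key="k", after="v"),
]
routed = route_updates(log, start_time=10.0)
```

Reading the same log from a start time of 15.0 would skip the tie-up record and raise `LookupError`, which is why the starting point passed to forward recovery has to fall at or before a complete set of tie-up records.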
It would be desirable to be able to easily determine this `Recovery Time` in a system with data sets open for shared update. This `Recovery Time` is then communicated to the forward recovery process.