1. Field of the Invention
This invention generally relates to the area of data base management systems and more particularly to the area of data base backup and recovery.
2. Description of the Prior Art
Businesses and society in general are increasingly relying on the availability of data processing systems and the information they process. The cost of a failed data processing system to a business can be enormous, both in terms of idled resources and the opportunity costs associated with unprocessed information. There may be employees whose jobs are directly tied to the availability of the system, as well as business transactions which are tied to system availability. Thus, when a data processing system fails, there may be costs associated with unproductive labor and lost business opportunities such as sales of airline tickets, sales of shares of stock, or transfers of money. In short, data base availability can mean the difference between success and failure in the marketplace.
One application for which data processing systems are used and for which system availability is of great importance is in the area of transaction processing. Transaction processing generally entails maintaining a data base of information and processing user requests against the data base. The user requests would typically be for reading, adding, deleting, and updating information in the data base. Each such operation is commonly referred to as a transaction.
Systems which provide transaction processing services are susceptible to failures in at least three different areas of processing, as indicated by Haerder and Reuter in "Principles of Transaction-Oriented Database Recovery", Computing Surveys, Vol. 15, No. 4, December 1983, pp. 287, 290. The failures can be at the transaction level, the media level, or the system level.
Transaction level failures include things such as incorrect input data and deadlock situations, both of which would keep a transaction from being completely processed. In most instances transaction level failures can be remedied by attempting to reprocess the transaction.
Media level failures include secondary storage device failures and operating system bugs, both of which can result in failed read or write operations on the secondary storage device. Secondary storage devices are used for non-volatile storage of the data base, and are also useful in applications where the amount of data in the data base exceeds the primary storage capacity of the host system. In the event of a media failure, and especially in the case of a failed secondary storage device, all or part of the information in the data base may be inaccessible. The present invention provides a fast and reliable system for data base recovery as will be discussed in detail later.
The third category of failure discussed by Haerder and Reuter is the system failure. System failures include such things as bugs in the data base management system code, operating system faults, general hardware failures, and natural or human disasters. The consequence of such a failure is that the host system for the transaction processing system is generally unavailable and no further transactions can be processed. Until the system can be returned to an operable state, the activities of those relying on the transaction processing system will be sharply curtailed, if not stopped completely. For system failures caused by disasters such as floods, fires, and acts of terrorism, it would be highly unlikely that the system could be returned to an operable state at all. Protection against disasters is quite often provided by a standby system located at a site far enough removed from the site suffering the disaster that the standby system is not threatened by the same disaster.
The period of time for which a transaction processing application user is willing to forego use of the system due to its unavailability largely depends upon the nature of the application. For example, a home user updating a mailing list application on a system that fails may not feel a great sense of urgency to recover the data held in the inaccessible data base; whereas a large bank whose transaction processing application updates hundreds of thousands of accounts on a daily basis would experience an extreme sense of urgency in recovering its data base. The willingness to invest in the extra equipment necessary to ensure a fast recovery time is largely dependent upon the tradeoff between the down-time costs associated with an unavailable system and the hardware costs to provide a standby system.
Two commonly used methods for data base backup and recovery include the audit trail method and the synchronized data base method. Each is discussed briefly below.
The audit trail method involves making a copy of a primary data base at a selected time. The primary data base is the data base against which transactions are processed, and the copy is referred to as the "backup data base". The backup data base may be stored in a non-volatile storage medium, such as one or more magnetic tapes. After making the backup copy, audit information relating to all updates to the primary data base is logged to an audit trail, which may be stored on a magnetic tape or other non-volatile storage medium. The particular audit information saved may vary from system to system and may include one or more of the following: the new updated record, the old record, the difference between the old record and new updated record, and the operation performed on the record. Only transactions which cause updates to the data base need be logged since operations such as a read will have no effect on the state of the data base.
Should the primary data base become inaccessible, the data in the primary data base must be recovered. The state of the primary data base just after the last update operation can be reconstructed using the backup data base and the audit trail. The backup data base is read from the magnetic tapes and written to a direct access storage device. Then, each entry in the audit trail is sequentially read and applied to the data base on the direct access storage device. When the entire audit trail has been processed, the backup data base will be in a state identical to that of the primary data base just prior to when it became inaccessible. The backup data base can then be made the primary data base against which subsequent transactions can be processed.
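The recovery procedure described above can be illustrated with a minimal sketch. It assumes an in-memory dictionary stands in for the data base and a list stands in for the audit trail on tape; the names `AuditEntry`, `apply_entry`, `update`, and `recover` are hypothetical and chosen only for illustration. Each audit entry here records the operation and the new updated record, which is one of the several kinds of audit information mentioned above.

```python
import copy
from dataclasses import dataclass
from typing import Any, Optional


@dataclass(frozen=True)
class AuditEntry:
    """One logged update (hypothetical format: operation plus new record)."""
    op: str                      # "put" (add or update) or "delete"
    key: str
    new_value: Optional[Any] = None


def apply_entry(db: dict, entry: AuditEntry) -> None:
    """Apply a single logged update to a data base image."""
    if entry.op == "put":
        db[entry.key] = entry.new_value
    elif entry.op == "delete":
        db.pop(entry.key, None)
    else:
        raise ValueError(f"unknown operation: {entry.op}")


def update(primary: dict, audit_trail: list, entry: AuditEntry) -> None:
    """Log the update to the audit trail, then apply it to the primary."""
    audit_trail.append(entry)
    apply_entry(primary, entry)


def recover(backup: dict, audit_trail: list) -> dict:
    """Restore the backup copy and replay every audit entry in order."""
    restored = copy.deepcopy(backup)
    for entry in audit_trail:
        apply_entry(restored, entry)
    return restored
```

In this sketch the work done by `recover` grows with the number of entries in the audit trail, since every logged update is replayed one at a time.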
It is recognized that with the audit trail method, the length of time required to recover a data base is directly dependent upon the number of data base updates logged to the audit trail. The number of data base updates logged to the audit trail is in turn directly dependent upon the transaction processing application for which the audit trail is kept. For large banks, the application will most likely involve millions of update transactions, and will therefore generate a lengthy audit trail and a correspondingly long recovery time.
To ameliorate the lengthy recovery time involved with the audit trail method, a second method, referred to as the "synchronized data base method", is also used for data base recovery. The synchronized data base method involves a primary data base on which transactions are first processed, and a secondary data base to which update transactions are applied in order to keep the secondary data base synchronized with the primary data base.
The synchronized data base method involves two hosts: the first host having the transaction processing system for the primary data base, and the second host having a transaction processing system for the secondary data base. Before the transaction processing system for the primary data base completes an update transaction, the update is sent to the transaction processing system for the secondary data base. Once the transaction processing system for the secondary data base has received the update transaction, two processing possibilities exist. In the first approach, the transaction processing system for the secondary data base sends an acknowledgement to the transaction processing system for the primary data base before the transaction has been applied to the secondary data base, to signal that the update transaction has been received. In the second approach, the transaction processing system for the secondary data base sends an acknowledgement only after the update transaction has been applied to the secondary data base and stored in secondary storage.
The first approach has the benefit that the transaction existence time is not unduly prolonged by virtue of sending the update operation to the transaction processing system for the secondary data base. However, this approach has the drawback that, at the time of the acknowledgement, the update transaction is still held only in the primary storage of the second host and has not been applied to the secondary storage of the secondary data base. The risk taken by this approach is that if the host for the transaction processing system of the primary data base fails, and the host for the transaction processing system of the secondary data base also fails before the update transaction can be applied to the secondary storage of the secondary data base, the update transaction will be permanently lost, even though the transaction processing system of the primary data base has proceeded under the assumption that the transaction was secured by the secondary transaction processing system.
The advantage of the second approach is that once the host system with the primary data base has received an acknowledgement, the data base update is sure to be secured in the secondary data base. However, the disadvantage of this approach is that the transaction existence time for the update transaction may be lengthened by the time it takes for the transaction processing system of the secondary data base to make the necessary update to secondary storage. This increase in processing time is due to the fact that the access speed to secondary storage is much slower than the access speed to primary storage. The tradeoff for the security of having the data base update committed to secondary storage is an increase in processing time for a transaction. For applications processing large numbers of transactions, this delay may be unacceptable.
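The difference between the two acknowledgement approaches can be illustrated with a small simulation. This is a sketch under stated assumptions, not a description of any particular system: the class `SecondaryHost`, its methods, and the `ack_after_apply` flag are hypothetical names, a queue stands in for the primary (volatile) storage of the second host, and a dictionary stands in for its secondary (non-volatile) storage.

```python
from collections import deque


class SecondaryHost:
    """Hypothetical model of the host holding the secondary data base."""

    def __init__(self) -> None:
        self.pending = deque()   # received updates held in volatile primary storage
        self.durable = {}        # stands in for non-volatile secondary storage

    def receive(self, key: str, value, ack_after_apply: bool) -> str:
        """Accept an update and return an acknowledgement.

        ack_after_apply=False models the first approach: acknowledge on
        receipt, before the update reaches secondary storage.
        ack_after_apply=True models the second approach: apply the update
        to secondary storage first, then acknowledge.
        """
        self.pending.append((key, value))
        if ack_after_apply:
            self.flush()
        return "ack"

    def flush(self) -> None:
        """Apply all pending updates to secondary storage."""
        while self.pending:
            key, value = self.pending.popleft()
            self.durable[key] = value

    def crash(self) -> None:
        """Simulate a host failure: volatile contents are lost."""
        self.pending.clear()


# First approach: the acknowledgement arrives quickly, but a failure
# before the flush loses the update.
early = SecondaryHost()
early.receive("acct1", 150, ack_after_apply=False)
early.crash()

# Second approach: the acknowledgement is delayed by the write, but the
# update survives the same failure.
late = SecondaryHost()
late.receive("acct1", 150, ack_after_apply=True)
late.crash()
```

After the simulated failures, `early.durable` is empty while `late.durable` still holds the update, which mirrors the tradeoff between transaction existence time and data security described above.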
The current state of data base backup and recovery systems involves the tradeoff between ensuring data security and minimizing the time to recover a data base. Therefore, it would be desirable for a system to provide a secure backup data base and quick recovery of a data base, without adversely impacting the transaction processing rate.