The present invention works in the context of the Tandem "remote data facility" (RDF) technology disclosed in U.S. patent application Ser. No. 08/790,544, filed Jan. 30, 1997, which is hereby incorporated by reference as background information.
The best known and most widely used method of synchronizing a backup database with a primary database uses a procedure sometimes called Backup/Restore (i.e., backing up the database to tape and restoring from tape to a new disk or set of disks). This method generates a snapshot of the database in an internally consistent state, but requires significant downtime for applications on the primary system while the database files are being backed up. As databases have grown considerably in size over the years, the application downtime needed to back up the database files has grown correspondingly. Some databases are so large that the Backup/Restore method requires weeks to synchronize a backup database with its primary database. Since the applications on the primary system must be turned off during this entire process, this is not an acceptable solution in most situations.
As an alternative, a backup database can be generated by detaching one disk of each mirrored pair in the primary system and then reviving these mirrors against unmirrored disks on the backup system. This method can be accomplished relatively quickly, but it carries a risk that many database customers are unwilling to take. For example, after having detached the mirror ($A') on the primary, one would roll in a new disk and revive it against the existing disk ($A). If, however, one encountered any hardware problems in reviving $A' and was unwilling to run one's applications on unmirrored disks, then one could be faced with an extended outage, depending on the nature of the hardware failure.
Increasingly, customers with massive databases are insisting that they cannot bring their applications down at all. That is, the backup database has to be generated, initialized and synchronized with the primary database without having any impact whatsoever on the applications that are using and modifying the primary database, and further these customers are unwilling to risk running the primary system without mirrored disks.
A different but related problem involves the case where one has temporarily lost one's primary system, switched operations to the backup system, and has been running the applications at that backup system for some time. When the original primary system comes back online, how does one synchronize the databases? If the outage on the primary was planned, then there is a standard procedure whereby one can bring the database on the primary back into synchronization with the database on the backup system. Essentially, the primary system is resynchronized with the backup system by creating and storing the audit records for all committed transactions performed on the backup system, and then performing "redo" operations for those audit records on the primary system.
If, however, the outage was unplanned, then when the primary comes back online, one must perform a full database synchronization.
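As a minimal sketch of the planned-outage procedure described above, the following Python fragment replays onto the primary only those audit records that belong to committed transactions. The record layout (`tx`, `op`, `key`, `value`) and the function name `redo_committed` are hypothetical, chosen for illustration only; they do not represent the actual RDF audit format.

```python
def redo_committed(audit_trail, database):
    """Apply 'redo' for the writes of committed transactions, in trail order."""
    # Hypothetical layout: each record is a dict with "tx" and "op" fields;
    # "write" records also carry "key" and "value".
    committed = {rec["tx"] for rec in audit_trail if rec["op"] == "commit"}
    for rec in audit_trail:
        if rec["op"] == "write" and rec["tx"] in committed:
            database[rec["key"]] = rec["value"]
    return database


# Audit generated on the backup system while it was running the applications.
trail = [
    {"tx": 1, "op": "write", "key": "acct-1", "value": 100},
    {"tx": 2, "op": "write", "key": "acct-2", "value": 50},
    {"tx": 1, "op": "commit"},
    # Transaction 2 never committed, so its write is not redone.
]
primary = redo_committed(trail, {"acct-1": 90})
```

In this sketch the write of transaction 1 is redone on the primary, while the write of the uncommitted transaction 2 is skipped.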
Previously considered solutions all boil down to the common theme of developing a program that would open database files, create duplicates on the remote system, and then read through the primary system's files, deleting and re-inserting each record.
Because the files are audited, the deletes and inserts would generate audit records that would be sent to the backup system and then applied to the backup database. Due to the number of disk operations required, and the use of record locks that compete with the record locks held by the applications, the delete/insert technique is very time-consuming.
There are well-known problems with this technique. As an example, suppose an application updates a record in a given file before the delete/insert program replicates the record on the backup system. The Updater will encounter a file system error when attempting to apply that update because the record does not yet exist in the backup database file. Should this error be suppressed? The delete/insert technique will also produce an error when the Updater applies the delete operation of a delete/insert pair, again because the record does not yet exist in the backup database. Should one suppress all such errors, since these errors are really just expected parts of the replication process? How does one inform the Updater when to suppress such errors and when not to suppress them?
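The two error cases described above can be illustrated with a small simulation. This is a sketch under stated assumptions, not the behavior of any actual Updater: the `AuditRecord` type, the `apply_audit` function, and the use of a Python dictionary to stand in for the backup database file are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class AuditRecord:
    op: str                      # "insert", "update", or "delete"
    key: str
    value: Optional[str] = None


def apply_audit(backup, trail):
    """Apply audit records in order; collect those that hit a missing record."""
    errors = []
    for rec in trail:
        if rec.op == "insert":
            backup[rec.key] = rec.value
        elif rec.key not in backup:
            # Update or delete of a record not yet replicated on the backup.
            errors.append(rec)
        elif rec.op == "update":
            backup[rec.key] = rec.value
        else:  # delete of an existing record
            del backup[rec.key]
    return errors


trail = [
    AuditRecord("update", "k2", "v2-new"),  # application update before k2 is replicated
    AuditRecord("delete", "k1"),            # delete/insert pair for k1
    AuditRecord("insert", "k1", "v1"),
    AuditRecord("delete", "k2"),            # delete/insert pair for k2
    AuditRecord("insert", "k2", "v2-new"),
]
backup = {}
errors = apply_audit(backup, trail)
```

Starting from an empty backup, the early application update and the delete half of each delete/insert pair all fail because the target record is absent, yet the backup still ends up with the correct contents. The simulation cannot tell such expected errors apart from genuine ones, which is the difficulty raised by the questions above.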
There are undoubtedly ways to solve the problems associated with this delete/insert method of database synchronization, but it is not clear that all of its problems have even been identified yet, and some of the known problems remain unsolved.
Taking into account bandwidth requirements, the amount of time it would take the delete/insert program to perform all its tasks, and the amount of time it would take to transmit the resulting audit records to the backup system and apply them to the backup database, the time required to complete database synchronization would be considerable. Performing this synchronization method on a system that is performing hundreds of transactions per second will slow down the synchronization process further. It is estimated that the delete/insert synchronization method would likely take weeks to complete for a database of a few hundred gigabytes. Clearly, such a long database synchronization time is unacceptable.
It is an object of the present invention to provide a system and method for generating and synchronizing a backup database with a primary database efficiently and relatively quickly, even when the database being replicated has hundreds of gigabytes of data.
Another object of the present invention is to provide a system and method for creating a fuzzy copy of a primary database, to be used as a backup database, and for then efficiently synchronizing the fuzzy backup database with the primary database.