This invention relates to database computer systems and applications that execute on them. More particularly, this invention relates to methods for backing up a database so that the data therein is recoverable from media failures.
Computer systems occasionally crash. A "system crash" is an event in which the computer quits operating the way it is supposed to operate. Common causes of system crashes include power outage, application operating error, and other unknown and often unexplained malfunctions that tend to plague even the best-devised systems and applications. System crashes are unpredictable, and hence, essentially impossible to anticipate and prevent.
A system crash is at the very least annoying, and may result in serious or irreparable damage. For standalone computers or client workstations, a local system crash typically results in loss of work product since the last save interval. The user is inconvenienced by having to reboot the computer and redo the lost work. For servers and larger computer systems, a system crash can have a devastating impact on many users, including both a company's employees and its customers.
Being unable to prevent system crashes, computer system designers attempt to limit the effect of system crashes. The field of study concerning how computers recover from system crashes is known as "recovery." Recovery from system crashes has been the subject of much research and development.
In general, the goal of redo recovery is to return the computer system after a crash to a previous and presumed correct state in which the computer system was operating immediately prior to the crash. Then, transactions whose continuations are impossible can be aborted. Much of the recovery research focuses on database recovery for database computer systems, such as network database servers or mainframe database systems. Database system designers attempt to design database recovery techniques that minimize the amount of data lost in a system crash, minimize the amount of work needed following the crash to recover to the pre-crash operating state, and minimize the performance impact of recovery on the database system during normal operation.
FIG. 1 shows a database computer system 20 having a computing unit 22 with processing and computational capabilities 24 and a volatile main memory 26. The volatile main memory 26 is not persistent across crashes and hence is presumed to lose all of its data in the event of a crash. The computer system also has a non-volatile or stable database 28 and a stable log 30, both of which are contained on stable memory devices, e.g. magnetic disks, tapes, etc., connected to the computing unit 22. The stable database 28 and log 30 are presumed to persist across a system crash. The stable database 28 and log 30 can be combined in the same storage, although they are illustrated separately for discussion purposes.
The volatile memory 26 stores one or more applications 32, which execute on the processor 24, and a resource manager 34. The resource manager 34 includes a volatile cache 36, which temporarily stores data destined for the stable database 28. The data is typically stored in the stable database and volatile cache in individual units, such as "pages." A cache manager 38 executes on the processor 24 to manage movement of data pages between the volatile cache 36 and the stable database 28. In particular, the cache manager 38 is responsible for deciding which data pages should be moved to the stable database 28 and when the data pages are moved. Data pages that are moved from the cache to the stable database are said to be "flushed" to the stable state. In other words, the cache manager 38 periodically flushes the cached state of a data page to the stable database 28 to produce a stable state of that data page which persists in the event of a crash, making recovery possible.
The resource manager 34 also has a volatile log 40 that temporarily stores computing operations to be moved into the stable log 30. A log manager 42 executes on the processor 24 to manage when the operations are moved from the volatile log 40 to the stable log 30. The transfer of an operation from the volatile log 40 to the stable log 30 is known as a log flush.
During normal operation, an application 32 executes on the processor 24. The resource manager 34 receives requests to perform operations on data from the application. As a result, data pages are transferred to the volatile cache 36 on demand from the stable database 28 for use by the application. During execution, the resource manager 34 reads, processes, and writes data to and from the volatile cache 36 on behalf of the application. The cache manager 38 determines, independently of the application, when the cached data state is flushed to the stable database 28.
Concurrently, the operations being performed by the resource manager on behalf of the application are being recorded in the volatile log 40. The log manager 42 determines, as guided by the cache manager 38 and the transactional requirements imposed by the application, when the operations are posted as log records on the stable log 30. A logged operation is said to be "installed" when it does not need to be replayed in order to recover the database state. This is usually accomplished by flushing the versions of the pages containing the changes made by the operation to the stable database 28.
When a crash occurs, the application state (i.e., address space) of any executing application 32, the data pages in volatile cache 36, and the operations in volatile log 40 all vanish. The computer system 20 invokes a recovery manager which begins at the last flushed state on the stable database 28 and replays the operations posted to the stable log 30 to restore the database of the computer system to the state as of the last logged operation just prior to the crash.
One prior art approach to database recovery is to require the cache manager to flush the entire cache state periodically. The last such flushed state is identified in a "checkpoint record" that is inserted into the stable log. During recovery, a redo test is performed to determine whether a logged operation needs to be redone to help restore the system to its pre-crash state. The redo test is simply whether an operation follows the last checkpoint record on the log. If so (meaning that a later operation occurred and was posted to the stable log, but the results of the operation were not installed in the stable database), the computer system performs a redo operation using the log record.
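The checkpoint-based redo test described above can be sketched as follows. This is a minimal illustration only: the names `LogRecord` and `recover`, the use of a log sequence number (`lsn`), and the treatment of every operation as a simple page write are assumptions made for the example, not details taken from the prior art system.

```python
from dataclasses import dataclass


@dataclass
class LogRecord:
    lsn: int            # position in the stable log (assumed numbering)
    kind: str           # "op" or "checkpoint"
    page: str = ""      # page written by an "op" record
    value: int = 0      # value written by an "op" record


def recover(stable_db, stable_log):
    """Redo every logged operation that follows the last checkpoint record."""
    db = dict(stable_db)
    last_ckpt = max(
        (r.lsn for r in stable_log if r.kind == "checkpoint"), default=-1
    )
    redone = []
    for rec in stable_log:
        # The redo test: does the operation follow the last checkpoint?
        if rec.kind == "op" and rec.lsn > last_ckpt:
            db[rec.page] = rec.value
            redone.append(rec.lsn)
    return db, redone
```

Operations that precede the checkpoint are skipped because the checkpoint guarantees that their results were already flushed with the entire cache state.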
This simple approach has a major drawback in that writing every change of the cached state out to the stable database 28 is practically infeasible because it involves a high volume of input/output (I/O) activity that consumes a disproportionate amount of processing resources and slows the system operation. It also requires atomic flushing of multiple pages, which is a troublesome complication. This was the approach used in System R, described in Gray, McJones, et al., The Recovery Manager of the System R Database Manager, ACM Computing Surveys 13,2 (June, 1981) pages 223-242.
Crash recovery requires that the stable database 28 be accessible and correct. Media recovery provides recovery from failures involving data in the stable database. It is also a last resort to cope with erroneous applications that have corrupted the stable database. In some systems, to guard against stable database failures, the media recovery system provides an additional copy of the database called a backup database 29, and a media recovery log (e.g., stable log 30) is applied to the backup database 29 to roll its state forward to the desired state, usually the most recent committed state. To recover from failures, the media recovery system first restores the stable database 28 by copying the backup database 29, perhaps stored on tertiary storage, to the usual secondary storage that contains the stable database 28. Then the media recovery log operations are applied to the restored stable database 28 to "roll forward" the state to the time of the last committed transaction (or to some designated earlier time).
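The restore and roll-forward sequence can be illustrated with a short sketch. The function name `media_recover`, the `(lsn, page, value)` log record layout, and the `up_to_lsn` parameter (standing in for "some designated earlier time") are hypothetical, and the log is assumed to hold simple page writes in ascending order.

```python
def media_recover(backup_db, media_log, up_to_lsn=None):
    """Restore from the backup copy, then roll forward using the media log.

    media_log is assumed to be a list of (lsn, page, value) tuples sorted
    by lsn. If up_to_lsn is given, roll forward stops at that point.
    """
    # Step 1: restore the stable database by copying the backup.
    restored = dict(backup_db)
    # Step 2: apply logged operations to roll the state forward.
    for lsn, page, value in media_log:
        if up_to_lsn is not None and lsn > up_to_lsn:
            break
        restored[page] = value
    return restored
```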
Backing up the stable database 28 is considered to be "on-line" if it is concurrent with normal database activity, and is considered to be "off-line" if concurrent activity is precluded. Restoring the erroneous part of the stable database 28 with a copy from the backup database 29 is usually an off-line process. Media failure frequently precludes database activity, so the database usually has to be off-line during restore. Off-line restore has little impact on availability because restore only occurs after media failure, which is a low frequency event. Off-line restore poses no technical problems unique to logical operations. High availability requires on-line backup. Thus, on-line backup is desirable, especially when logical operations, those that involve more than a single object or page, are logged.
Rolling forward the restored stable database 28 involves redo recovery, which, for logical operations, has been described in Lomet, D. and Tuttle, M., Redo Recovery From System Crashes, VLDB Conference, Zurich, Switzerland (September 1995) 457-468, and Lomet, D. and Tuttle, M., Logical Logging To Extend Recovery To New Domains, ACM SIGMOD Conference, Philadelphia, Pa. (May 1999) 73-84.
Traditionally, database systems exploit two kinds of log operations: physical operations and physiological operations. A physical operation updates exactly one database object. No objects are read, and data values to be used in the update come from the log record itself. An example of this is a physical page write, where the value of the target page is set to a value stored in the log record. A physiological operation, as described in Gray, J. and Reuter, A., Transaction Processing: Concepts and Techniques, Morgan Kaufmann (1993) San Mateo, Calif., also updates a single object, but it reads that object as well. Hence, a physiological operation denotes a change in the page value (a state transition). This avoids the need to store the entire new value for the target page in the log record. An example of this is the insert of a record onto a page. The page is read, the new record (whose value is stored in the log record) is inserted, and the result is written back to the page.
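The difference between the two page-oriented operation types can be shown with a small sketch. The function names, the dictionary-based log records, and the representation of a page as a list of records are assumptions made purely for illustration.

```python
def redo_physical(db, rec):
    """Physical operation: the new page value comes entirely from the log."""
    # Nothing is read; the target page is overwritten with the logged value.
    db[rec["page"]] = list(rec["new_value"])


def redo_physiological(db, rec):
    """Physiological operation: read the page, change it, write it back."""
    page = list(db[rec["page"]])   # read the current page value
    page.append(rec["record"])     # insert the record stored in the log
    db[rec["page"]] = page         # write the result back to the same page
```

The physical log record carries the whole page, while the physiological record carries only the inserted record, which is why the latter is usually much smaller.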
These two forms of log operations (also called page-oriented operations) make cache management particularly simple. Updated (dirty) objects in the cache can be flushed to the stable database 28 in any order, so long as the write ahead log (WAL) protocol is obeyed. For databases, pages are the recoverable objects and records are frequently the unit of update. Both are small. Thus, the importance of simple cache management can be allowed to control the form of log operation, thereby restricting operations to the traditional varieties.
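The write ahead log (WAL) rule mentioned above can be sketched as follows: before a dirty page reaches the stable database, every log record up to that page's latest update is forced to the stable log. The class `ResourceManager` and its method names are hypothetical, and updates are modeled as simple page writes.

```python
class ResourceManager:
    """Toy model of a volatile cache and log obeying the WAL protocol."""

    def __init__(self):
        self.volatile_log, self.stable_log = [], []
        self.cache, self.stable_db = {}, {}
        self.next_lsn = 1

    def update(self, page, value):
        """Record the operation in the volatile log and update the cache."""
        lsn = self.next_lsn
        self.next_lsn += 1
        self.volatile_log.append((lsn, page, value))
        self.cache[page] = (value, lsn)
        return lsn

    def flush_log(self, up_to_lsn):
        """Move log records with lsn <= up_to_lsn to the stable log."""
        while self.volatile_log and self.volatile_log[0][0] <= up_to_lsn:
            self.stable_log.append(self.volatile_log.pop(0))

    def flush_page(self, page):
        """Flush one dirty page; the log is forced first (WAL)."""
        value, lsn = self.cache[page]
        self.flush_log(lsn)
        self.stable_db[page] = value
```

With page-oriented operations this is the only ordering constraint, so dirty pages can otherwise be flushed in any order.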
When extending recovery to new domains, the cost of logging may become the dominant consideration. Logical log operations can greatly reduce the amount of data written to the log, and hence reduce the normal system operation cost of providing recovery. A log operation is logical, as opposed to page-oriented, if the operation can read one or more objects (pages) and write (potentially different) multiple objects.
Some examples of how logical logging can substantially reduce the amount of logging required during normal execution, and hence reduce recovery overhead include application recovery, file system recovery, and database recovery, as described below.
Logical log operations for recovering an application state include: (1) R(X,Appl) in which an application "Appl" reads an object or file X into its input buffer, transforming its state to a new state Appl'. Unlike page-oriented operations, the values of X and Appl' are not logged. (2) Wl(Appl, X) in which Appl writes X from its output buffer and the application state is unchanged. Unlike page-oriented operations, the new value of X is not logged. (3) Ex(Appl) in which the execution of Appl between resource manager calls is a physiological operation that reads and writes the state of Appl. Execution begins when control is returned to Appl, and results in the new state when Appl next calls the resource manager. Parameters for Ex(Appl) are in the log record.
Logical log operations can reduce logging cost for file system recovery. A copy operation copies file X to file Y. This same operation form describes a sort operation, where X is the unsorted input and Y is the sorted output. In neither case are the data values of X or Y logged. The transformations are logged with source and target file identifiers. Were page-oriented operations used, one could not avoid logging the data value of Y or X.
Logical log operations are useful in database recovery, e.g., for B-tree splits. A split operation moves index entries with keys greater than the split key from the old page to the new page. A logical split operation avoids logging the initial contents of the new B-tree page, which is unavoidable when using page-oriented operations.
The logging economy of logical operations is due to the operand identifiers being logged instead of operand data values because the data values can come from many objects in the stable state. Because operands can have very large values, e.g., page size or larger, logging an identifier (unlikely to be larger than 16 bytes) is a great savings. With applications or files, values may be measured in megabytes.
Logical log operations complicate cache management because cached objects can have flush order dependencies. As an example, for the operation copy(X,Y), which copies the value of object X to the object Y, the updated value of Y must be flushed to the stable database 28 before object X (if it has been subsequently updated) is flushed to the stable database 28, because flushing X overwrites its old value. If an updated X is flushed before Y is flushed, a system failure will lose the old value of X needed to make replay of the copy operation possible. Hence, a subsequent redo of the copy operation will not produce the correct value for Y. These flush dependencies complicate the task of high speed on-line backup.
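The flush order dependency for copy(X,Y) can be demonstrated with a small simulation. The `replay` function, the per-object `lsn` stamp used as a redo test, and the `"set"` operation standing in for a later update of X are all assumptions introduced for this sketch.

```python
def replay(stable, log):
    """Redo each logged operation whose target's stable version predates it.

    stable maps object name -> {"val": value, "lsn": lsn of last installed op}.
    log is a list of (lsn, op, args): ("copy", (src, dst)) or ("set", (obj, val)).
    """
    state = {name: dict(obj) for name, obj in stable.items()}
    for lsn, op, args in log:
        target = args[1] if op == "copy" else args[0]
        if state[target]["lsn"] >= lsn:
            continue  # already installed in the stable state, no redo needed
        if op == "copy":
            src, dst = args
            # The logical operation reads src from the current stable state.
            state[dst] = {"val": state[src]["val"], "lsn": lsn}
        else:
            obj, val = args
            state[obj] = {"val": val, "lsn": lsn}
    return {name: obj["val"] for name, obj in state.items()}


# Initially X=1 and Y=0. Operation 1 copies X to Y; operation 2 updates X to 2.
LOG = [(1, "copy", ("X", "Y")), (2, "set", ("X", 2))]

# Wrong order: updated X flushed before Y, then a crash.
BAD_STABLE = {"X": {"val": 2, "lsn": 2}, "Y": {"val": 0, "lsn": 0}}

# Correct order: Y flushed first, then a crash before X is flushed.
GOOD_STABLE = {"X": {"val": 1, "lsn": 0}, "Y": {"val": 1, "lsn": 1}}
```

Replaying `LOG` against `GOOD_STABLE` recovers X=2 and Y=1, whereas replaying it against `BAD_STABLE` copies the already overwritten X and leaves Y with the incorrect value 2.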
In early database systems, the database was taken off-line while a backup was taken. This permitted a transaction or operation consistent view of the database to be copied at high speed from the "stable" medium of the database. Such off-line techniques work for log-based recovery schemes and permit high speed backup. However, the system is then unavailable during the backup process. Current availability requirements usually preclude this approach.
A conventional method of backup is a "fuzzy dump" which depends upon constructing the backup by copying directly from the stable database 28 to the backup database 29, independent of the cache manager 38. Therefore, the state captured in the backup database is fuzzy with respect to transaction boundaries. Coordination between the backup process and active updating when traditional log operations are used occurs at the disk arm. That is, backup captures the state of an object either before or after some disk page write, assuming I/O page atomicity. The backup database remains recoverable because page-oriented operations permit the flushing of pages to a stable database in any order. Because logged operations are all page-oriented, the backup database is operation consistent, i.e., results of an operation are either entirely in the backup database, or are entirely absent, and selective redo of logged operations whose results are absent from the backup database will recover the current active stable database.
The media recovery log includes all operations needed to bring objects up-to-date. The on-line system, which is updating the stable database and logging the update operations, does not know precisely when an object is copied to the backup database, and so is preferably "synchronized" with the backup process to ensure the log will contain the needed operations. For page-oriented log operations, synchronization between backup and the cache manager only occurs at the beginning of the backup. (Data contention during backup to read or write pages is resolved by disk access order.) The media recovery log scan start point can be the crash recovery log scan start point at the time backup begins. The backup database will include all operation results currently in the stable database at this point, plus some that are posted during the backup. Hence, this log, as subsequently updated by later updates, can provide recovery to the current state from the backup database as well as from the stable database. Subsequently, backup is independent of the cache manager, and can exploit any technique to effect a high speed copy. This usually involves sweeping through the stable database copying pages in a convenient order, e.g., based on physical location of the data. Different parts can be copied in parallel as well.
Incremental backup methods have also been described in which a temporal index structure can be managed to ensure there is adequate redundancy so that recovery can restore the current state. But this approach cannot currently be exploited because database systems lack temporal index support.
Conventional database backup methods do not work with logical operations and cannot support an on-line backup involving high speed copying while update activity continues. The fuzzy backup technique described above depends on logged operations being page-oriented. But logical log operations can involve multiple pages (or objects) and updated objects (e.g., pages) must be flushed (copied) to the stable database 28 in a careful order for the stable database 28 to remain recoverable. Objects updated by logical operations have the same ordering constraints when flushed to the backup database 29 to ensure correct media recovery.
A fundamental problem is that flush dependencies must be enforced on two databases, the stable database 28 and the backup database 29. A "logical" solution to this problem is to stage all copying from the stable database 28 to the backup database 29 through the cache manager, and flush dirty data synchronously (a "linked" flush) to both the stable database 28 and the backup database 29. That is, dirty data flushed to the stable database 28 is also flushed to the backup database 29 such that the next flush of dirty data does not commence until the prior "linked" flush to both the stable database 28 and the backup database 29 has completed. However, copying from the stable database 28 to the backup database 29 via the database cache is unrealistic for page-oriented operations because of the performance impact. Pursuing this for logical log operations, where "linked" flushes are required, is even less realistic.
Efficiently creating an on-line backup involves an "asynchronous" copy process that does not go through the cache manager. But the task of keeping the backup database 29 recoverable so that media failure recovery is possible is the same task as with crash recovery and the stable database 28. Flushing is preferably restrained so that the flush dependencies that are enforced for the stable database 28 are also enforced for the backup database 29. Unfortunately, when an asynchronous on-line backup is in progress, flushing objects to the stable database 28 in the order required for crash recovery does not ensure that objects are flushed in the correct order to the backup database 29 for media recovery.
Thus, the present invention is directed to on-line database backup that copies data at high speed from the active stable database to the backup database while update activity continues. The inventor has developed an improved backup technique that correctly protects the database from media failures, while permitting the logging of logical operations. This technique provides a high speed, on-line backup. This ensures that the results of logged logical operations are flushed to the backup database in the correct order, and hence the backup database remains recoverable, without tight coupling with the cache manager.
This invention concerns a database computer system and method for backup when general logical operations are logged. Data is copied from the active stable database to a backup database while update activity continues. The stable database can be divided into disjoint partitions, and backup progress can be independently tracked in each partition. For each partition in which backup progress is to be tracked independently, two boundary values are maintained that separate objects into categories of pending, backed up, or in doubt. To permit backup to proceed with little synchronization between it and the cache manager, backup reports its progress only from time to time. Depending on system tuning considerations, the reporting can be made more or less precise by varying the granularity of the steps in which backup progress is reported.
According to one implementation, in a database computer system having a non-volatile memory including a stable log and a stable database comprising a plurality of objects, a cache manager for flushing objects to the stable database, and a backup database, a computer-implemented method comprises the following steps: (a) copying objects from the "in doubt" region of the stable database to the backup database, this region of the stable database being bounded by a first boundary value and a second boundary value; (b) adjusting the first boundary value and adjusting the second boundary value to define a new "in doubt" region of the stable database; and (c) continuing to copy objects from this new region of the stable database to the backup database. Preferably, steps (b) and (c) are performed until all of the objects of the stable database have been copied to the backup database. Objects bounded by the first boundary value have been "backed up". Objects bounded by the second boundary value are "pending".
Aspects of this invention include adjusting the first boundary value by setting the first boundary value equal to the second boundary value and adjusting the second boundary value by increasing the second boundary value by a predetermined increment.
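The boundary-based progress tracking can be sketched as follows. This is only an illustration under stated assumptions: objects are identified by integer positions copied in ascending order, positions below the first boundary are treated as backed up, positions at or above the second boundary are treated as pending, and the names `BackupProgress`, `classify`, `advance`, and `run_backup` are hypothetical.

```python
class BackupProgress:
    """Tracks backup progress in one partition with two boundary values."""

    def __init__(self, size, step):
        self.size = size                  # number of objects in the partition
        self.step = step                  # predetermined increment (must be > 0)
        self.first = 0                    # first boundary value
        self.second = min(step, size)     # second boundary value

    def classify(self, pos):
        """Return the category of the object at position pos."""
        if pos < self.first:
            return "backed up"
        if pos >= self.second:
            return "pending"
        return "in doubt"

    def advance(self):
        """Set the first boundary to the second, then increase the second."""
        self.first = self.second
        self.second = min(self.second + self.step, self.size)


def run_backup(stable, step):
    """Copy the in-doubt region, adjust the boundaries, and repeat."""
    backup = [None] * len(stable)
    progress = BackupProgress(len(stable), step)
    while progress.first < len(stable):
        for pos in range(progress.first, progress.second):
            backup[pos] = stable[pos]
        progress.advance()
    return backup
```

A larger `step` means backup reports progress less often, at the cost of a larger in-doubt region; a smaller `step` gives more precise tracking with more frequent boundary adjustments.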
According to other aspects of the present invention, each object is associated with a value, and the objects are copied in ascending order of value.
According to other aspects of the present invention, a backup latch is provided to prevent the boundary values from being altered while the cache manager is flushing pages.
According to other aspects of the present invention, the backup is synchronized with the cache manager at a predetermined time, such as when the values of the boundaries are being adjusted.
According to other aspects of the present invention, writing the value of an object to the log can avoid the need to include the object's value in the backup.
According to other aspects of the present invention, some of the extra writing of objects to the log can be avoided by the tracking of backup progress as described above.
According to other aspects of the present invention, some of the extra writing of objects to the log can be avoided by restricting the logical log operations to a form called "tree operations".
The foregoing and other aspects of the present invention will become apparent from the following detailed description of the invention when considered in conjunction with the accompanying drawings.