The present invention relates to computer systems and, more specifically, to a method for controlling downtime during the recovery of database systems.
Most data processing systems include both volatile and nonvolatile memory devices. In general, volatile memory devices, such as random access memory, provide faster access times than nonvolatile memory devices, such as magnetic or optical disks. However, nonvolatile memory is generally less expensive and less susceptible to data loss.
To take advantage of the persistent nature of nonvolatile memory, an object, such as a data item in a database system, is typically stored on nonvolatile memory (i.e. database) until the object is required by a process. To take advantage of the speed of volatile memory, a copy of the object is loaded into volatile memory when the object is required by a process. Once the object is loaded into volatile memory, the process can quickly access and make changes to the copy of the object. At some later point in time, the copy of the updated object is written back to the database in order to reflect the changes that were made by the process.
For example, in a database system, a section of volatile memory known as a buffer cache is generally used by the processes for accessing and manipulating information contained within the database. In order for a process to access or change data that is stored in the database, a copy of the data is first loaded from the database into the buffer cache. After the data is loaded in the buffer cache, the process can then quickly access and manipulate the copied data version. At some later point in time, the contents of the buffer cache are written back to the database in order to reflect any changes that were previously made to the copied data version.
Typically, the buffer cache includes multiple buffers that are shared among one or more processes that are executing on a database server. When a process executes a transaction that requires a change to an item within a data block, a copy of the data item is loaded into a buffer in the buffer cache. Any changes are then made to the data within the buffer.
Because of the nature of volatile memory, various types of failures can cause the information contained within the buffers to be lost. If the volatile memory contains updated copies of data items, the changes may be lost if a failure occurs before the changes are written back into the database. In many applications, such loss of information is unacceptable.
Therefore, recovery techniques have been developed to reduce the possible loss of information due to failures within a database system. According to one approach, data is made “recoverable” whenever it becomes critical for the data to survive a failure. Data is considered to be “recoverable” when enough information to reconstruct the data after a failure is stored in nonvolatile memory. For example, in database systems it is considered critical that the changes made by a particular committed transaction be reflected within the database and the changes made by a particular aborted transaction be removed from the database.
One method of making the updated data recoverable is to write redo records into a redo log file in nonvolatile memory. The redo records contain a description of the changes that were made by a particular transaction (“change information”) that will enable a recovery process to reapply the changes in the event of a failure.
Specifically, whenever a transaction executes, space is allocated for redo records in both volatile and nonvolatile memory. The redo records are used to store change information about updates that a transaction makes to a particular buffer in the buffer cache. The change information is stored in the redo records in volatile memory and then later copied to nonvolatile memory.
In creating the redo records, a version identifier is associated with each redo record. The version identifier indicates the version of a particular data item associated with the update information contained in a redo record. After the redo record is copied into the redo log file, the version identifier is used in determining whether the data item in the database reflects the changes recorded in the redo record. In addition to the version identifier, each redo record in nonvolatile memory is associated with a byte offset that indicates where the particular redo record is located within the redo log file.
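By way of illustration only, the bookkeeping described above can be sketched in Python as follows. The sketch models a redo record that carries a version identifier and a redo log file that reports the byte offset at which each record is stored. The names RedoRecord and RedoLogFile, the block_id field, and the fixed-width header layout are assumptions made for this example and are not taken from the described system.

```python
import struct
from dataclasses import dataclass

# Hypothetical record header: block id, version identifier, payload length.
_HEADER = struct.Struct("<IQI")


@dataclass(frozen=True)
class RedoRecord:
    block_id: int   # data block that the change applies to (assumed field)
    version: int    # version identifier of the data item after the change
    change: bytes   # change information

    def encode(self) -> bytes:
        return _HEADER.pack(self.block_id, self.version, len(self.change)) + self.change


class RedoLogFile:
    """In-memory stand-in for a redo log file held in nonvolatile memory."""

    def __init__(self) -> None:
        self._data = bytearray()

    def append(self, record: RedoRecord) -> int:
        """Append a record and return the byte offset at which it begins."""
        offset = len(self._data)
        self._data += record.encode()
        return offset

    def read_at(self, offset: int) -> RedoRecord:
        """Decode the record that begins at the given byte offset."""
        block_id, version, length = _HEADER.unpack_from(self._data, offset)
        start = offset + _HEADER.size
        return RedoRecord(block_id, version, bytes(self._data[start:start + length]))

    @property
    def end_offset(self) -> int:
        """Byte offset at which the next record would be allocated."""
        return len(self._data)
```

In this sketch the byte offset returned by append is what later allows a position in the log to be named without scanning the file from its beginning.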
For example, FIG. 1 illustrates a redo-based recovery mechanism that can be used to perform changes that are recorded in a redo log file 118 in the event of a failure in the database system. As depicted in FIG. 1, database 128 and redo log file 118 reside within the nonvolatile memory 101 of database system 100. Conversely, buffer cache 102 and redo log buffer 112 reside within the volatile memory 103 of database system 100. Buffer cache 102 contains buffers 104, 106, 108, and 110 which respectively contain data loaded into volatile memory 103 from data items 142, 134, 130 and 138 within database 128. For the purposes of explanation, it shall be assumed that data items 142, 134, 130 and 138 are respectively data blocks A, B, C and D from the database 128.
Contained within redo log buffer 112 are redo records 114 and 116 which describe the changes made to data item 108 by a transaction (TX3). By the time transaction TX3 commits, the information that is contained in redo records 114 and 116 is stored in redo log file 118 as redo records 124 and 120 respectively. The version identifier associated with each redo record is copied into the redo log file and is used in determining whether the associated data item in the database reflects the changes that are recorded in the particular redo record.
If a database failure occurs, all information contained in volatile memory 103 may be lost. Such information may include buffers within buffer cache 102 that contain data items that have been updated by transactions, but that had not yet been saved to nonvolatile memory 101. As mentioned above, it is essential for the committed updates made by all such transactions to be reflected in the persistently-stored data items within the database 128.
To ensure that updates made by transactions are reflected in the database 128 after a failure, redo records in the redo log file 118 are sequentially processed after a failure. A redo record is processed by reading the redo record from the redo log file 118 and then retrieving the data item identified in the redo record. The process performing the recovery (the “recovery process”) then determines if the change specified in the redo record is already reflected in the copy of the data item that is stored in the database 128. If the change is not reflected in the data item, then the change is applied to the data item. Otherwise, the change is not applied to the data item and the next redo record in the redo log file 118 is processed.
In a conventional redo-based approach to recovery, the recovery process determines whether the change identified in a redo record has already been applied to a data item by reading a version identifier from the data item and comparing the version identifier from the data item to the version identifier stored in the redo record. In a typical database system, determining whether a change identified in a particular redo record has already been applied to a data item requires the overhead of reading a data block that contains the data item into volatile memory and then comparing the version identifier associated with the data item to the version identifier stored in the redo record. If the version identifier stored in the redo record is newer than the version identifier associated with the data item, then the buffer that contained the updated data item had not been written from the buffer cache 102 back to the database 128 prior to the failure. Therefore, the change must be applied to the on-disk copy of the data item that is stored in the database 128. On the other hand, if the version identifier associated with the data item is at least as recent as the version identifier stored in the redo record, then the change does not need to be reapplied.
For example, assume that a failure occurs and all of the information stored in volatile memory 103 is lost. To determine whether the change in redo record 124 has already been applied to data item 130, data block C must first be read into volatile memory to obtain the data item 130 and version identifier 132. The version identifier 132 (“99”) is then compared with the version identifier associated with redo record 124 (“100”). If the version identifier associated with redo record 124 is newer than the version identifier 132 of data item 130, then the changes associated with redo record 124 had not been written back into data item 130 in data block C prior to the failure. On the other hand, if version identifier 132 of data item 130 in data block C is at least as recent as the version identifier associated with redo record 124, then the changes associated with redo record 124 had been written back into data item 130 in data block C prior to the failure.
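The version comparison described above can be sketched, under simplifying assumptions, as the following Python. The database is modeled as a dictionary keyed by data block name, and a change is modeled as a replacement value. The versions 99 and 100 for data block C follow the example above, while the values shown for data block A and all class and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class DataItem:
    value: str
    version: int   # version identifier stored with the on-disk data item


@dataclass(frozen=True)
class RedoRecord:
    block: str     # data block that holds the changed data item
    version: int   # version identifier recorded with the change
    new_value: str


def apply_redo(database: dict, redo_log: list) -> int:
    """Process redo records in order and return how many changes were applied."""
    applied = 0
    for record in redo_log:
        # Models the overhead of reading the data block into volatile memory.
        item = database[record.block]
        if record.version > item.version:
            # The redo record is newer, so the change never reached the database.
            item.value = record.new_value
            item.version = record.version
            applied += 1
        # Otherwise the on-disk item is at least as recent and is left unchanged.
    return applied


database = {"A": DataItem("current", 57), "C": DataItem("old", 99)}
redo_log = [RedoRecord("A", 55, "stale"), RedoRecord("C", 100, "new")]
applied = apply_redo(database, redo_log)
```

Note that the block for data item A is still read even though its redo record turns out to be unnecessary, which is the overhead that the remainder of this background is concerned with.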
Although the redo log file 118 provides for the recovery of changes made by transactions that have not been applied to the database prior to a failure, it is inefficient to process all of the redo records of redo log file 118 when a high percentage of those records are for changes that have already been stored in the database 128. In addition, because the redo log file is continually growing, a recovery operation can become quite time consuming.
For example, in FIG. 1, upon a failure in database system 100, data item 142 and redo record 156 must be read into volatile memory 103 in order to compare version identifier 144 with the version identifier associated with redo record 156. The process of reading data item 142 and redo record 156 into memory creates unnecessary overhead, since version identifier 144 is newer than (i.e. greater than) the version identifier associated with redo record 156, and therefore the change recorded in redo record 156 is already reflected in database 128.
In order to reduce the number of data blocks and redo records that are unnecessarily read into memory during a recovery operation, a checkpoint operation may be periodically executed. During a checkpoint operation, all “dirty” buffers that are currently stored in the buffer cache 102 are written into the database 128. A “dirty” buffer is defined as a buffer in the buffer cache 102 that contains data that has been modified by a transaction but has not yet been written back to the database 128. After a checkpoint operation is performed, all changes identified in redo records that were contained in the redo log file 118 prior to when the checkpoint operation was initiated will be reflected in the database 128. Therefore, those records will not have to be processed after a failure.
To indicate which redo records in the redo log file 118 do not have to be processed after a failure, a “checkpoint value” is stored in nonvolatile memory 101. The checkpoint value indicates the boundary within redo log file 118 between redo records that must be processed after a failure and redo records that do not have to be processed after a failure. The checkpoint value may be, for example, a byte offset from the beginning of the redo log file 118, where all redo records that are stored in the redo log file before the location identified by the checkpoint value are guaranteed to be reflected in the database.
For example, as illustrated in FIG. 1, in executing the checkpoint operation on database system 100, a checkpoint process begins by storing a byte offset (i.e. the end of redo record 120) which represents where the next redo record is to be allocated in redo log file 118. The checkpoint process then marks as needing checkpointing all buffers in buffer cache 102 that contain changes since being loaded from database 128. After marking the appropriate buffers, the checkpoint process then writes the marked buffers within buffer cache 102 back to the database 128. After the dirty buffers are successfully written back to the database, the checkpoint 158 is set equal to the previously stored byte offset (i.e. end of redo record 120). Redo record 160 represents the beginning of the redo records that were stored in the redo log file 118 after the checkpoint operation began.
In the event of a subsequent failure, the recovery process can begin processing with redo record 160 (i.e. the record that follows checkpoint 158). The redo records that precede the checkpoint 158 (i.e. redo records 120, 124, 148, 152 and 156) may be ignored because the changes reflected therein have previously been written to database 128.
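The checkpoint sequence described above, and its effect on where recovery begins, can be sketched in Python as follows. This is a simplified single-threaded model offered for illustration only. It assumes that the checkpoint value is a record index rather than a byte offset, that each redo record is written to the log at the time of the update, and that recovery simply reapplies records in order. The class and method names are hypothetical.

```python
class CheckpointingSystem:
    """Toy model of a database with a redo log and a conventional checkpoint."""

    def __init__(self) -> None:
        # Nonvolatile state (survives a failure).
        self.database = {}     # block name -> value
        self.redo_log = []     # list of (block name, value) redo records
        self.checkpoint = 0    # index of the first record recovery must process
        # Volatile state (lost on failure).
        self.buffer_cache = {}
        self.dirty = set()

    def update(self, block: str, value: int) -> None:
        """Record the change in the redo log, then modify the buffer."""
        self.redo_log.append((block, value))
        self.buffer_cache[block] = value
        self.dirty.add(block)

    def run_checkpoint(self) -> None:
        # 1. Remember where the next redo record would be allocated.
        pending = len(self.redo_log)
        # 2. Mark the buffers that are dirty at this moment.
        marked = set(self.dirty)
        # 3. Write the marked buffers back to the database.
        for block in marked:
            self.database[block] = self.buffer_cache[block]
            self.dirty.discard(block)
        # 4. Only after the writes succeed, advance the checkpoint value.
        self.checkpoint = pending

    def fail(self) -> None:
        """Simulate a failure that loses all volatile memory."""
        self.buffer_cache.clear()
        self.dirty.clear()

    def recover(self) -> int:
        """Reapply records at or after the checkpoint; return how many were processed."""
        records = self.redo_log[self.checkpoint:]
        for block, value in records:
            self.database[block] = value
        return len(records)
```

If two updates are followed by a checkpoint and then a third update and a failure, recover in this model processes only the third record, since the first two precede the checkpoint value.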
Because redo log files can potentially contain an extremely large number of redo records, performing checkpoint operations on the redo log file 118 can significantly reduce recovery time as the recovery process is no longer required to begin the recovery phase with the first redo record in the redo log file 118. Instead, the recovery process can begin the recovery phase at the latest checkpoint value. Thus, if a database system failure occurs, only those data blocks for which redo records were generated in the redo log file 118 after the checkpoint operation began will be required to be read into memory during recovery.
Because a checkpoint operation is commonly performed on a periodic basis, a “latency period” typically exists between the time a checkpoint operation completes and the next checkpoint operation begins. During this latency period, a significant number of redo records will typically be written into the redo log file 118 after the checkpoint value. These redo records correspond to changes that were made to “new” dirty buffers that were read in from data blocks in database 128 and modified after the checkpoint operation completed. Thus, if a database system failure occurs, a large number of data blocks will typically have to be read during recovery in order to process these redo records. Therefore, even when a checkpoint process is used to reduce recovery time, there is no guarantee of, or limit on, the actual number of data blocks that will need to be accessed after a database system failure. Also, if a failure occurs prior to the completion of the checkpoint process, the previously stored checkpoint value must be used, which will require an even greater number of data blocks to be read from database 128 during recovery.
In addition, because the buffer cache 102 can contain a large number of dirty buffers, in certain systems, a significant amount of time can elapse between when the checkpoint operation begins and when the checkpoint operation completes. Therefore, by the time a checkpoint operation completes, a large number of redo records may have been written into the redo log file 118 after the checkpoint value. Again, these redo records correspond to changes that were made to “new” dirty buffers that were read in from data blocks in database 128 and modified after the checkpoint operation began.
In certain cases, requiring a large number of data blocks to be accessed during recovery can result in unacceptably long downtimes. In addition, because the number of data blocks that need to be accessed during recovery can significantly vary, it is difficult, if not impossible, for a database administrator to predict the amount of downtime that will be required to recover after a database failure.
Therefore, based on the foregoing, it is highly desirable to provide a mechanism which can control the amount of downtime that is the result of a database failure.
A method and system are provided for reducing the overhead associated with recovering after a failure. According to the method, a checkpoint value is maintained that indicates which records of a plurality of records have to be processed after the failure. The plurality of records contain change information that corresponds to a plurality of data blocks. A target checkpoint value is determined based on a desired number of data block reads that will be required during a redo phase of recovery. Changes contained in volatile memory are then written to nonvolatile memory to advance the checkpoint value to at least the target checkpoint value.
According to another aspect of the invention, the record associated with the checkpoint value is identified. If a particular record is determined to have been stored in nonvolatile memory before the record associated with the checkpoint value, then the particular record is not processed. However, if it is determined that the particular record was not stored to nonvolatile memory before the record associated with the checkpoint value, then the particular record is processed.
According to another aspect of the invention, the target checkpoint value is determined using a circular queue of offset buckets. The offset buckets are used to store index values that are associated with buffers in an ordered list. The target checkpoint value is periodically set equal to an index value that is contained in an offset bucket.
According to another aspect of the invention, the target checkpoint value is determined by calculating a maximum number of records that should be processed after the failure. The maximum number of records is based on the desired number of data block reads that will be required during the redo phase of the recovery. The target checkpoint value is updated based on the maximum number of records.
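One possible reading of this aspect can be sketched in Python as follows. The summary does not state how the maximum number of records is derived from the desired number of data block reads, so the sketch assumes a caller-supplied estimate of redo records per data block read. It also assumes that log positions are measured in record counts and that the target never moves behind the current checkpoint value. The function name and every parameter are hypothetical.

```python
def target_checkpoint(
    log_end: int,
    current_checkpoint: int,
    desired_block_reads: int,
    records_per_block_read: float,
) -> int:
    """Return a target checkpoint position, measured in redo records.

    Assumption: each data block read during the redo phase corresponds, on
    average, to records_per_block_read redo records, so the number of
    records left after the checkpoint bounds the number of block reads.
    """
    if desired_block_reads < 0 or records_per_block_read <= 0:
        raise ValueError("bounds must be non-negative and the ratio positive")
    # Maximum number of records that should be processed after a failure.
    max_records = int(desired_block_reads * records_per_block_read)
    # Keep at most max_records after the target, and never move backwards.
    return max(current_checkpoint, log_end - max_records)
```

For instance, with 10,000 records in the log, a current checkpoint at record 2,000, a desired bound of 500 block reads, and an assumed ratio of 4 records per block read, the sketch yields a target of 8,000.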
According to another aspect of the invention, to advance the checkpoint value, all buffers in a plurality of ordered lists that have an index value less than the target checkpoint value are removed. The smallest index value among the buffers located at the heads of the plurality of ordered lists is then written to nonvolatile memory as the checkpoint value.
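A minimal sketch of this aspect, offered for illustration only, is given below in Python. It assumes that each ordered list is sorted by ascending index value, that a buffer is removed from its list by writing it to nonvolatile memory, and that the end of the log is used as the checkpoint value when every list is empty. The function name, the (index value, block) tuple layout, and the write_buffer callback are hypothetical.

```python
from collections import deque


def advance_checkpoint(ordered_lists, target, log_end, write_buffer):
    """Write out buffers below the target and return the new checkpoint value.

    ordered_lists: deques of (index_value, block) pairs, ascending by index.
    target:        target checkpoint value.
    log_end:       value used when no dirty buffers remain (an assumption).
    write_buffer:  callable that writes one block to nonvolatile memory.
    """
    for queue in ordered_lists:
        # Buffers are ordered, so only the head of each list needs checking.
        while queue and queue[0][0] < target:
            _index_value, block = queue.popleft()
            write_buffer(block)
    # The new checkpoint is the smallest index value at the head of any list.
    heads = [queue[0][0] for queue in ordered_lists if queue]
    return min(heads) if heads else log_end
```

Because the checkpoint value is taken from the remaining list heads, it can land at or beyond the target, which is consistent with advancing the checkpoint value to at least the target checkpoint value.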