It is well known that computerized database systems have gained wide acceptance in numerous applications. The data accumulated therein often represents vast amounts of expense and effort and is extremely valuable if not vital to the user, whereby data loss can be quite serious and costly.
Accordingly, in addition to the more conventional function of database systems in storing and manipulating data, they must also provide an additional important function of data recovery in the event of system crashes wherein normal processing ceases. One difficulty in providing recovery mechanisms was the requirement that the database be restored to a consistent state. A classic illustration of the problem of inconsistency occurs in the case of banking applications for example. A debit to a customer's account record on disk may be made followed by a crash of the bank's database system before a credit is made to a correlative account. The credit action might even have been completed in the sense of being entered in main or RAM buffer memory but simply not written to disk yet. The image of the database out on the permanent storage disk was thereby said to be in an inconsistent state.
To solve the problem, the notion of transaction boundaries developed in the art which bounded discrete sets of database actions whereby all actions such as the debit and credit actions in the example between boundaries if completed would always leave the database in a consistent state. In other words, by transaction control in a database with respect to a sequence of updates to the database either all of the updates of the sequence would be completed or none. In the event of a problem and need to recover the database after a crash before a transaction was completed operations on the database could be backed out to these transaction boundaries.
In addition to the consistency problem addressed by the concepts of transaction boundaries and commits, yet another difficulty associated with database recovery related to the problem of storage of prior data images in the event the database had to be restored to those images at recovery. In the simple previous example, this might mean retaining the original state of the two accounts in the event that the partial or in-flight transaction of only the debit action occurred. The transaction could then be backed out so that the database was left in the original consistent state prior to the incomplete transaction changing the database.
One technique which developed in the art for retaining prior image data for recovery purposes was known as shadowing or shadow paging. Early database systems employing this technique included System R, and the commercial database product SQL DS developed by the IBM Corporation, Armonk, N.Y. In this technique copies of historical data pages were retained. At transaction commits, a new copy of pages from the historical copy was made including the changes which became the new committed copy and the previous copy was deleted. Each time additional database changes occurred, the present committed copy was copied with the changes into a new committed copy of the data pages and the prior correlative data page deleted. A performance benefit of this technique occurred at commit times because changing from the old to the new data pages involved merely changing a pointer. There was no necessity to redo database actions as in the case of the later technique of write-ahead logging to be next described.
Nevertheless numerous well known disadvantages of the shadowing techniques became evident giving rise to development of algorithms supporting the write-ahead logging. These disadvantages included extra RAM and disk space and associated overhead for maintaining the shadow copies of data because in updating every page in the database second copies were required. Additional drawbacks included costly checkpoints, disturbing the physical clustering of data resulting in data fragmentation, inefficient maps, and extra I/O for page map blocks.
In summary it was operationally found that the penalties of shadow paging were too great and thus write-ahead logging was developed and thought to be a better solution to the recovery problem. An early example of this technique may be found in the aforementioned SQL/DS database system. In this technique instead of retaining an entire second copy of data, a linear record or journal is simply retained of what was done before and after a database action. Any action represented in RAM buffers even with respect to a completed transaction will not necessarily have been written out on disk. Accordingly, such work will be lost upon system crash and must be redone on recovery from the recovery log unlike in the case of the aforementioned shadow paging technique.
Under the write-ahead logging protocol in general, logged records corresponding to the changes must first be written to disk before any correlative data pages with the changes are written to the data file. One important aspect of write-ahead logging relates to checkpointing whereby points in the recovery log at which recovery should begin in the event of a system crash are periodically determined and written out. One factor upon which efficient checkpointing depends is the internal structure of the database and, more particularly the granularity of locking. In multiple user databases, a well known problem arises when more than one user is seeking to access the same data. Returning to the previous example, while a first user is reading the aforementioned debit account a second user may be changing the credit account. While the latter user's transaction is in-flight, the first user may then access the same credit account yielding inconsistent results.
A solution to the problem has been to restrict access to a portion of the database to one user, such restriction being referred to as a lock and the size of the portion of the database restricted being directly related to the "granularity" of the lock. In the previously mentioned SQL/DS database, for example, granularity of locking was at the physical data page level which encompassed many records. The significance of such large granularity in terms of database recovery was that it simplified the problems of recovery in checkpointing. The ability to commence recovery at an optimum checkpoint was a simpler problem than in systems of sub-page granularity wherein for example locking on a record level base is effected.
To illustrate why larger granularity simplifies recovery in checkpointing problems, in the case of record locking the physical page moving data in from disk and out from a RAM buffer may contain updates from more than one transaction. Moreover, of these transactions one may be committed, one may be aborting, and one may be stopped at the point of a system crash. At that point, in a database system such as DB2 with larger granularity locking, the page would at most be affected by one transaction. Consequently it would be relatively easy to deal with recovery related to that page. However, with respect to sub-page level granularity, the same page due to concurrent updates facilitated by such sub-page locking, might have multiple transactions running on it making recovery more difficult because all related database changes must then be made in the proper sequence.
Notwithstanding the simpler recovery algorithms associated with locking granularity at the data page level wherein for example it is relatively easy to keep track of whether a data page has been paged out and an update needed, developments and interest increased in provision for sub-page granularity locking and resultant recovery for multiple transactions within a data page. One reason for the desire for such concurrency was that page level locking related to a physical size of the page. However, databases operate in terms of logical objects or entities rather than arbitrary physical limitations, and thus locking at a table or record level on logical data objects was highly desired, i.e., sub-page level granularity such as on a record level. Data pages may exist in the database for purposes of I/O however the locks themselves should preferably functionally not be so limited. Thus, it was desired to provide for multiple transactions with locks on records within a single page for concurrency purposes in the context of the write-ahead logging technique for recovery.
Accordingly, techniques were developed for sub-page granularity locking and recovery techniques for multiple transactions within a data page. In one technique, write-ahead logging was provided with locking granularity finer than the size of the data page (or unit used to bring data in or write data out within the database). During normal database operations, the technique would write out or "log" detailed information about the state of the buffer pool. More particularly, status of the pages in the RAM buffer pool (i.e., which pages were "dirty", actions which dirtied them, with associated log sequence numbers or LSNs, etc.) would be periodically written out to a recovery log and put in a record containing this check point information. In addition to this data necessary for recovery, log records of the changes to the actual data in the database were also written out to the recovery log. The entire log was thus a sequence of records comprised both of checkpoint information as well as the actual data updates to the database.
At recovery time, an analysis pass would occur. During this phase, a forward pass would be made to the log, wherein all of the log records containing checkpoint information which were previously written to the log would be read out and analyzed. The optimal checkpoint at which to begin recovery would then be calculated from this information during the analysis pass.
More particularly, from the numerous checkpoint records two LSNs would be determined: an LSN related to a first update to the earliest of "dirty" pages in the buffer; and an LSN corresponding to a first update to the earliest transaction still in flight or "uncommitted". The optimal recovery point would correspond to the minimum of these LSNs. These LSNs calculated during the analysis pass will hereinafter be understood to correspond to MINBUFLSN and LOWTRANLSN of the present invention which are periodically determined and stored during normal forward processing in accordance therewith. As used herein "LSN" will refer to "log sequence number"; "LOWTRANLSN" will refer to "low transaction log sequence number"; and "MINBUFLSN" will refer to "minimum buffer log sequence number."
One problem with such prior techniques was that the entire log of the buffer pool had to be scanned (i.e., an analysis pass was required) to read all the log records just in order to calculate these two LSN values from which the optimal recovery point could be determined. This is to be contrasted with the present invention wherein the MINBUFLSN and LOWTRANLSN values are periodically determined and written to the log record as aforesaid, thereby obviating the need for an analysis pass.
Although in the prior art actual instantaneous values for MINBUFLSN and LOWTRANLSN might be more current than those last written to the log in accordance with the invention thereby yielding a more optimal recovery point, the latter or nevertheless immediately available upon recovery without the necessity for the analysis pass.
Accordingly, with the foregoing in mind, it is readily apparent that a novel technique was desired for soft checkpointing. Such a technique was desired which avoided hereinbefore noted disadvantages of shadowing while at the same time providing for sub-page concurrency and recovery from multiple page transactions within a data page. Such systems and methods were further desired for database recovery which were both quick and efficient particularly in terms of overhead required during run time to support the technique, and which avoided requiring an analysis phase on recovery.