The present invention relates generally to data recovery in a database management system after an abnormal system termination and, more specifically, to a database management system recovery method and apparatus that does not use data recovery logging.
Databases store data in a variety of manners depending on the internal organization. For example, a relational database system, typically stores data in tables. The tables are comprised of rows, each of which contains a record. The record, in turn, contains fields and the fields contain the actual related data values for a data xe2x80x9cobject.xe2x80x9d Each table may also be associated with one or more indexes, which provide rapid access to the rows in an order determined by the index and based on key data values contained in selected fields in each row. As an example, a row might be associated with each employee of an organization and contain fields that hold such information as the employee name, an identification number, and telephone numbers. One index might order the rows numerically by employee identification number, while another index might order the rows alphabetically by employee name.
Such a database conventionally includes methods which insert and delete rows and update the information in a row. When changes are made to the rows, any database indexes associated with the table may also need to be updated in order to keep the indexes synchronized with the tables. The rows in each table are mapped to a plurality of physical pages on the disk to simplify data manipulation. Such an arrangement is illustrated in FIG. 1.
In FIG. 1, table 100, which illustratively consists of rows 112, 114, 116, and 118, is mapped to a chain of pages which pages 120, 138, and 132 are shown. In the table illustrated, each row consists of five separate fields. For example, row 112 consists of fields 102, 104, 106, 108 and 110. The fields in each of rows 112, 114, 116 and 118 are mapped illustratively to page 138 which can contain data for more than one row. For example, field 102 maps to location 126 in page 138. Fields 104 maps to location 128. Field 106 maps to location 130. In a similar manner field 108 maps to location 124 and field 110 maps to location 134. The fields in the next row 114 are mapped directly after the fields in row 112. For example, field 111 is illustrated which maps to page location 136. When the page is completely filled with data, field information is mapped to the next page in the page chain. The pages are chained together by means of page pointers. For example, page pointer 122 links pages 120 and 138, whereas page pointer 140 links pages 138 and 132. All of the pages used to store the data in table 100 are linked together in a similar manner in a page chain.
The data pages are normally kept in a page buffer pool located in system memory. In order to make such a database system persistent or xe2x80x9cdurablexe2x80x9d, the data pages must be written to an underlying non-volatile storage system, such as a disk storage. This storage operation takes place on a page level so that when a modification is made to data on a page the entire page is stored in the persistent storage. Each page could be copied to the persistent storage as soon as data on the page was modified. However, this immediate copying greatly slows the system operation since persistent storage is generally much slower than RAM memory. Alternatively, the information in modified pages in the buffer pool can be copied or xe2x80x9cflushedxe2x80x9d to the disk storage at intervals. For example, the information could be flushed periodically or when the number of changed pages in the buffer pool reaches some predetermined threshold. During this disk flushing operation, the data modifications are performed xe2x80x9cin placexe2x80x9d so that the old data is either overwritten or deleted from the disk and lost.
Since the data is lost during the modification process, in order to ensure data integrity in the case of a system failure, or crash, the actions performed on the database are grouped into a series of xe2x80x9ctransactionsxe2x80x9d. Each transaction is xe2x80x9catomicxe2x80x9d which means that either all actions in the transaction are performed or none are performed. The atomic property of a transaction ensures that the transaction can be aborted or xe2x80x9crolled backxe2x80x9d so that all of the actions which constitute the transaction can be undone. Database transactions commonly have a xe2x80x9ccommitxe2x80x9d point at which time it can be guaranteed that all actions which comprise the transaction will complete properly. If the transaction does not reach the commit point, then it will be rolled back so that the system can return to its state prior to the initiation of the transaction. Consequently, if there is a system termination or crash prior to the commit point, the entire transaction can be rolled back.
The use of a buffer pool complicates transaction processing because even though a transaction has committed, system operation could terminate after a page has been modified, but before the modified page is flushed to disk. In order to prevent data loss caused by such a system interruption, a logging system is used to permit data recovery. The logging system records redo and undo information for each data modification in a special file called a xe2x80x9crecovery logxe2x80x9d that is kept in non-volatile storage.
The recovery log consists of an ordered list of redo/undo actions and contains information such as a transaction ID, a page ID, an offset length and the old and new data constituting the update. Additional control information is often included to facilitate the logging operation. This control data includes a transaction table which includes one record per active transaction that contains the transaction state (for example, running, committed or aborted.) The control information also includes a dirty page table which contains one entry for each page in the buffer pool which has been modified.
In order to ensure both atomicity and persistence for each transaction, a xe2x80x9cwrite aheadxe2x80x9d logging protocol is used. According to this protocol a log record is written for an update before the corresponding modified data page is written to disk. In addition, all log records are written for a transaction before the transaction is committed.
In addition to the recovery logging of data update information, recovery logging is also performed during storage space management procedures that involve allocation and deallocation of data pages for each database row. For example, a set of space-map pages are commonly maintained by the database system to handle storage space allocation and deallocation for each table. In general, each space-map page manages space allocation for a range of data pages and contains status information which indicates whether a particular data page on disk storage has been used. When a new data row is inserted into a table, the space-map pages associated with that table are examined and updated to allocate space for the row, if necessary. A recovery log entry is written for each change made to the space-map pages. These recovery logs can be used to free the allocated space if a transaction roll back occurs before the transaction has been committed.
The recovery logs are used to restart processing if system operation is abnormally terminated, for example, due to a power failure. In a recovery operation, redo information in the recovery log is used to reconstruct all transactions at the time of the failure. The undo information is used to undo transactions that did not commit prior to the termination.
The conventional database system is somewhat complicated in a distributed database system such as shown in FIG. 2 which illustrates, in schematic form, an example of such a distributed database system. The system consists of four database management systems 200, 202, 204 and 206. Each of these database management systems 200-206 contains a replicated copy of data that is in use throughout the system. Each of the database management systems communicates with and is accessed by a plurality of users. For example, database management system 200 may interact with a plurality of users, of which three users 208-212, are shown. Similarly, database management system 204 interacts with users 214-218. Database management system 202 interacts with users 220 and 224 and database management system 206 services users 226-230, Other configurations and numbers of users are within the spirit and scope of the invention.
In such a system, copies of the data maybe maintained on the databases in order to speed up access of the database by the users. In order to ensure data integrity in such a system, it is necessary to ensure that a transaction which affects data that has been replicated on several systems commits in each of the systems. For example, a transaction that affects data stored in system 200 must also be communicated to and commit on systems 202-206. A two-phase commit procedure is used to ensure that the transaction completes in each system. Unlike non-distributed databases where a single change is, or is not, made locally, all participating databases in the distributed database system must all commit, or all roll back, a transaction if there is a system or network failure at a single database. The two-phase commit procedure has two distinct processes that are accomplished in order. The first process occurs in a so called xe2x80x9cpreparexe2x80x9d phase where a global coordinator or initiating database requests that all distributed databases agree to commit or roll back a transaction. In most distributed database systems, depending on the transaction, any database could serve as the global coordinator. If all participants cannot prepare the transaction for commitment, the global coordinator will instruct all databases to roll back the transaction.
The second portion of the two-phase commit procedure is the actual commit phase which occurs after all participants have responded to the coordinator that they are prepared to commit a transaction. In the commit phase, the coordinator requests that all databases commit the transaction. If there is a machine, network or software failure that occurs during the two-phase commit process, the transaction tables in each database are used by the recovery process to resolve the status of any transaction.
The aforementioned recovery system works well with most database systems. However, for a table that is used to store large data objects, for example, images, large text documents, digitized video or digitized audio data (called xe2x80x9clarge objectsxe2x80x9d or LOBs), a single row could comprise many megabytes or even gigabytes. Since a data page is generally limited to 4000 bytes, a single such row must be mapped to many data pages. Such an arrangement is shown in FIG. 3. In FIG. 3, table 300, of which rows 306, 308, 310, and 312 are shown, is used to store data objects, including large data objects. For example, row 306 has a data object 302 and a large data object 304. Data object 304 may consist of many megabytes and therefore must be mapped to several pages. These pages can be connected together in a page chain as shown in FIG. 3. For example, page 318 stores a portion of the data object F2. The remainder of the F2 object is stored in page chain of which pages 314 and 322 are shown. This page chain is connected together with page pointers. For example, page pointer 316 points to page 318 and page pointer 320 points to page 322.
A space map associated with the table data illustrated in FIG. 3 is shown in FIG. 4. The map 400 is associated with a page chain of which pages 414, 416, 418 and 420 are shown. The page chain is connected together with page pointers. For example, page pointer 422 points to page 416, page pointer 424 points to page 418 and page pointer 426 points to page 420. The space map 400, which may be contained within one or more pages itself, contains data sections relating to each data page which can be used in the table. For example, data section 402 relates to page 414, data section 404 relates to page 416, data section 408 relates to page 418 and data section 410 relates to page 420. Data section 406 relates to a page which is not currently in the page chain, for example an unused page. Each data section contains information relating to its associated page. This information may be as simple as a single bit indicating whether the associated page is unused (free) or used. Other information relating to the page can also be included in each data section. When table rows are added, the space map is searched to find a page that is unused. When found, this page is added to the page chain and the associated data section in the space map 400 is marked to indicate that the page has been used. Conversely, if a table row is deleted, the data sections corresponding to the pages which store the data from the row are marked to indicate the associated pages are free. In order to allow for recovery, a recovery log is kept of all changes made to the space map. During rollback, changes to the space map can be reversed by sequentially applying xe2x80x9cundoxe2x80x9d operations to the space map from the log.
When a table is initially populated with these large data objects using the conventional row insert methods, the recovery logs, which typically include the new data which is inserted into the table, are also very large. It is difficult for a conventional database management system to manage such large recovery logs. Logging such large amounts of data not only slows the time to complete this load process, but also impacts the system""s overall performance due to contentions for logging devices and the overhead required to manage log archiving. One way to improve loading performance is to use parallel insert jobs. However, many database management systems support only a single log stream, and, in these systems, logging becomes a bottleneck.
One prior art solution to this problem is to provide an option to turn off logging for such types of tables during the loading process. However, the data inserted to such tables may be lost if a system termination occurs before the data pages in the page buffer pool are eventually flushed to persistent storage. Alternatively, in tables in which both logged and unlogged objects exist, the entire table may have to be declared invalid even though the logged objects could be recovered after a system failure.
Another prior art solution is to use a special database load utility to load the data without logging, instead of using the normal insert method to populate the table. However, the load utility process is a single thread process on the entire table and, consequently, it prevents concurrent read/write access by other applications during the loading process. Further, if any type of error occurs during the load process, the newly-loaded data must be deleted and completely reloaded.
Therefore, it is desirable to provide a method and apparatus which can load a database table with large objects without requiring recovery logging and yet be able to recover the loaded data if an abnormal system termination occurs at a subsequent time. It is also desirable to provide a method and apparatus which allows recovery of logged objects in a table that includes both logged and unlogged objects in the event that a recovery is necessary before a backup is performed.
In accordance with the principles of the invention, database tables which are to be loaded with large data objects are defined with a special logging attribute value. This attribute value causes the database management system to suspend recovery logging for the databases so defined, but to continue with logging control information, such as space management changes. Therefore, during a recovery operation, the control information for LOBs will be recovered as the log entries are applied, but the data will not be recovered because no data recovery entries are made. Normally, since some data entries are invalid, the entire table will be marked as invalid after recovery.
However, in accordance with a preferred embodiment, control information in the recovery log for each LOB has a special status indicator. This status indicator is set to cause the associated LOB entry to be marked as invalid when the log entry is reapplied, for example, during a recovery operation from a media failure.
Rather than invalidating the entire table, the database management system (DBMS) is modified to only prevent access to the LOB entries which have been marked invalid during the recovery process. For example, in accordance with one embodiment, the DBMS could return an SQL xe2x80x9cresource unavailablexe2x80x9d code to an application which attempts to access an invalid database entry. The DBMS can still access other entries which have been recovered. The selective invalidation of LOBS handles the situation where a database table logging attribute is altered from a xe2x80x9cLOG YESxe2x80x9d value to a xe2x80x9cLOG NOxe2x80x9d value and a media recovery is needed before a backup occurs. Alternatively, the selective invalidation handles a situation where a recovery to currency operation is mistakenly used on a table with unlogged objects instead of a recovery to copy operation.