The present invention relates generally to data recovery in a database management system after an abnormal system termination and, more specifically, to a database management system recovery method and apparatus that does not use data recovery logging.
Databases store data in a variety of manners depending on the internal organization. For example, a relational database system typically stores data in tables. The tables are comprised of rows, each of which contains a record. The record, in turn, contains fields and the fields contain the actual related data values for a data xe2x80x9cobject.xe2x80x9d Each table may also be associated with one or more indexes, which provide rapid access to the rows in an order determined by the index and based on key data values contained in selected fields in each row. As an example, a row might be associated with each employee of an organization and contain fields that hold such information as the employee name, an identification number, and telephone numbers. One index might order the rows numerically by employee identification number, while another index might order the rows alphabetically by employee name.
Such a database conventionally includes methods that insert and delete rows and update the information in a row. When changes are made to the rows, any database indexes associated with the table may also need to be updated in order to keep the indexes synchronized with the tables. The rows in each table are mapped to a plurality of physical pages on the disk to simplify data manipulation. Such an arrangement is illustrated in FIG. 1.
In FIG. 1, table 100, which illustratively consists of rows 112, 114, 116, and 118, is mapped to a chain of pages which pages 120, 138, and 132 are shown. In the table illustrated, each row consists of five separate fields. For example, row 112 consists of fields 102, 104, 106, 108 and 110. The fields in each of rows 112, 114, 116 and 118 are mapped illustratively to page 138, which can contain data for more than one row. For example, field 102 maps to location 126 in page 138. Fields 104 maps to location 128. Field 106 maps to location 130. In a similar manner field 108 maps to location 124 and field 110 maps to location 134. The fields in the next row 114 are mapped directly after the fields in row 112. For example, field 111 is illustrated which maps to page location 136. When the page is completely filled with data, field information is mapped to the next page in the page chain. The pages are chained together by means of page pointers. For example, page pointer 122 links pages 120 and 138, whereas page pointer 140 links pages 138 and 132. All of the pages used to store the data in table 100 are linked together in a similar manner in a page chain.
The data pages are normally kept in a page buffer pool located in system memory. In order to make such a database system persistent or xe2x80x9cdurablexe2x80x9d, the data pages must be written to an underlying non-volatile storage system, such as disk storage. This storage operation takes place on a page level so that when a modification is made to data on a page the entire page is stored in the persistent storage. Each page could be copied to the persistent storage as soon as data on the page was modified. However, this immediate copying greatly slows the system operation since persistent storage is generally much slower than RAM memory. Alternatively, the information in modified pages in the buffer pool can be copied or xe2x80x9cflushedxe2x80x9d to the disk storage at intervals. For example, the information could be flushed periodically or when the number of changed pages in the buffer pool reaches some predetermined threshold. During this disk flushing operation, the data modifications are performed xe2x80x9cin placexe2x80x9d so that the old data is either overwritten or deleted from the disk and lost.
Since the data is lost during the modification process, in order to ensure data integrity in the case of a system failure, or crash, the actions performed on the database are grouped into a series of xe2x80x9ctransactionsxe2x80x9d. Each transaction is xe2x80x9catomicxe2x80x9d which means that either all actions in the transaction are performed or none are performed. The atomic property of a transaction ensures that the transaction can be aborted or xe2x80x9crolled backxe2x80x9d so that all of the actions that constitute the transaction can be undone. Database transactions commonly have a xe2x80x9ccommitxe2x80x9d point at which time it can be guaranteed that all actions which comprise the transaction will complete properly. If the transaction does not reach the commit point, then it will be rolled back so that the system can return to its state prior to the initiation of the transaction. Consequently, if there is a system termination or crash prior to the commit point, the entire transaction can be rolled back.
The use of a buffer pool complicates transaction processing because although a transaction has committed, system operation could terminate after a page has been modified, but before the modified page is flushed to disk. In order to prevent data loss caused by such a system interruption, a logging system is used to permit data recovery. The logging system records redo and undo information for each data modification in a special file called a xe2x80x9crecovery logxe2x80x9d that is kept in non-volatile storage.
The recovery log consists of an ordered list of redo/undo actions and contains information such as a transaction ID, a page ID, an offset length and the old and new data constituting the update. Additional control information is often included to facilitate the logging operation. This control data includes a transaction table that includes one record per active transaction that contains the transaction state (for example, running, committed or aborted.) The control information also includes a dirty page table that contains one entry for each page in the buffer pool that has been modified.
In order to ensure both atomicity and persistence for each transaction, a xe2x80x9cwrite aheadxe2x80x9d logging protocol is used. According to this protocol a log record is written for an update before the corresponding modified data page is written to disk. In addition, all log records are written for a transaction before the transaction is committed.
In addition to the recovery logging of data update information, recovery logging is also performed during storage space management procedures that involve allocation and deallocation of data pages for each database row. For example, a set of pages is commonly maintained by the database system to handle storage space allocation and deallocation for each table. These pages are referred to as a space map, free space, a free space map or unused space. The term xe2x80x9cspace mapxe2x80x9d will be used herein to refer to all such space allocation areas and structures. In general, each space-map page manages space allocation for a range of data pages and contains status information that indicates whether a particular data page on disk storage has been used. When a new data row is inserted into a table, the space-map pages associated with that table are examined and updated to allocate space for a page, or the part of a page, which holds the row. A recovery log entry is written for each change made to the space-map pages. These recovery logs can be used to free the allocated space if a transaction roll back occurs before the transaction has been committed.
The recovery logs are used to restart processing if system operation is abnormally terminated, for example, due to a power failure. In a recovery operation, redo information in the recovery log is used to reconstruct all transactions at the time of the failure. The undo information is used to undo transactions that did not commit prior to the termination.
The conventional database system is somewhat complicated in a distributed database system such as shown in FIG. 2 which illustrates, in schematic form, an example of such a distributed database system. The system consists of four database management systems 200, 202, 204 and 206. Each of these database management systems 200-206 contains a replicated copy of data that is in use throughout the system. Each of the database management systems communicates with, and is accessed by, a plurality of users. For example, database management system 200 may interact with a plurality of users, of which three users 208-212, are shown. Similarly, database management system 204 interacts with users 214-218. Database management system 202 interacts with users 220 and 224 and database management system 206 services users 226-230. Other configurations and numbers of users are within the spirit and scope of the invention.
In such a system, copies of the data maybe maintained on the databases in order to speed up access of the database by the users. In order to ensure data integrity in such a system, it is necessary to ensure that a transaction that affects data that has been replicated on several systems commits in each of the systems. For example, a transaction that affects data stored in system 200 must also be communicated to and commit on systems 202-206. A two-phase commit procedure is used to ensure that the transaction completes in each system. Unlike non-distributed databases where a single change is, or is not, made locally, all participating databases in the distributed database system must all commit, or all roll back, a transaction if there is a system or network failure at a single database. The two-phase commit procedure has two distinct processes that are accomplished in order. The first process occurs in a so-called xe2x80x9cpreparexe2x80x9d phase where a global coordinator or initiating database requests that all distributed databases agree to commit or roll back a transaction. In most distributed database systems, depending on the transaction, any database could serve as the global coordinator. If not all participants can prepare the transaction for commitment, the global coordinator will instruct all databases to roll back the transaction.
The second portion of the two-phase commit procedure is the actual commit phase that occurs after all participants have responded to the coordinator that they are prepared to commit a transaction. In the commit phase, the coordinator requests that all databases commit the transaction. If there is a machine, network or software failure that occurs during the two-phase commit process, the transaction tables in each database are used by the recovery process to resolve the status of any transaction.
The aforementioned recovery system works well with most database systems. However, for a table that is used to store large data objects, for example, images, large text documents, digitized video or digitized audio data (called xe2x80x9clarge objectsxe2x80x9d or LOBs), a single row could comprise many megabytes or even gigabytes. Since a data page is generally limited to 4000 bytes, a single such row must be mapped to many data pages. Such an arrangement is shown in FIG. 3 that has been simplified for clarity. In FIG. 3, table 300, of which rows 306, 308, 310, and 312 are shown, is used to store data objects, including large data objects. For example, row 306 has a data object 302 and a large data object 304. Data object 304 may consist of many megabytes and therefore must be mapped to several pages. These pages can be connected together in a page chain as shown in FIG. 3. For example, page 318 stores a portion of the data object F2. The remainder of the F2 object is stored in page chain of which pages 314 and 322 are shown. This page chain is connected together with page pointers. For example, page pointer 316 points to page 318 and page pointer 320 points to page 322.
A space map associated with the table data illustrated in FIG. 3 is shown in FIG. 4. The map 400 can be associated with a page chain of which pages 414, 416, 418 and 420 are shown. The page chain is connected together with page pointers. For example, page pointer 422 points to page 416, page pointer 424 points to page 418 and page pointer 426 points to page 420. The space map 400, which may be contained within one or more pages itself, contains data sections relating to each data page which can be used in the table. For example, data section 402 relates to page 414, data section 404 relates to page 416, data section 408 relates to page 418 and data section 410 relates to page 420. Data section 406 relates to a page that is not currently in the page chain, for example, an unused page. Each data section contains information relating to its associated page. This information may be as simple as a single bit indicating whether the associated page is unused (free) or used. Other information relating to the page can also be included in each data section. When table rows are added, the space map is searched to find a new page that is unused or an existing page that has space for the new row. When found, a new page is added to the page chain and the associated data section in the space map 400 is marked to indicate that the page has been used. If an existing page is used, the space map 400 is marked to indicate that a previously unused section is now used. Conversely, if a table row is deleted, the data sections corresponding to the pages which store the data from the row are marked to indicate the associated pages are free. In order to allow for recovery, a recovery log is kept of all changes made to the space map. During rollback, changes to the space map can be reversed by sequentially applying xe2x80x9cundoxe2x80x9d operations to the space map from the log.
When a table is initially populated with these large data objects using the conventional row insert methods, the recovery logs, which typically include the new data which is inserted into the table, are also very large. It is difficult for a conventional database management system to manage such large recovery logs. Logging such large amounts of data not only slows the time to complete this load process, but also impacts the system""s overall performance due to contentions for logging devices and the overhead required to manage log archiving. One way to improve loading performance is to use parallel insert jobs. However, many database management systems support only a single log stream, and, in these systems, logging becomes a bottleneck.
One prior art solution to this problem is to provide an option to turn off logging for such types of tables during the loading process. However, the data inserted to such tables may be lost if a system termination occurs before the data pages in the page buffer pool are eventually flushed to persistent storage.
Another prior art solution is to use a special database load utility to load the data without logging, instead of using the normal insert method to populate the table. However, the load utility process is a single thread process on the entire table and, consequently, it prevents concurrent read/write access by other applications during the loading process. Further, if any type of error occurs during the load process, the newly loaded data must be deleted and completely reloaded.
Therefore, it is desirable to provide a method and apparatus which can load a database table with large objects without requiring recovery logging and yet be able to recover the loaded data if an abnormal system termination occurs at a subsequent time.
In accordance with the principles of the invention, database tables which are to be loaded with large data objects are defined with a logging attribute which can be used to enable recovery logging and space map logging or to suspend recovery logging, but to continue with space management logging. During a two-phase commit operation for a transaction in a distributed system, the present invention insures that, at each database, all modified data pages are written to disk before the end of the phase 1 commit, if the pages have not been previously written. This write operation is monitored for errors. If any of the changed data pages cannot be written to disk successfully before the end the phase 1 commit operation, then the transaction will be rolled back at all databases. During rollback, the recovery log is used to undo changes to the space map, thereby deleting large data objects that were inserted without data logging. This frees the memory used by the inserted objects.