1. Field of the Invention
This invention relates in general to database management systems performed by computers, and in particular to an optimized method and system for minimizing data outage time and data loss while handling errors detected during data recovery.
2. Description of Related Art
Databases are computerized information storage and retrieval systems. A Relational Database Management System (RDBMS) is a database management system (DBMS) which uses relational techniques for storing and retrieving data. RDBMS software using a Structured Query Language (SQL) interface is well known in the art. The SQL interface has evolved into a standard language for RDBMS software and has been adopted as such by both the American National Standards Organization (ANSI) and the International Standards Organization (ISO).
A typical relational database management system includes both database files and index files. The database files store data in the rows and columns of tables stored on data pages while index keys, used for faster reference of the data, are stored on index pages. A page is a physical unit of transfer between main storage and secondary storage. In such a table, the rows may correspond to individual records while the columns of the table represent attributes of the records. For example, in a customer information table of a database management system, each row might represent a different customer data object while each column represents different attributes of the customers, such as the name of a particular customer, the amount owed by the customer and the cash receipts received from the customer. The actions of a transaction that cause changes to recoverable data objects are recorded in a log file or data set.
Generally, many transaction processing systems, Data Base Management Systems and application programs have a need for enhanced transaction control for databases. Transaction control is well understood in the art and is often used to guarantee the integrity of enterprise databases. Guaranteeing the integrity of databases in the face of concurrent execution of multiple transactions and various failures is a very important issue with enterprise databases. Indeed, many methods have been developed in the past to deal with these problems for enterprise databases. Specifically, one such method, utilized in many transaction processing systems, database management systems and applications includes an efficient technique for DB2 database data recovery described below.
Area of integrity constraints remains of great concern in providing for recovery of database data because a vast amount of time and money is associated with the compilation of data resident in the database. Moreover, users have fear of losing data because of the great dependence which they have on their increasingly vital database resources. Thus, a general approach used in most database systems provides for redundancy by creating backup copies of the data for the event of database reconstruction and tracks transactions/operations in log files so that transactions can be rolled back and rerun after an error in the same order as the original set of transactions.
Many DBMS, such as the DB2, have utilities for detecting corrupted data and transactions and recovery utilities for recovery from corruption. They log the transactions and recover data objects and restore them in a database to an uncorrupted state after a software or hardware error. One exemplary utility is the Recover Utility for DB2 UDB for z/OS which uses log records of applications' or DBMS' transactions which are stored in log data sets, files or journals so that the transactions can be redone from the log records. Ranges of updates to the data objects are kept track of in a log range table in order to limit the amount of log records read during recovery. DB2 UDB for z/OS Recover Utility recovers an object or a list of objects by restoring the copy of each such data object from an image copy, usually saved during a regular data backup of objects from tablespaces and indexspaces. It then applies to the appropriate data or index pages the saved transactions from the log records in order to redo the transactions, such as inserts/updates/deletes, that may have occurred since the image copy was taken.
The log records are read from the active log data sets and the archive log data sets which may reside on magnetic tape devices or disk drives. In a data sharing multi-process, the log records must be read from each system separately and merged in sequence with log records from the other shared DB2 systems before they can be applied to redo inserts/updates/deletes on the data or index pages. The reading, which is done serially within each system, and the merging of the log records can be a time consuming process.
During the application of the log records to data or index pages in the recovery process, error checking is performed in order to ensure that data integrity is preserved. If it is determined that applying a log record to a data or index page will make the data inconsistent, the log record and all subsequent log records that update this data or index page are not applied. Instead, an error message is issued and a problem determination action has to be performed. Thus, a dump for each data or index page that encountered an error during log apply is saved and all these data or index pages are flagged with an “inconsistent” status so that the data on the flagged pages are made inaccessible by the users during this data outage time.
The problem determination actions, performed after the error determination, are time extensive which can make the data on the flagged pages inaccessible for an extended time period. Presently, these actions include contacting an IBM Support Center to perform a manual problem determination on received dump(s) by analyzing the dump(s) to determine which log record(s) could not be applied and why. Analyzing the dump(s) requires in-depth knowledge of the internal processes of DB2 and requires time to locate ranges of the problematic log record(s) that could not be applied to the image copy. Once these ranges have been determined, a print log tool must be used to print the contents of the log records either at the customer site or at the IBM Support Center. It is preferable for the log data sets to be sent in by the customer so that the IBM Support Center personnel can run the print log tool and analyze the output. In that case, for DB2 systems with a high workload, a large number of active and archive log data sets must be sent in, and thus the number of log records read from the log data sets and merged by the print log tool, in the data sharing case, could be tremendous, resulting in a prolonged run time for the print log tool.
The print log tool must be run for each data object that encountered an error separately, since it can only print the log records for one object at a time. Moreover, the run time for the print log tool can also be adversely affected if the log data sets reside on a magnetic tape device which has slower access than memory or hard disks. Further, based on the results of the log records' analysis, the customer must manually build the SQL statements to redo on the flagged data or index pages the inserts/updates/deletes that have occurred since the image copy was taken.
It is possible that, due to the volume of log records that must be analyzed, it is not feasible to manually build the SQL statements to redo the missing work. In cases like these the customers can sometimes choose other alternatives, such as restoring the data from a cloned system or even choosing to lose some of their data. Some customers have a log analysis tool to scan and analyze log records and build the SQL statements to redo the work that has been lost, by redoing the inserts/updates/deletes on flagged data or index pages that have occurred since the image copy was taken. However, in conventional systems the log analysis tool has to read and merge the log records from the active and archive log data sets, which could result in a prolonged run time. Moreover, a log analysis tool needs to start the log scan at a point of an image copy, as opposed to using the first log record that encountered an error, so more log records have to be read and analyzed than are actually needed, which also prolongs the run time.
The main drawback of the conventional recovery systems' approach, documented above, is the extended time period where the data from the flagged pages is not available due to the complexity of problem determination and the prolonged execution time of the tools run against the log data sets. In these systems the active and archive log data sets must be read multiple times, once during the recovery process and once for each data object that encountered an error, during the procedure to recover the data from unapplied, problematic log records.
Therefore, there is a need for a simple, optimized and method which can handle errors discovered during data recovery process more efficiently in order to minimize data outage time and data loss and which can automatically recover and restore the data objects from unapplied, problematic log records.