1. Technical Field
The present invention relates to the field of database management systems generally and, more particularly, to method and apparatus for detecting and recovering from data corruption of a database by codewording regions of the database and by logging information about reads of the database.
2. Description of the Related Arts
A database is a collection of data organized usefully and fundamental to many software applications. The database is associated with a database manager and together with its software application comprises a database management system (DBMS). In recent years, extensible database systems such as Illustra (now part of the Informix Universal Server) have been developed which allow the integration of application code with database system code. In these systems, the application code has direct access to the buffer cache and other internal structures of the DBMS. Similarly, application programs in many object oriented database (OODB) systems have direct access to an object cache in their address space. This OODB architecture was developed to minimize the cost of accesses to data, for example, to support the needs of Computer Aided Design (CAD) systems. Finally, several recently developed storage management systems provide memory resident or memory mapped architectures. For example, the Dali main-memory storage manager described in Bohannon et al., xe2x80x9cThe Architecture of the Dali Main-Memory Storage Manager,xe2x80x9d Multimedia Tools and Applications, 4, 115-151 (1997) is designed to provide applications with fast, direct access to data by keeping the entire database in volatile main memory. In all these systems, direct access to data (either in the database buffer cache or in a memory-mapped portion of the database) by application programs is critical to providing fast response times. The alternative to memory mapping is to access data via a server process, but this presents an unacceptable solution due to the high cost of inter-process communication. Application code is typically less trustworthy than database system code, and there is therefore a significant risk that xe2x80x9cwild writesxe2x80x9d and other programming errors can affect persistent data in systems that allow applications to access such data directly. Since the systems described above are increasingly popular, the risk of wild writes is growing. Additionally, there is a risk of damage due to software faults in the DBMS itself. It is therefore important to develop techniques that can mitigate the risk of corruption.
In our parent U.S. patent application Ser. No. 08/766,096, filed Dec. 16, 1996, now U.S. Pat. No. 5,845,292 and entitled xe2x80x9cSystem and Method for Restoring a Distributed Checkpointed Database,xe2x80x9d we describe the application of multiple checkpoints and the maintenance of a stable log record stored on a server for tracking operations to be made to the multiple checkpoints in a distributed environment. A companion parent application, U.S. patent application Ser. No. 08/767,048, now U.S. Pat. No. 5,845,849, entitled xe2x80x9cSystem and Method for Restoring a Multiple Checkpointed Database in View of Loss of Volatile Memoryxe2x80x9d filed the same day describes recovery processes at multiple levels of a DBMS in the event of loss of volatile memory. The ""048 and the ""096 applications should be deemed to be incorporated by reference herein as to their entire contents. Both of these applications relate to the preservation and restoration of a database (or distributed database), for example, stored in main volatile memory of a data processor.
The problem of detecting and recovering from corruption of data in a database system still remains to be solved in a pragmatic manner without adding considerable overhead to the DBMS. Data corruption may be physical or logical and it may be direct or indirect. Data is xe2x80x9cdirectlyxe2x80x9d corrupted by xe2x80x9cunintendedxe2x80x9d updates, such as wild writes as explained above due to programming errors in the physical case, or arising from incorrectly coded updates or input errors (human errors) in the logical case. Once data is directly corrupted, it may be read by a process, which then issues writes based on the value read. Data written in this manner is indirectly corrupted, and the process involved is said to have carried the corruption. While this process may be a database maintenance process, we focus on transaction-carried corruption, a problem in which the carrying process is executing transactions.
Direct physical corruption can be mostly prevented with hardware memory protection, using the virtual memory support provided by most operating system. One approach involves mapping the entire database in a protected mode, and selectively un-protecting and re-protecting pages as they are updated. However, this can be very expensive, for example, on standard UNIX systems. An alternative to the hardware approach would be programming language techniques such as type-safe languages or sandboxing. (Sandboxing is a technique whereby an assembly language programmer adds code immediately before a write to ensure that the instruction is not affecting protected space.) However, type-safe languages have yet to be proven in high-performance situations, and sandboxing may perform poorly on certain architectures. Finally, communication across process domain boundaries to a database server process provides protection, but such communication is orders of magnitude slower than access in the same process space, even with highly tuned implementations. The concern over physical corruption is further motivated by the increasing number of systems in which application code has direct access to system buffers, including extensible systems, object databases, and memory-mapped or in-memory architectures. Finally, some work has raised concern over damage to data due to faults in the DBMS itself.
Integrity constraints are widely studied and prevent certain cases of logical corruption in which rules about the data would be violated. However, it is an object of the present invention to deal with those cases in which integrity constraints and other input validation techniques fail, and whether due to programming error or invalid input, unintended updates are made to the database. We consider such cases inherently impossible to prevent, and instead assume that the problem is detected later, usually when a database user notices incorrect output (on a bank statement, for example).
Thus, there appears a genuine need in the art of database management systems to provide an improved method and apparatus for detecting and recovering from corruption of a database.
According to the present invention, it is a principle to apply several new techniques for the prevention or detection of corruption. In particular, a Read Prechecking scheme associates one word codewords with each region of data, and prevents transaction-carried corruption by verifying that the codeword matches the data each time it is read. A Data Codeword scheme, a less expensive variant of Read Prechecking, allows detection of direct physical corruption by asynchronously auditing the codewords. This scheme is also referred to herein as deferred codeword maintenance and involves performing codeword updates during a process called xe2x80x9clog flushingxe2x80x9d at the same time as data is flushed to disc from main memory.
For detecting indirect logical or physical corruption, it is a feature of the present invention to log information about reads (Read Logging). Interestingly, any negative impact of Read Logging is limited, as the actual values read are not logged according to one embodiment of the present invention, just the identity of the item read and optionally a checksum of the value. Moreover, it is an extension of the present invention to apply codewords as well to protect the read log records.
When corruption is detected rather than prevented, techniques for corruption recovery are employed to restore the database to an uncorrupted state. As will be further described herein, once codewording of data and read logging is performed, models and algorithms are presented for recovery from transaction-carried indirect corruption. One model, the redo-transaction model, uses logical descriptions of transactions to repair the database state. A second model, the delete-transaction model, focuses on removing the effects of corruption from the database image. The algorithms presented herein can be applied to recovery from logical or physical corruption. In addition, tracing techniques are presented which aid in determining the scope of logical corruption. Read logging and models for corruption recovery are claimed in copending U.S. patent application Ser. No., filed concurrently herewith entitled xe2x80x9cMethod and Apparatus for Detecting and Recovering from Data Corruption via Read Logging,xe2x80x9d of the same inventors (Attorney Docket No. Bohannon et al. 9-26-4-39-12).
To ascertain the performance of our algorithms for detecting and recovering from physical corruption, we have studied the impact of these schemes on a TPC-B style workload implemented in the Dali main-memory storage manager. Our goal was to evaluate the relative impact on normal processing of schemes that can be easily ported across a variety of architectures and operating systems. In addition to our schemes, we study a hardware-based protection technique. For detection of direct corruption, the overheads imposed cause throughput of update transactions to be decreased by 8%. Prevention of transaction-carried corruption with Read Prechecking costs between 12% and 72%, but requires a significant space overhead to achieve the better performance numbers. Detection of transaction-carried corruption with Read Logging costs between 17% and 22%. Our study indicates that the corruption prevention algorithms of Sullivan et al., xe2x80x9cUsing Write Protected Data Structures to Improve Fault Tolerance in Highly Available DBMSxe2x80x9d in Proceedings of the International Conference on Very Large Databases, pp 171-179, 1991, when using standard OS support for memory protection, decrease throughput by about 38%. Thus, the codeword and read logging based detection and prevention schemes of the present invention perform significantly better than the hardware-based protection.
With the present invention, it is possible to identity a subset of the later transactions that were (directly or indirectly) affected by the error, and to selectively roll them back and redo them manually (or even automatically in some cases). Also, the techniques of the present invention are language and instruction-set independent.
Thus, a method of detecting and recovering from data corruption of a database according to the present invention is characterized by the steps of protecting data of the database with codewords, the database having a plurality of regions, one codeword for each region of the database and verifying that a codeword matches associated data before the data is read from the database to prevent transaction-carried corruption. Also, according to another embodiment in which codewords are not immediately maintained, a method of detecting and recovering from data corruption of a database comprises the steps of protecting data of the database with codewords, one codeword for each region of the database, and asynchronously maintaining the codewords to improve concurrency of the database, for example, during a log flush to disc process. The invention also comprises a database so maintained and protected and associated recovery processes.
These and other features of the present invention will be best understood from considering the drawings and the following detailed description of the preferred embodiments.