1. Field of the Invention
This invention relates generally to the field of databases, and in particular to a method for ensuring the integrity of a database.
2. Background Art
A database is an ordered collection of data. A database system allows one or more data users, referred to as "clients" to add to, change, read from, delete, and/or otherwise manipulate the data of the database. A database management system is utilized to control the storage, retention and retrieval of data by clients in a database.
In a computer system, the database is often stored on a permanent storage system, such as a magnetic, optical, or magneto-optical disk drive. The term "permanent storage system" refers to a storage system that is used to retain data over long periods of time, in spite of power interruptions and some other failures. A disk dive is an example of such a permanent storage system. If data is stored in a nonvolatile memory, such as on a Winchester disk drive, and the disk drive is turned off or otherwise powered down and then turned back on, the data is still available. This is in contrast to temporary storage systems, such as most dynamic random access memory (DRAM). If data is stored in a typical DRAM system (without battery power), and the DRAM is turned off and then turned back on, the data is no longer available.
A client in a computer implemented database may be a human user, a processor, or a program executed on a processor. A client is any entity that can make a "transaction" with the database. A transaction is a sequence of operations that allow a client access to the database to read data, delete data, add new data, update or modify existing data. A transaction begins with an operation referred to as BEGIN operation and ends with either a COMMIT operation or a ROLLBACK operation. A COMMIT operation signifies the completion of a successful transaction. A ROLLBACK operation signifies the unsuccessful termination of a transaction.
In a typical computer-implemented database system, a transaction is not performed on the "original data" (i.e. the data stored on the permanent storage system), but is performed on a copy of the data. For example, when a client accesses a block of data from a database, a copy of the data is retrieved from the database and is stored in a temporary storage location, referred to as a "buffer". The client then may read, update, modify, replace or delete the block of data. The results of the transaction are effected on the original block of data at a later time. This is done because, typically, it is not possible to modify individual bytes in the permanent storage system; an entire block must be written at once. A typical database system utilizes a number of buffers to allow the transactions of multiple clients to be performed. As a result, there may be a number of transactions waiting to be executed on the original data at any one time. The delay in time between the execution of a transaction on data in a buffer and the execution of that transaction on the original data can affect the "integrity" of the database. The integrity of a database refers to its accuracy and reliability. A database has integrity if all outstanding client transactions have been executed in the correct order on the original data.
There are a number of factors that make it difficult to maintain the integrity of a database. First, when multiple clients access a database, a method to avoid harmful interaction among the clients must be provided. When multiple clients make changes to the database, the changes made by each client may be interrelated and must be applied to the database in the proper sequence across clients to preserve their relationship. Another factor that increases the difficulty of preserving the integrity of a database is that computer systems do not exhibit perfect reliability and components of a computer system occasionally fail, (e.g. the disk drive fails, the processor fails, there is a system failure, a power failure, etc.). Changes by clients to the database made before, during or after the failure must be applied to the database in the proper sequence to ensure database integrity.
When a database system failure occurs, a "recovery" operation is required. A recovery in a database system means recovering the database itself. That is, restoring the database to a state that is known to be consistent and reasonably recent. This is sometimes accomplished by restoring the database from a back-up copy of the contents of the database. The back-up copy is a duplicate copy of the original database that is made periodically.
A disadvantage of recovering a database from a backup copy is that the backup copy may not reflect transactions that occurred after the time the backup was made and before the failure of the database system. It is desired to be able to recreate those missing transactions to avoid the duplication of work that has already been performed on the database. This requires that the transactions be "persistent". Persistence refers to the retention of sufficient redundant information about a transaction to recreate the effects of the transaction at a later time, in the event that a failure causes loss of the primary copy of the data written by the transaction.
One prior art method used to ensure the persistence of changes involves sequentially writing short records containing information sufficient to redo the changes to a "log." The log provides a record of changes that can survive a database system failure. Thus, the persistence of the changes is provided.
A database system that uses a log to store redo records is illustrated in FIG. 1. The database system of FIG. 1 consists of a plurality of clients 101, 102 and 103 coupled to a database controller 104. The database controller includes a plurality of cache buffers 107-112, and a log buffer 113. The database controller is coupled to permanent storage devices 105 and 106. In the example of FIG. 1, storage device 105 stores log information and storage device 106 stores the original database data.
When a client reads information, a BEGIN operation is executed, the information is located in database 106 and transferred to one of cache buffers 107, 108, 109, 110, 111 or 112. The information is provided to the appropriate client from the proper cache buffer. When the client has read the information, a COMMIT operation is executed and the cache buffer is released for use by another client. No changes are recorded in database 106 since a read operation does not alter the original data. Correspondingly, a log entry is not required because there is no need to recreate a read transaction in the event of a failure.
When a client writes information, the information is first written to cache buffer 107, 108, 109, 110, 111 or 112. When all of the information to be added has been entered into the cache buffer, a log entry is produced and written to log buffer 113. When the client executes a COMMIT operation, the log entry is written to log 105. At some later time, an attempt is made to add the information to database 106. If this attempt fails, (or if some earlier failure prevents the attempt from taking place), the log entry in log 105 can be used as a redo record to "redo" the failed transaction.
When a client changes or deletes information in the database, the record containing the information to be changed is first located in database 106. The record is copied into a cache buffer 107, 108, 109, 110, 111 or 112. The client makes appropriate changes to the cache buffer containing the information. When all of the changes have been made to the information in the cache buffer, a log entry is produced and placed in log buffer 113. After a COMMIT operation, the log entry is then written to log 105. An attempt is made to write the updated record back to database 106 to replace the old record. If this attempt fails, (or if some earlier failure prevents the attempt from taking place), the log entry in log 105 is used as a redo record to "redo" the failed transaction.
The log system of FIG 1 has the disadvantage that, in a high-performance system, the log can become a bottleneck. The size of the redo records can exceed the transfer rate capability of the device on which the log is stored. The performance of the system cannot be increased without increasing the transfer rate of the storage device.
Consider the example when a transaction log is kept on a disk drive. Assuming a rotation rate of approximately 4,000 RPM, positioning time of 3.5 to 32 milliseconds, 56 sectors per track and 512 bytes per sector, the optimal throughput of this disk drive is between 610 to 1,550 kilobytes per second supporting a transaction rate of 508 to 1,292 transactions per second (assuming that each transaction requires approximately 1.2 kilobytes of log information). In a high performance system, such as a parallel processor system, tens of thousands of transactions per second can be achieved. In such a system, a single log device is the limiting factor in performance.
One prior art attempt to overcome the transfer rate limitation is to provide multiple logs, each stored on a different device. The aggregate transfer rate of all logs is greater than the transfer rate of a single device. The division of the log into multiple components is typically accomplished in the following manner. The database is partitioned and a separate log is maintained for each partition. As a result, all records for a single partition are contained in one log. However, performance can still be limited when a number of clients are writing into the same partition, resulting in contention bottlenecks. In addition, a client that writes to several partitions must correspondingly write to several logs, adding complexity and consuming time.
A system implementing a partitioned database with a separate log for each partition is illustrated in FIG. 2. A plurality of clients 201-206 are coupled to a database router 207, which is in turn coupled to database controllers 236 and 237. The single database 106 of FIG. 1 is divided or "partitioned" into two separate database partitions 211 and 212 in the example of FIG. 2, which are managed by database controllers 236 and 237, respectively. These partitions may be actual physical partitions where separate disk drives, for example, are used to store different sections of the database. Alternatively, the partition may be logical, based on address locations.
Each partition 211 and 212 has associated cache buffers and log buffers within database controllers 236 and 237, respectively. Database controller 236 includes cache buffers 214-219 and log buffer 232. Log buffer 232 communicates with permanent log storage 208 and database partition 211. Database controller 237 includes cache buffers 220-225 and log buffer 233. Log buffer 233 communicates with permanent log storage 209 and database partition 212.
When a client reads information, the request is routed by database router 207 to one of the database controllers 236 or 237, and the information is located in one of database partitions 211 or 212. The information is then transferred to one of the cache buffers associated with the particular database partition. For example, if client 203 reads a record from partition 213, a copy of the record is stored in one of cache buffers 226-231. The information is provided to client 203 from the cache buffer. When the client has read the information, the cache buffer is released to the system. No log entry is required since a read operation does not alter the information.
Consider the case when client 202 writes information to partition 211. Client 202 writes the desired data to one of the cache buffers 214-219 in buffer partition 236 (associated with partition 211). After the data to be written is entered into a cache buffer, a log entry is created and stored in log buffer 232. After a COMMIT operation is executed by the client 202, the log entry in log buffer 232 is written to the log storage 208 associated with that partition. At a later time, the contents of the cache buffer are written to the appropriate location in database partition 211. If there is a system failure prior to the entry of the data into the database partition 211, the log entry stored in log 208 can be used to redo the write operation.
Now consider the example of when client 202 modifies or deletes data from database partition 212. The record to be modified or deleted is located in database partition 212 and a copy is written to one of cache buffers 220-225 in partition 237 (associated with database partition 212). Client 202 executes the appropriate modifications or deletions on the data in the cache buffer and a log entry is written to log buffer 233. When client 202 executes a COMMIT instruction, the log entry in log buffer 233 is written to permanent log storage 209. Subsequently, the modifications or deletions are affected on the original data stored in database partition 212.
FIG. 2 shows a design in which each partition 211 or 212 is managed by a separate database controller 236 or 237, respectively. However, the same principles apply if all partitions are managed by the same database controller. In this case, specific cache buffers may be set aside for each partition; or, alternatively, a common set of cache buffers may be used for blocks in all partitions.
There are a number of disadvantages associated with the partition-based log system of FIG. 2. Because, in a series of operations, a client may write to one or more partitions, the log entries associated with transactions for that client are stored in different logs. To ensure data integrity, coordination is necessary to ensure that the changes made by a given client in different logs are executed together. Another disadvantage occurs when a plurality of clients are writing to a single partition. In this situation, a bottleneck can occur as in the log system of FIG. 1.