1. FIELD OF THE INVENTION
This invention relates to the field of distributed databases.
2. BACKGROUND ART
A database is an ordered collection of data. A database system allows one or more data users, referred to as "clients," to add to, change, read from, delete, and/or otherwise manipulate the data of the database. A database management system is utilized to control the storage, retention and retrieval of data by clients in a database.
In a computer system, the database is often stored on a permanent storage system, such as a magnetic, optical, or magneto-optical disk drive. The term "permanent storage system" refers to a storage system that is used to retain data over long periods of time, in spite of power interruptions and some other failures. A disk drive is an example of such a permanent storage system. If data is stored in a nonvolatile memory, such as on a Winchester disk drive, and the disk drive is turned off or otherwise powered down and then turned back on, the data is still available. This is in contrast to temporary storage systems, such as most dynamic random access memory (DRAM). If data is stored in a typical DRAM system (without battery power), and the DRAM is turned off and then turned back on, the data is no longer available.
A client in a computer implemented database may be a human user, a processor, or a program executed on a processor. A client is any entity that can make a "transaction" with the database. A transaction is a sequence of operations that allows a client to access the database to read data, delete data, add new data, or update or modify existing data. A transaction begins with an operation referred to as a BEGIN operation and ends with either a COMMIT operation or a ROLLBACK operation. A COMMIT operation signifies the completion of a successful transaction. A ROLLBACK operation signifies the unsuccessful termination of a transaction.
It is desired for a database system to provide "consistency", "concurrency", "atomicity", and "durability". Consistency is the state in which two or more values in a database that are required to be in agreement with each other are in fact in agreement. When transactions are executed one at a time, consistency is preserved. Concurrency is the state in which, even if transactions are executed at the same time (such that various statements from different transactions are executed in an interleaved fashion), the database system controls the sequence of execution so that consistency is preserved. Atomicity of a transaction means that either all of the statements of the transaction take effect or none of them do. Durability means that the effects of a transaction must persist across failures of the system components.
To provide data consistency during write and read operations, a method of "locking" the database to prevent other transactions is utilized. In a single database, one type of locking is referred to as "share locks." A share lock locks a block of data that is being accessed until it is no longer being accessed (the lock may be released before the end of a transaction). For example, during a read operation, the data is locked so that no other client can write to that data. Such a locking scheme limits activity on the database, and therefore provides low concurrency but high consistency. Another method is known as "release share lock." In a release share lock scheme, the data is locked when a read data operation is initiated. After the data item has been read, the transaction ending, the lock is removed. A third scheme is referred to as "not get share locks." In this scheme, the current contents of a database are read, and it is possible that a transaction might see uncommitted data. Therefore, a transaction cannot be confident of the accuracy of the data that is being read.
Another scheme is known as "exclusive locks". Exclusive locks are acquired when data is updated. The following is a matrix that describes lock compatibility.
______________________________________
                       Data locked    Data locked
                       in Share       in Exclusive
______________________________________
Want share lock        OK             Wait
Want exclusive lock    Wait           Wait
______________________________________
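The compatibility matrix can be expressed as a small lookup. The following Python sketch is illustrative only: the names LockMode and lock_request_outcome are hypothetical, and the case in which no lock is held (treated here as "OK") is an assumption that does not appear in the matrix.

```python
from enum import Enum


class LockMode(Enum):
    SHARE = "share"
    EXCLUSIVE = "exclusive"


def lock_request_outcome(held, wanted):
    """Return "OK" if the wanted lock can be granted at once, else "Wait".

    held is the mode in which the data is currently locked, or None if
    the data is not locked (assumption: an unlocked item is always grantable).
    """
    if held is None:
        return "OK"
    # Per the matrix, only a share request against share-locked data is compatible.
    if held is LockMode.SHARE and wanted is LockMode.SHARE:
        return "OK"
    return "Wait"
```

Under this sketch, a request for a share lock on share-locked data is granted, while every combination involving an exclusive lock must wait.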
Another scheme, utilized by the assignees of the present invention, does not utilize share locks, yet still provides a consistent version of the database during read operations. The scheme provides more correct data than the release share lock and not get share lock schemes without sacrificing concurrency. This scheme permits a read operation to see only committed data from other transactions, and any uncommitted changes made by the transaction containing the read operation. That is, the transaction sees a "snapshot" of the database as of a certain point in time. This is accomplished by implementing "system commit numbers." Each time a transaction is committed it is assigned a "system commit number." A system commit number is a logical value that increases with time. The system commit number is incremented or advanced each time a transaction commits, so that it reflects the logical order of the transactions progressing the database from one state to another. Before each statement in a transaction is executed, the current system commit number is saved and used to define a transaction-consistent state for all reads within the statement, so that even as other transactions are modifying the database and committing concurrently, changes made by other transactions during a statement's execution are not seen. When a transaction is explicitly declared read-only, the current system commit number is saved and used to define a transaction-consistent state for all reads within the transaction. As noted, this prevents a transaction from seeing data that is not committed (i.e., potentially changed or false data). It also prevents reads within a statement or an explicit read-only transaction from seeing changes that were committed after the statement or read only transaction started. This scheme only requires a wait when a transaction is attempting to update a data block but another transaction already has an exclusive lock.
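The snapshot behavior described above can be illustrated with a minimal Python sketch. It is not the assignee's implementation: the class VersionedStore, its methods, and the choice to keep every committed version in an in-memory list are assumptions made for illustration, and the sketch does not model a transaction's own uncommitted changes or exclusive locks.

```python
class VersionedStore:
    """Toy single-site store that stamps each commit with a system commit number."""

    def __init__(self):
        self.scn = 0          # current system commit number
        self.versions = {}    # key -> list of (commit_scn, value), oldest first

    def commit(self, writes):
        # The system commit number advances each time a transaction commits.
        self.scn += 1
        for key, value in writes.items():
            self.versions.setdefault(key, []).append((self.scn, value))
        return self.scn

    def snapshot_scn(self):
        # Saved before a statement (or read-only transaction) begins.
        return self.scn

    def read(self, key, snapshot_scn):
        # Return the newest value committed at or before the saved number.
        visible = None
        for commit_scn, value in self.versions.get(key, []):
            if commit_scn <= snapshot_scn:
                visible = value
        return visible
```

For example, if a reader saves the system commit number after a first commit sets "x" to 1, and a second transaction then commits "x" as 2, a read of "x" using the saved number still returns 1, without the reader having taken any share lock.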
The use of the system commit number scheme is effective in nondistributed database systems, that is, a database with a single data resource. However, it has not been effective in the case of a distributed database. In a distributed database, there is a network of sites, each containing a data source. The sites can each be under control of a separate resource manager. The entire collection of data at the multiple sites can be treated as a single distributed database. It is possible to have a transaction that can update any number of the distributed databases and can commit the transaction atomically. Such a transaction is called a "distributed transaction".
In a distributed database, in an implementation of the assignee of the present invention, a two-phase commit scheme is utilized for distributed transactions. In the first phase of a two-phase commit scheme, all databases surrender autonomy for a short period of time to hold resources necessary to commit or roll back a transaction as required. In the first phase, the various databases promise to commit or roll back when commanded by a master database. The second phase of the two-phase commit is the actual commit step. Each data source assigns its own local system commit number to its portion of the distributed transaction.
Because system commit numbers are established locally, it has not been possible to implement the read consistent scheme of a non-distributed database in a distributed database environment. This is because it is not meaningful to compare the system commit number of a read operation to a plurality of different system commit numbers at each database. It is possible that committed data in one database may have a system commit number higher than the system commit number of the read operation. In that case, the transaction is provided with old data instead of the most currently available data at the time the read operation was initiated.
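The difficulty can be illustrated with a minimal Python sketch of a two-phase commit in which each site assigns its own local system commit number. The class Site, the function two_phase_commit, the starting values 100 and 5, and the account data are all hypothetical; the sketch assumes every site votes to commit and ignores failures and locking.

```python
class Site:
    """One data source with its own, purely local system commit number."""

    def __init__(self, name, starting_scn):
        self.name = name
        self.scn = starting_scn
        self.versions = {}    # key -> list of (local_commit_scn, value)
        self.pending = None   # writes held between prepare and commit

    def prepare(self, writes):
        # Phase one: hold the writes and promise to commit or roll back.
        self.pending = dict(writes)
        return True

    def commit(self):
        # Phase two: commit under a locally assigned system commit number.
        self.scn += 1
        for key, value in self.pending.items():
            self.versions.setdefault(key, []).append((self.scn, value))
        self.pending = None
        return self.scn

    def rollback(self):
        self.pending = None

    def read(self, key, snapshot_scn):
        # Newest value whose local commit number does not exceed the snapshot.
        visible = None
        for commit_scn, value in self.versions.get(key, []):
            if commit_scn <= snapshot_scn:
                visible = value
        return visible


def two_phase_commit(work):
    """work is a list of (site, writes) pairs; returns local SCNs by site name."""
    votes = [site.prepare(writes) for site, writes in work]
    if not all(votes):
        for site, _ in work:
            site.rollback()
        return None
    return {site.name: site.commit() for site, _ in work}


def demo():
    a, b = Site("A", starting_scn=100), Site("B", starting_scn=5)
    two_phase_commit([(a, {"acct": 100}), (b, {"acct": 0})])
    snapshot = a.scn  # a reader saves site A's number before a transfer commits
    scns = two_phase_commit([(a, {"acct": 60}), (b, {"acct": 40})])
    total = a.read("acct", snapshot) + b.read("acct", snapshot)
    return scns, total
```

In this sketch the transfer commits under two unrelated local numbers, one at each site. A reader that applies the number saved from site A to both sites sees the pre-transfer balance at A but the post-transfer balance at B, so the two balances no longer sum to the original total. Had the reader instead saved site B's smaller number, site A would appear to hold no committed data at all.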
Shipley, U.S. Pat. No. 4,819,159, describes a distributed, multi-process on-line transaction processing system employing multiple concurrent processors communicating by conventional LAN links. In one embodiment, Shipley provides fault tolerance to the system. The system is transaction based, where each transaction is treated atomically. The atomicity of transactions is ensured by establishing a transaction coordinator, which maintains a log of the distributed file access required during processing of a transaction, combined with file and block level locks to prevent other transactions from altering the data at inappropriate times during processing. During processing, a consistent view of all required files is maintained.
The commit process of Shipley implements a two-phase commit during a disk write. During phase one, the transaction coordinator sends commit messages to each file system involved in the transaction, and corresponding I/O servers send acknowledge signals to the coordinator. In phase two, the transaction coordinator writes to the log, committing the transaction. Shipley does not log prepare and commit times for each transaction to ensure distributed read consistency. Additionally, Shipley does not disclose row level locking capable of writing locking information as the transaction proceeds, and is not applicable to read and write operations.
U.S. Pat. No. 4,569,015 to Dolev provides a method for achieving Byzantine Agreement among active network processors to execute atomically a task distributed among them even in the presence of detected faulty processors. The Byzantine Agreement method of Dolev is applied to a two-phase commit protocol in one embodiment. The two-phase commit requires the following steps:
(1) designating one node as a transaction coordinator and broadcasting a "prepare-to-commit" message at time t to all participating processors;
(2) each processor responding to this message by either logging a "prepared" record and voting "yes," or aborting and voting "no;"
(3) broadcasting the event "commit" or "abort" using the inventive method if votes received by the transaction coordinator by time t+2o so dictate; and
(4) aborting if a processor has not decided to commit by time t+6o.
Dolev does not log prepare, commit, and start times for each transaction, and does not use row level locking. Further, Dolev is directed toward ensuring consistency after a fault and not to a system for providing distributed read consistency.
Thompson, U.S. Pat. No. 4,881,166, discloses a concurrence control that ensures the correct execution of multiple concurrent global transactions in a distributed database system along with independent concurrent execution of local transactions at each site. Thompson uses a two-phase commit protocol between the servers and the local databases to commit the updates performed by the global transactions on the global database. Thompson monitors possible inconsistency conditions between transactions, and prevents a transaction from executing as long as an inconsistency or deadlock is possible.
If the end time for transaction T does not fall between the start and end time for any other involved transaction, then transaction T is allowed to execute, since no deadlock or inconsistency is then possible. Instead of logging the prepare and commit times for each transaction, only the start and end times of each transaction are logged. Although the method of Thompson does ensure read and write consistency in a distributed database, Thompson does not disclose any method of locking, and does not disclose row level locking capable of writing locking information as the transaction proceeds.
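The admission test attributed to Thompson above can be sketched in a few lines of Python. The function name may_execute is hypothetical, and treating the interval boundaries as exclusive is an assumption; the description above does not say how equal times are handled.

```python
def may_execute(t_end, other_intervals):
    """Return True if t_end falls inside no other transaction's interval.

    other_intervals is an iterable of (start, end) times for the other
    involved transactions. Boundaries are treated as exclusive (assumption).
    """
    return not any(start < t_end < end for start, end in other_intervals)
```

For instance, a transaction ending at time 5 would be allowed to execute alongside transactions spanning (6, 9) and (1, 4), but would be held back if another involved transaction spanned (3, 8).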
Ecklund, U.S. Pat. No. 4,853,843, presents a system for merging virtual partitions on an object-oriented, distributed database system following failure between sites accessing the database. Following restoration of site communication, the virtual partitions are merged to form a consistent merged database. Ecklund does not log prepare, commit and start times for each transaction, and does not disclose any method of locking data to ensure distributed read consistency.
U.S. Pat. No. 4,949,251 to Griffin discloses a method for ensuring a transaction occurs atomically on a distributed system, even where there is a partial system failure during the transaction. Every transaction is assigned a unique identification number and placed on a task queue. The I.D. number is made up of the current logical time concatenated with the processor identification number. As each transaction is performed the transaction's task I.D. is written into each updated database. During recovery from a partial system failure, Griffin compares the current task I.D. to the pre-existing task I.D.'s written into each database to see if the database has already performed the current task. If so, Griffin aborts the current transaction without performing further updates. In this way, every transaction is completed exactly once even after a system failure. Although Griffin assigns I.D.'s to each transaction, and these I.D.'s include the start times of each transaction, Griffin does not log prepare and commit times of the two-phase commit associated with the transaction. Further, Griffin does not disclose any method of locking to ensure data consistency during a distributed read.
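The identifier scheme and the recovery check described above can be sketched in Python as follows. This is an illustration under stated assumptions, not Griffin's implementation: the names make_task_id and Database, the zero-padded string used for the concatenation, and the in-memory set of recorded identifiers are all hypothetical.

```python
def make_task_id(logical_time, processor_id):
    # Current logical time concatenated with the processor identification
    # number (the fixed-width string layout is an assumption).
    return f"{logical_time:010d}{processor_id:04d}"


class Database:
    """Toy database that records the task I.D. of every update it performs."""

    def __init__(self):
        self.data = {}
        self.applied_task_ids = set()

    def apply(self, task_id, updates):
        # If this task I.D. was already written here, the task has been
        # performed; abort without performing further updates.
        if task_id in self.applied_task_ids:
            return False
        self.data.update(updates)
        self.applied_task_ids.add(task_id)
        return True
```

If a task is replayed during recovery, the second call to apply with the same identifier returns False and leaves the data unchanged, so each task takes effect exactly once.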
U.S. Pat. No. 4,868,166 to Reinsch describes a method of restarting a fault-tolerant system without requiring the writing of images of loaded records to the log. Instead, the method of Reinsch logs only a minimal amount of information, recording positions within data sets to be moved and within the tablespace to be loaded. Reinsch is directed toward improving the load operation during a system recovery. Although Reinsch does disclose a two-phase commit, it does not disclose a method of accomplishing the two-phase commit, and does not provide a method for ensuring distributed read consistency.
Jenner, U.S. Pat. No. 4,648,031, is directed toward a method for restarting a failed computer subsystem, or a subset of that subsystem. In Jenner, the invention keeps track of all work to be done upon restart and all available system resources, and selectively restarts certain resources while deferring the restarting of other resources until they become available. Recovery is based on check-pointed information recorded on a recovery log. The method of Jenner does not disclose a method of accomplishing a two-phase commit, and does not provide a method for ensuring distributed read consistency or row level locking.
Daniell, U.S. Pat. No. 4,620,276, discloses a method for asynchronously processing replication messages between nodes in a distributed multiprocessor system. In the Daniell invention, messages received at each node are processed normally, or else discarded, in the order of their receipt at the node. Daniell is designed to take the place of synchronous message processing protocols, such as the two-phase commit. In fact, Daniell states that, using the method described, no two-phase commit need be implemented to assure consistency between paired nodes.
Four concurrency control protocols are described in Weihl, "Distributed Version Management for Read-Only Actions", IEEE Transactions on Software Engineering, Vol. SE-13, No. 1, January 1987. The protocols work by maintaining multiple versions of the system state. Read only actions read old versions of the data, while update actions manipulate the most recent version. The scheme of Weihl uses read locks on data that is read within an update transaction. In addition, Weihl requires complicated algorithms for tracking old versions of data.