The present invention relates to providing atomicity of transactions on a database system, and in particular, to two-phase commits.
One of the long-standing challenges in distributed computing has been to maintain data consistency across all of the nodes in a network. Perhaps nowhere is data consistency more important than in distributed database systems, where a distributed transaction may specify updates to related data residing on different database systems. To maintain data consistency, all changes made in all database systems by the distributed transaction must be either committed or, in the event of an error, “rolled back”. When a transaction is committed, all of the changes to data specified by the transaction are made permanent. On the other hand, when a transaction is rolled back, all of the changes to data specified by the transaction already made are retracted or undone, as if the changes to the data were never made.
One approach for ensuring data consistency when processing distributed transactions is referred to as “two-phase commit”. According to the two-phase commit approach, one database system (the coordinating database system) is responsible for coordinating the commitment of the transaction on one or more other database systems. The other database systems that hold data affected by the transaction are referred to as participating database systems.
A two-phase commit involves two phases, the prepare phase and the commit phase. In the prepare phase, the transaction is prepared in each of the participating database systems. When a transaction is prepared on a database system, the database is put into such a state that it is guaranteed that modifications specified by the transaction to the database data can be committed. When all participants involved in a transaction are prepared, the prepare phase ends and the commit phase may begin.
In the commit phase, the coordinating database system commits the transaction on the coordinating database system and on the participating database systems. Specifically, the coordinating database system sends messages to the participants requesting that the participants commit the modifications specified by the transaction to data on the participating database systems. The participating database systems and the coordinating database system then commit the transaction. Finally, the participating database systems transmit a message acknowledging the commit to the coordinating database system.
On the other hand, if a participating database system is unable to prepare, or the coordinating database system is unable to commit, then at least one of the database systems is unable to make the changes specified by the transaction. In this case, all of the modifications at each of the participants and the coordinating database system are retracted, restoring each database system to its state prior to the changes.
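The all-or-nothing outcome described above can be sketched in a few lines of Python. This is a minimal illustration, not the conventional implementation: the `Participant` class, its `can_prepare` flag, and the `two_phase_commit` function are hypothetical names introduced here, and the sketch models only a participant's failure to prepare, not a failure of the coordinating database system to commit.

```python
from dataclasses import dataclass


@dataclass
class Participant:
    """Hypothetical stand-in for a participating database system."""
    name: str
    can_prepare: bool = True   # assumption: failure is modeled as a simple flag
    state: str = "active"

    def prepare(self) -> bool:
        # A prepared participant guarantees that it can later commit.
        if self.can_prepare:
            self.state = "prepared"
        return self.can_prepare

    def commit(self) -> None:
        self.state = "committed"

    def rollback(self) -> None:
        # Retract the changes, as if they had never been made.
        self.state = "rolled back"


def two_phase_commit(participants):
    """Commit everywhere only if every participant prepares."""
    # Prepare phase: all() stops at the first participant that cannot prepare.
    if all(p.prepare() for p in participants):
        # Commit phase: every participant is guaranteed able to commit.
        for p in participants:
            p.commit()
        return "committed"
    # At least one participant could not prepare: undo the changes everywhere.
    for p in participants:
        p.rollback()
    return "rolled back"
```

With one participant created as `Participant("b", can_prepare=False)`, the function returns "rolled back" and every participant ends in the rolled-back state, including any that had already prepared.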
The two-phase commit ensures data consistency while providing simultaneous processing of modifications to distributed databases. However, these benefits are not achieved without costs. Such costs include network traffic that occurs as a result of transmitting requests to prepare and commit from the coordinating database system to the participants, and of transmitting acknowledgements from the participants to the coordinating database system. Another cost is the increased latency experienced by a database system involved in a distributed transaction in waiting for other database systems to become prepared.
FIG. 1 shows a distributed database system used to illustrate in more detail the costs associated with a two-phase commit performed according to a conventional approach. Distributed database system 100 includes a coordinating database system 110 and a participating database system 150. Database system 110 receives requests for data from database clients 120, which include client 122 and client 124. Such requests may be in the form of, for example, SQL statements.
Coordinating database system 110 includes a log, such as log 112. The log 112 is used to record modifications made to the database system, and other events affecting the status of those modifications, such as commits. Log 112 contains a variety of log records. When these log records are first created, they are stored in volatile memory, and are soon stored permanently to non-volatile storage (e.g. a non-volatile storage device such as a disk). Once log records are written to non-volatile storage, the modifications and other events specified by the log records are referred to as being “persistent”. The modifications and events are “persistent” because, after a system failure, the permanently stored log records may be used to replay the modifications and events and restore the database to its pre-failure state.
For example, log 112 may contain redo records, which are used to record database operations such as INSERT, UPDATE, DELETE, CREATE, ALTER, or DROP. When a transaction modifies data in a database system, a redo record that specifies the modification is added to the log. To make the modifications permanent, a commit command is issued to database system 110. In response, database system 110 records the commit in a log record of log 112 referred to as a commit record. When a failure occurs after the redo records and the log record reflecting the commit are stored in non-volatile storage, the database may be modified based on the redo records.
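The relationship between volatile log records, flushing, and recovery can be illustrated with a small sketch. The `Log` class, the tuple layout of the records, and the `recover` function are hypothetical and chosen only for illustration. The sketch also assumes that recovery replays redo records only for transactions whose commit record reached non-volatile storage, which is one common policy and is not spelled out in the text above.

```python
class Log:
    """Hypothetical log with a volatile buffer and non-volatile storage."""

    def __init__(self):
        self.volatile = []   # records held only in memory; lost on failure
        self.durable = []    # records written to non-volatile storage

    def append(self, record):
        # New log records start out in volatile memory.
        self.volatile.append(record)

    def flush(self):
        # Writing buffered records to non-volatile storage makes them persistent.
        self.durable.extend(self.volatile)
        self.volatile.clear()

    def crash(self):
        # A system failure discards whatever existed only in volatile memory.
        self.volatile.clear()


def recover(log):
    """Rebuild table contents from the durable log records.

    Records are ("redo", txn, key, value) or ("commit", txn) tuples.
    """
    committed = {txn for kind, txn, *_ in log.durable if kind == "commit"}
    table = {}
    for kind, txn, *rest in log.durable:
        if kind == "redo" and txn in committed:
            key, value = rest
            table[key] = value
    return table
```

For example, if a redo record and a commit record for one transaction are flushed, and a redo record for a second transaction is appended but not flushed before `crash()`, then `recover` restores only the first transaction's modification.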
FIG. 2 is a state diagram showing transaction states associated with a transaction as it progresses through the phases of a two-phase commit, and the steps that are performed before transitioning between various transaction states according to a conventional approach. The transaction states are illustrated using distributed database system 100 as an example. Transaction states 201 are the transaction states that a transaction goes through within a coordinating database system (i.e. coordinating database system 110), and transaction states 202 are the transaction states a transaction goes through within a participating database system (i.e. participating database system 150).
Referring to FIG. 2, inactive states 210, 240, 250, 290 represent the inactive state of a transaction being processed on a distributed database system 200. In the inactive state, there are no database operations specified by the transaction that require any further action (e.g. commit, undo, locking or unlocking of resources needed to perform the operations, such as data blocks). On a given database system, a transaction is initially in the inactive state (i.e. inactive states 210 and 250), and eventually transitions to the inactive state upon completion (i.e. inactive states 240 and 290).
A transaction transitions from the inactive state to the active state when a database system receives a “begin transaction” request. For example, client 122 (FIG. 1) may issue a BEGIN TRANSACTION request to database system 110. At step 212, database system 110 receives the begin transaction request and enters active state 220. Next, coordinating database system 110 receives a command to modify data on participating database system 150. In response, at step 221, coordinating database system 110 transmits a request to participating database system 150 to begin a transaction. At step 222, coordinating database system 110 transmits one or more requests to participating database system 150 to modify data on participating database system 150.
At step 252, participating database system 150 receives the request to begin a transaction. Relative to participating database system 150, the transaction enters the active state 260. Afterwards, participating database system 150 receives the request to modify data.
Once a transaction within a database system enters the active state, the database system may receive any number of requests to modify data as part of the transaction. For example, client 122 may issue requests to coordinating database system 110 to modify data on both coordinating database system 110 and participating database system 150. In response to receiving the requests to modify data on participating database system 150, coordinating database system 110 transmits requests to modify data on participating database system 150 to participating database system 150.
At step 223, the coordinating database system receives a request from client 122 to commit the transaction. In response, at step 224, coordinating database system 110 transmits a prepare request to participating database system 150. At step 262, participating database system 150 receives the request.
At step 264, participating database system 150 flushes log 152 (FIG. 1) to non-volatile storage. “Flushing the log” refers to causing the log records of the log that are currently only stored in volatile memory to be stored to non-volatile storage. Thus, flushing the log renders the modifications for participating database system 150 persistent. When the modifications are rendered persistent, participating database system 150 is able to guarantee that it can commit the transaction. Consequently, after step 264, the transaction enters the prepared state. At step 266, participating database system 150 records the transition to the prepared state in log 152 (i.e. creating a log record recording the fact the prepared state has been reached).
At step 272, participating database system 150 transmits a prepared acknowledgment to the coordinating database system 110. A prepared acknowledgment is a message sent by a participating database system that indicates whether or not the participating database system is prepared to commit the transaction. A participating database system is prepared to commit when the transaction is in the prepared state on the participating database system. At step 226, coordinating database system 110 receives the prepared acknowledgment.
At step 228, coordinating database system 110 commits and flushes the log 112. Specifically, coordinating database system 110 creates a log record in log 112 to record the commit. When coordinating database system 110 flushes the log, it renders the commit persistent. When a commit is persistent, the transaction is in the committed state. Thus, after flushing the log, coordinating database system 110 transitions to committed state 230.
After the transaction reaches the committed state, at step 232, coordinating database system 110 transmits to participating database system 150 a forget request. Next, database system 150 forgets the transaction. A forget request is a message sent to a participating database system requesting that the participating database system perform forget processing. Forget processing is the additional operations needed to transition a transaction from the prepared or committed state to the inactive state (e.g. commit the transaction, release resources, and render the transaction inactive).
At step 274, participating database system 150 receives the forget request. At step 276, participating database system 150 commits (including creating a log record to record the commit), and then flushes log 152. At this stage, the transaction enters the inactive state on participating database system 150. At step 282, participating database system 150 releases any remaining locks on resources that were locked by participating database system 150 on behalf of the transaction. At step 284, participating database system 150 transmits a forget acknowledgement to coordinating database system 110. A forget acknowledgement is a message sent by a participating database system acknowledging that forget processing is completed on the participating database system.
At step 234, coordinating database system 110 receives the message acknowledging the forget. At step 236, coordinating database system 110 releases the locks on resources that were locked by coordinating database system 110 on behalf of the transaction. At this stage, the transaction enters the inactive state on coordinating database system 110.
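The state transitions walked through above can be summarized as a small transition table. This is only a sketch of the conventional flow as described: the event names (`begin`, `log_flushed`, `commit_flushed`, `forget`, `forget_ack`) and the `run` helper are hypothetical labels introduced here, not terms from FIG. 2.

```python
# Allowed transitions per role, keyed by (current state, event).
TRANSITIONS = {
    "coordinator": {
        ("inactive", "begin"): "active",              # begin transaction request received
        ("active", "commit_flushed"): "committed",    # commit record flushed to log 112
        ("committed", "forget_ack"): "inactive",      # forget acknowledged, locks released
    },
    "participant": {
        ("inactive", "begin"): "active",              # begin transaction request received
        ("active", "log_flushed"): "prepared",        # log 152 flushed in response to prepare
        ("prepared", "forget"): "inactive",           # commit, flush, and forget the transaction
    },
}


def run(role, events, state="inactive"):
    """Apply events in order and return the final transaction state."""
    for event in events:
        try:
            state = TRANSITIONS[role][(state, event)]
        except KeyError:
            raise ValueError(
                f"{event!r} is not allowed in state {state!r} for {role}"
            ) from None
    return state
```

For instance, `run("participant", ["begin", "log_flushed"])` ends in the prepared state, while a forget event that arrives before the participant has prepared raises `ValueError`, reflecting that the forget request follows the prepared acknowledgment in this flow.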
The per-transaction cost of the two-phase commit can be measured by the number of transmitted messages and log flushes that are attributable to performing the two-phase commit. Because four messages per participating database system are attributable to the two-phase commit (i.e. the prepare request of step 224, the prepared acknowledgment of step 272, the forget request of step 232, and the forget acknowledgement of step 284), the per-transaction cost in terms of messages is 4N, where N equals the number of participating database systems. Because one log flush for the coordinating database system (i.e. step 228) and two log flushes for each participating database system (i.e. step 264 and step 276) are attributable to the two-phase commit, the cost in terms of log flushes is 2N+1, where N is the number of participating database systems.
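The 4N and 2N+1 figures can be checked by tallying the steps of the conventional flow for N participating database systems. The function name `conventional_2pc_cost` is hypothetical; the step numbers in the comments are those used in the description of FIG. 2.

```python
def conventional_2pc_cost(n_participants):
    """Return (messages, log_flushes) for one transaction under conventional 2PC."""
    messages = 0
    flushes = 0

    # Prepare phase, once per participating database system.
    for _ in range(n_participants):
        messages += 1   # prepare request from the coordinator (step 224)
        flushes += 1    # participant flushes its log to become prepared (step 264)
        messages += 1   # prepared acknowledgment to the coordinator (step 272)

    # The coordinator records the commit and flushes its own log (step 228).
    flushes += 1

    # Forget processing, once per participating database system.
    for _ in range(n_participants):
        messages += 1   # forget request from the coordinator (step 232)
        flushes += 1    # participant commits and flushes its log (step 276)
        messages += 1   # forget acknowledgement to the coordinator (step 284)

    return messages, flushes
```

With a single participant the tally is four messages and three log flushes; with three participants it is twelve messages and seven log flushes, matching 4N and 2N+1.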
Typically, database systems transmit messages to each other through inter-process communication mechanisms, which may occur over a network. Inter-process communication mechanisms are expensive in terms of computer resources. Furthermore, the messages transmitted between the coordinating database system and the participating database systems are part of a handshaking scheme which may cause substantial delay to processing transactions. Specifically, after the coordinating database system transmits the prepare request to each participating database system, the prepare phase does not complete until each participating database system finishes preparing the transaction and transmits a prepared acknowledgement to the coordinating database system. Furthermore, after transmitting a forget request, the commit phase does not complete until each participating database system commits the transaction and transmits a forget acknowledgment to the coordinating database system. If any participating database system experiences a delay, every database system involved in the transaction experiences the delay.
Each log flush requires a write to non-volatile storage, a task which requires a relatively long period on the computer time scale. Thus, each log flush further contributes to the delay in completing the transaction. Finally, because resources locked for a transaction are not unlocked until the two-phase commit is complete, any delay increases the amount of time other processes will have to wait for those resources.
Based on the foregoing, it is clearly desirable to provide a method which reduces the number of messages, handshaking, and log flushes required to complete a transaction under a two-phase commit.
A method and apparatus for performing a two-phase commit is described. According to an aspect of the present invention, a coordinating database system determines whether a particular participating database system is prepared to commit a transaction without transmitting a prepare request to the participating database system. For example, to determine whether a particular participating database system is prepared to commit, the coordinating database system examines external log tracking data that resides on the coordinating database system. External log tracking data, which indicates various states of logs on other database systems, is used to determine whether or not a particular participating database system is prepared to commit. Specifically, the external log tracking data indicates which log records have been written to non-volatile storage on participating database systems. The coordinating database system uses this data to determine whether all log records generated for a transaction that is being coordinated by the coordinating database system have been written to non-volatile storage at a particular participating database system. If all log records for a particular transaction have not been written to non-volatile storage at a participating database system, then the participating database system is not prepared to commit. Accordingly, the coordinating database system transmits a prepare request to the participating database system. If however, the coordinating database system has been able to determine that the participating database system is prepared based on the log tracking data stored at the coordinating database system, then there is no need to transmit a prepare request. The coordinating database system presumes that the participating database system is prepared to commit, and does not send a prepare request.
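The decision of whether a prepare request is needed can be sketched as follows. This is a rough illustration under stated assumptions, not the claimed method: the `ExternalLogTracking` class, its method names, and the use of monotonically increasing log record numbers are all hypothetical, and the sketch does not address how the coordinating database system learns which log records a participating database system has written to non-volatile storage.

```python
class ExternalLogTracking:
    """Hypothetical coordinator-side record of log states on other database systems."""

    def __init__(self):
        # participant -> highest log record number known to be on non-volatile storage
        self.flushed_upto = {}
        # (txn, participant) -> highest log record number generated for the transaction
        self.last_record = {}

    def note_flushed(self, participant, record_no):
        current = self.flushed_upto.get(participant, 0)
        self.flushed_upto[participant] = max(current, record_no)

    def note_record(self, txn, participant, record_no):
        key = (txn, participant)
        self.last_record[key] = max(self.last_record.get(key, 0), record_no)

    def is_prepared(self, txn, participant):
        key = (txn, participant)
        if key not in self.last_record:
            # Nothing is known about this transaction there: be conservative.
            return False
        # Prepared if every record generated for the transaction is already durable.
        return self.last_record[key] <= self.flushed_upto.get(participant, 0)


def participants_needing_prepare(tracking, txn, participants):
    """Return only the participants that must still be sent a prepare request."""
    return [p for p in participants if not tracking.is_prepared(txn, p)]
```

In this sketch, a participant whose last log record for the transaction is numbered 7 needs a prepare request while only records up to 5 are known to be durable, and needs none once records up to 9 are known to be durable.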