Multi-processing computer systems typically fall into three categories: shared everything systems, shared disk systems, and shared-nothing systems. In shared everything systems, processes on all processors have direct access to all volatile memory devices (hereinafter generally referred to as “memory”) and to all non-volatile memory devices (hereinafter generally referred to as “disks”) in the system. Consequently, a high degree of wiring between the various computer components is required to provide shared everything functionality. In addition, there are scalability limits to shared everything architectures.
In shared disk systems, processors and memories are grouped into nodes. Each node in a shared disk system may itself constitute a shared everything system that includes multiple processors and multiple memories. Processes on all processors can access all disks in the system, but only the processes on processors that belong to a particular node can directly access the memory within the particular node. Shared disk systems generally require less wiring than shared everything systems. Shared disk systems also adapt easily to unbalanced workload conditions because all nodes can access all data. However, shared disk systems are susceptible to coherence overhead. For example, if a first node has modified data and a second node wants to read or modify the same data, then various steps may have to be taken to ensure that the correct version of the data is provided to the second node.
In shared-nothing systems, all processors, memories and disks are grouped into nodes. In shared-nothing systems as in shared disk systems, each node may itself constitute a shared everything system or a shared disk system. Only the processes running on a particular node can directly access the memories and disks within the particular node. Of the three general types of multi-processing systems, shared-nothing systems typically require the least amount of wiring between the various system components. However, shared-nothing systems are the most susceptible to unbalanced workload conditions. For example, all of the data to be accessed during a particular task may reside on the disks of a particular node. Consequently, only processes running within that node can be used to perform the work granule, even though processes on other nodes remain idle.
Databases that run on multi-node systems typically fall into two categories: shared disk databases and shared-nothing databases.
Shared Disk Databases
A shared disk database coordinates work based on the assumption that all data managed by the database system is visible to all processing nodes that are available to the database system. Consequently, in a shared disk database, the server may assign any work to a process on any node, regardless of the location of the disk that contains the data that will be accessed during the work.
Because all nodes have access to the same data, and each node has its own private cache, numerous versions of the same data item may reside in the caches of any number of the many nodes. Unfortunately, this means that when one node requires a particular version of a particular data item, the node must coordinate with the other nodes to have the particular version of the data item shipped to the requesting node. Thus, shared disk databases are said to operate on the concept of “data shipping,” where data must be shipped to the node that has been assigned to work on the data.
Such data shipping requests may result in “pings”. Specifically, a ping occurs when a copy of a data item that is needed by one node resides in the cache of another node. A ping may require the data item to be written to disk, and then read from disk. Performance of the disk operations necessitated by pings can significantly reduce the performance of the database system.
Shared disk databases may be run on both shared-nothing and shared disk computer systems. To run a shared disk database on a shared-nothing computer system, software support may be added to the operating system or additional hardware may be provided to allow processes to have access to remote disks.
Shared-Nothing Databases
A shared-nothing database assumes that a process can only access data if the data is contained on a disk that belongs to the same node as the process. Consequently, if a particular node wants an operation to be performed on a data item that is owned by another node, the particular node must send a request to the other node for the other node to perform the operation. Thus, instead of shipping the data between nodes, shared-nothing databases are said to perform “function shipping”.
Because any given piece of data is owned by only one node, only the one node (the “owner” of the data) will ever have a copy of the data in its cache. Consequently, there is no need for the type of cache coherency mechanism that is required in shared disk database systems. Further, shared-nothing systems do not suffer the performance penalties associated with pings, since a node that owns a data item will not be asked to save a cached version of the data item to disk so that another node could then load the data item into its cache.
Shared-nothing databases may be run on both shared disk and shared-nothing multi-processing systems. To run a shared-nothing database on a shared disk machine, a mechanism may be provided for partitioning the database, and assigning ownership of each partition to a particular node.
The fact that only the owning node may operate on a piece of data means that the workload in a shared-nothing database may become severely unbalanced. For example, in a system of ten nodes, 90% of all work requests may involve data that is owned by one of the nodes. Consequently, the one node is overworked and the computational resources of the other nodes are underutilized. To “rebalance” the workload, a shared-nothing database may be taken offline, and the data (and ownership thereof) may be redistributed among the nodes. However, this process involves moving potentially huge amounts of data, and may only temporarily solve the workload skew.
Distributed Transactions in Shared Nothing Database Systems
A distributed transaction may specify updates to data items that reside on different nodes in a shared-nothing database system. For example, a distributed transaction may specify an update to a first piece of data owned by a first shared-nothing node, and an update to a second piece of data owned by a second shared-nothing node. The nodes that own data that is involved in a distributed transaction are referred to herein as “participating” nodes or simply “participants”.
To maintain data consistency, the distributed transaction must be either committed or, in the event of an error, “rolled back”. When a transaction is committed, all of the changes to data specified by the transaction are made permanent. On the other hand, when a transaction is rolled back, all of the changes to data specified by the transaction that have already been made are retracted or undone, as if the changes to the data were never made. Thus, the database is left in a state that either reflects all of the changes specified in the transaction, or none of the changes specified in the transaction.
Two-Phase Commit
One approach for ensuring data consistency during distributed transactions involves processing distributed transactions using a two-phase commit protocol. Two-phase commit is described in detail, for example, in U.S. Pat. No. 6,493,726, entitled “Performing 2-Phase Commit With Delayed Forget”. In general, two-phase commit requires that the transaction first be “prepared”, and then committed. Prior to the prepared phase, the changes specified by the transaction are made at each of the participating shared-nothing nodes. When a participating node completes all requested operations, the participating node forces the changes, and a “prepare” record, to persistent storage. The participant then reports to the coordinator that the participant is in the “prepared” state. If all of the participants successfully enter the prepared state, then the coordinator forces a commit record to persistent storage. On the other hand, if any errors occur prior to the prepared state, indicating that at least one of the participating nodes could not make the changes specified by the transaction, then all of the changes at each of the participating nodes are retracted, restoring each participating database system to its state prior to the changes.
FIG. 1 shows a multi-node shared-nothing database system used to illustrate in more detail the costs associated with a conventional approach for performing a two-phase commit. Multi-node database system 100 includes a coordinating node 110 and a participating node 150. Coordinating node 110 receives requests for data from database clients 120, which include client 122 and client 124. Such requests may be in the form of, for example, SQL statements.
Coordinating node 110 includes a log, such as log 112. The log 112 is used to record modifications made to the database system, and other events affecting the status of those modifications, such as commits. Log 112 contains a variety of log records. When these log records are first created, initially they are stored in volatile memory, and are soon stored permanently to non-volatile storage (e.g. a non-volatile storage device such as a disk). Once log records are written to non-volatile storage, the modifications and other events specified by the log records are referred to as being “persistent”. The modifications and events are “persistent” because the permanently stored log records may be used, in the event of a system failure, after the failure to replay the modifications and events to restore the database to its pre-failure state.
FIG. 2 is a flowchart showing the interaction between a coordinator and a participant according to a conventional approach for performing a two-phase commit. The transaction states are illustrated using multi-node database system 100 as an example. Transaction states 201 are the transaction states that a transaction goes through within a coordinating database system (i.e. coordinating node 110), and transaction states 202 are the transaction states a transaction goes through within a participating database system (i.e. participating node 150).
Referring to FIG. 2, inactive states 210, 240, 250, 290 represent the inactive state of a transaction. In the inactive state, there are no database operations specified by the transaction that require any further action (e.g. commit, undo, locking or unlocking of resources needed to perform the operations, such as data blocks). A transaction is initially in the inactive state (i.e. inactive state 210 and 250), and upon completion transitions back to the inactive state (i.e. inactive states 240 and 290).
A transaction transitions from the inactive state to the active state when a database system receives a “begin transaction” request. For example, client 122 (FIG. 1) may issue a BEGIN TRANSACTION request to coordinating node 110. Alternatively, the “begin transaction” command may be implicit. For example, a database server may begin an active transaction upon receiving a statement that specifies an operation or change. At step 212, coordinating node 110 receives the begin transaction request and enters active state 220. Next, coordinating node 110 receives a command to modify data on participating node 150. In response, at step 221, coordinating node 110 transmits a request to participating node 150 to begin a transaction. At step 222, coordinating node 110 transmits one or more requests to participating node 150 to modify data on participating node 150.
At step 252, participating node 150 receives the request to begin a transaction. Relative to participating node 150, the transaction enters the active state 260. Afterwards, participating node 150 receives the request to modify data.
Once a transaction within a database system enters the active state, the database system may receive any number of requests to modify data as part of the transaction. For example, client 122 may issue requests to coordinating node 110 to modify data on both coordinating node 110 and participating node 150. In response to receiving the requests to modify data on participating node 150, coordinating node 110 transmits requests to modify data on participating node 150 to participating node 150.
At step 223, the coordinating database system receives a request from client 122 to commit the transaction. In response, at step 224, coordinating node 110 transmits a prepare request to participating node 150. At step 262, participating node 150 receives the request.
At step 264, participating node 150 flushes log 152 (FIG. 1) to non-volatile storage. “Flushing the log” refers to causing the log records of the log that are currently only stored in volatile memory to be stored to non-volatile storage. Thus, flushing the log renders the modifications for participating node 150 persistent. When the modifications are rendered persistent, participating node 150 is able to guarantee that it can commit its portion of the transaction. Consequently, after step 264, the transaction enters the prepared state. At step 266, participating node 150 records the transition to the prepared state in log 152 (i.e. stores on disk a log record that records the fact the prepared state has been reached).
At step 272, participating node 150 transmits a prepared acknowledgment to the coordinating node 110. A prepared acknowledgment is a message sent by a participating database system that indicates whether or not the participating database system is prepared to commit the transaction. A participating database system is prepared to commit when the transaction is in the prepared state on the participating database system. At step 226, coordinating node 110 receives the prepared acknowledgment.
At step 228, coordinating node 110 commits and flushes the log 112. Specifically, coordinating node 110 creates a log record in log 112 to record the commit. When coordinating node 110 flushes the log, it renders the commit persistent. When a commit is persistent, the transaction is in the committed state. Thus, after flushing the log, coordinating node 110 transitions to committed state 230.
After the transaction reaches the committed state, at step 232, coordinating node 110 transmits to participating coordinating node 110 a forget request. Next, participating node 150 forgets the transaction. A forget request is a message sent to a participating database system requesting that the participating database system performing forget processing. “Forget processing” generally refers to the additional operations needed to transition a transaction from the prepared or committed state to the inactive state (e.g. commit the transaction, release resources, and render the transaction inactive).
At step 274, participating node 150 receives the forget request. At step 276, participating database system commits (including creating a log record to record the commit), and then flushes log 152. At this stage, the transaction enters the inactive state on participating node 150. At step 282, participating node 150 releases any remaining locks on resources that were locked by participating node 150 on behalf of the transaction. At step 284, participating node 150 transmits a forget acknowledgement to coordinating node 110. A forget acknowledgement is a message sent by a participating node acknowledging that forget processing is completed on the participating node.
At step 234, coordinating node 110 receives the message acknowledging the completion of forget processing. At step 236, coordinating node 110 can delete state information maintained by the coordinator on behalf of the transaction. Such state information may include, for example, a list of the participants in the distributed transaction. At this stage, the transaction enters the inactive state on coordinating node 110.
The per transaction cost of the two-phase commit can be measured by the number of transmitted messages and log flushes that are attributable to performing the two-phase commit. Because four messages are attributable to the two-phase commit (i.e. step 221, step 232, step 272, and step 284), the per transaction cost in terms of messages is 4N, where N equals the number of participating nodes. Because one log flush for the coordinating node (i.e. step 228) and two log flushes for each participating node are attributable to the two-phase commit, the cost in terms of log flushes is 2N+1, where N is the number of participating nodes.
Based on the foregoing, it is clearly desirable to provide techniques to reduce the number messages, handshaking, and log flushes required to complete a transaction that involves multiple shared-nothing nodes.