Computerized databases are commonly used to store large amounts of data for easy access and manipulation by multiple users. In a centralized computer system, there is a single copy of the data stored at one location, typically on a single computer. By maintaining a single, centralized database, such a system avoids the inconsistencies that might otherwise occur with more than one copy of the data. Nevertheless, the centralized database approach has several drawbacks. First, since only one copy of the data exists, if the data becomes corrupted or inaccessible, the entire system becomes unavailable. Second, since every read and update operation must be served from that single copy, the system may respond slowly, especially when many users access it at the same time.
Consequently, many of today's organizations, especially those dispersed over several locations, utilize some type of distributed database system. In a distributed system, an organization's data is spread across the storage facilities of several computers or processors. These storage facilities may be located throughout a single building, across several adjacent buildings or at different locations across the country or around the world. These computers or processors are interconnected via a communications network and are referred to as sites or nodes. Each site, moreover, is able to process local transactions which access data retained only at that local storage facility as well as distributed transactions which access data stored on more than one computer.
Computerized databases, both centralized and distributed, are often used to execute transactions. A transaction is a set of data-dependent operations requested by a user of the system. For example, a user may request some combination of retrieval, update, deletion or insertion operations. The completion of a transaction is called a commitment and the cancellation of a transaction prior to its completion is referred to as an abort. If a transaction is aborted, then any partial results (i.e., updates from those operations that were performed prior to the abort decision) must be undone. This process of returning the data items to their original values is also referred to as a roll back. An important aspect of a transaction is atomicity. Atomicity means that all of the operations associated with a particular transaction must be performed or none of them can be performed. That is, if a transaction is interrupted by a failure, the transaction must be aborted so that its partial results are undone (i.e., rolled back) and, if the transaction is completed, the results are preserved (i.e., committed) despite subsequent failures. The classic example of atomicity concerns a transfer of bank funds from account A to account B. Clearly, the system must either perform both the withdrawal and the deposit operations of the transaction or neither operation.
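The bank transfer example above can be illustrated with a minimal sketch. The names used here (transfer, InsufficientFunds, the accounts dictionary) are hypothetical and chosen only for illustration, and an in-memory snapshot stands in for whatever undo mechanism a real database would use.

```python
class InsufficientFunds(Exception):
    """Raised when a withdrawal would overdraw an account."""


def transfer(accounts, source, target, amount):
    """Move `amount` from `source` to `target`, all or nothing."""
    # Keep the original values so that partial results can be undone.
    snapshot = dict(accounts)
    try:
        accounts[source] -= amount            # withdrawal
        if accounts[source] < 0:
            raise InsufficientFunds(source)
        accounts[target] += amount            # deposit
    except Exception:
        # Abort: roll every data item back to its original value.
        accounts.clear()
        accounts.update(snapshot)
        raise
    # Reaching this point corresponds to a commit: both operations took effect.


accounts = {"A": 100, "B": 50}
transfer(accounts, "A", "B", 30)          # commits
try:
    transfer(accounts, "A", "B", 500)     # aborts; the withdrawal is rolled back
except InsufficientFunds:
    pass
print(accounts)                           # {'A': 70, 'B': 80}
```

In this sketch the second transfer performs the withdrawal before discovering that it cannot complete, so the roll back is what prevents a withdrawal without a matching deposit.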
To protect against disruptions caused by the failure of any particular site, most distributed database systems allow additional copies or “replicas” of the data to be made at other sites. That is, a copy of each data item stored on one of the system's database facilities may also exist at the database facilities of other sites. By replicating the data across multiple instances of database facilities, a certain degree of fault-tolerance may be obtained. Furthermore, because a replica of the database is available locally, the response time of certain transactions may be improved.
Although replicated systems provide the above advantages over non-replicated systems, there are nonetheless inherent costs associated with the replication of databases. To update a single data item, at least one message must be propagated to every replica of that data item, consuming substantial communications resources. Furthermore, in order to manage multiple databases and handle the execution of concurrent transactions, a complicated administrative support mechanism is required. In addition, if the replicated system cannot guarantee consistent updates at all replicas, data integrity may be compromised.
Most commercially available replicated database systems utilize either a distributed transaction approach or a primary-backup approach to replicate the data. In the distributed transaction approach, all database replicas are updated with a single, distributed transaction. That is, whenever a data item is updated by a transaction, all copies or replicas of that data item are updated as part of the same transaction. This approach results in completely synchronized replicas. To ensure atomicity, distributed transaction-based systems must employ an atomic commit protocol, such as the well-known Two-Phase Commit (“2PC”) protocol. The basic idea behind 2PC is to reach a single decision for all replicas on whether to commit or abort a transaction and then to execute that decision at all replicas. If a single replica is unable to commit, then the transaction must be aborted at all replicas.
More specifically, under the 2PC protocol, a single database manager associated with a single database facility is chosen as the coordinator of the transaction. The coordinator first asks all of the participants (i.e., the other replicas) including itself (if the coordinator is a participant) to prepare for the commitment of a transaction. Each participant replies to the coordinator with either a READY message, signaling that the participant is ready and willing to commit the transaction, or an ABORT message, signaling that the participant is unable to commit the transaction. Before sending the first prepare message, the coordinator typically enters a record in a log stored on stable storage, identifying all of the replicas participating in the transaction. The coordinator also activates a time-out mechanism. Based on the replies received from the participants, the coordinator decides whether to commit or abort the transaction. If all participants answer READY, the coordinator decides to commit the transaction. Otherwise, if at least one participant replies with an ABORT message or has not yet answered when the time-out expires, the coordinator decides to abort the transaction.
The coordinator begins the second phase of 2PC by recording its decision (i.e., commit or abort) in the log. The coordinator then informs all of the participants, including itself, of its decision by sending them a command message, i.e., COMMIT or ABORT. In response, all of the participants write a commit or abort record in their own logs. Finally, all participants send a final acknowledgment message to the coordinator and execute the relevant procedures for either committing or aborting the transaction. The acknowledgment message, moreover, is not simply an acknowledgment that a command has been received, but is a message informing the coordinator that the command has been recorded by the participant in its stable log record. When the coordinator receives the acknowledgment messages from the participants, it enters a “complete” record in its log.
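The two phases described above can be sketched as follows. This is a simplified, single-process illustration under stated assumptions, not an implementation of any commercial product: the Participant class and two_phase_commit function are hypothetical names, plain lists stand in for logs on stable storage, and a reply of None models a participant that has not answered when the time-out expires.

```python
READY, ABORT, COMMIT = "READY", "ABORT", "COMMIT"


class Participant:
    """A replica that votes in phase one and obeys the decision in phase two."""

    def __init__(self, name, can_commit=True, responds=True):
        self.name = name
        self.can_commit = can_commit
        self.responds = responds    # False models a failed site or link
        self.log = []               # stands in for a log on stable storage

    def prepare(self, txn):
        if not self.responds:
            return None             # no answer before the time-out expires
        return READY if self.can_commit else ABORT

    def decide(self, txn, decision):
        if not self.responds:
            return None
        self.log.append((txn, decision))   # record the decision first
        return "ACK"                       # then acknowledge that it was recorded


def two_phase_commit(txn, participants):
    """Return (decision, coordinator_log) for one transaction."""
    # Before the first prepare message, log who is taking part.
    log = [(txn, "participants", [p.name for p in participants])]
    # Phase one: collect the votes; a missing vote is treated as a time-out.
    votes = [p.prepare(txn) for p in participants]
    decision = COMMIT if all(v == READY for v in votes) else ABORT
    # Phase two: log the decision, then send it to every participant.
    log.append((txn, decision))
    acks = [p.decide(txn, decision) for p in participants]
    if all(a == "ACK" for a in acks):
        log.append((txn, "complete"))
    return decision, log


sites = [Participant("site1"), Participant("site2"), Participant("site3")]
decision, coordinator_log = two_phase_commit("t1", sites)
print(decision)                            # COMMIT
```

Constructing any one participant with can_commit=False or responds=False causes the same call to return an ABORT decision for every replica, which reflects the rule that a single replica unable to commit forces an abort everywhere.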
Although widely implemented, the 2PC protocol nonetheless has several disadvantages. First, as set forth above, the protocol requires each replicated database facility to submit a READY message before the transaction can be committed. Thus, in a fully replicated environment, any site or link failure brings all activity to a complete halt until the site or link is repaired, since that site cannot transmit a READY message. That is, until the failed site is recovered, no further transactions may be executed by a system relying on 2PC. Second, 2PC requires the transmission of at least three messages per replicated database per transaction. The protocol thus consumes substantial communications resources and reduces the system's response time and throughput. Third, 2PC requires both the coordinator and all participants to record the commit/abort decision and the final outcome to stable storage. This involves two forced disk writes per participant per transaction, adding significant overhead to this protocol. Other protocols, such as Quorum Consensus, have been proposed as a solution to the first problem, but these other protocols impose even more communications overhead than 2PC and, as a result, they have not been utilized in commercial systems.
In the primary-backup approach, all transactions update a single, specific replica site, referred to as the primary site. These updates are later copied to the other replicas in the system, which are referred to as backup replica sites. The precise manner in which the updates are propagated to the backup sites varies from implementation to implementation. For example, some systems update the backup replica sites as soon as possible, typically with delays on the order of several seconds. Others update the backup sites at specific time intervals or after a specific number of transactions have committed at the primary site. Some systems, moreover, perform the backup function by transferring entire recovery logs in order to perform the transactions at the other backup sites. Still others create a deferred log of transaction requests that is later used to apply the updates. Commercial products incorporating the primary-backup approach to replication include Sybase Replication Server, the Oracle Snapshot Facility, Oracle Symmetric Replication, Oracle Standby Database, Ingres/Replicator and DB2 Data Propagator.
One of the apparent advantages of the primary-backup approach is the ability to create a highly available database system by replacing a failed primary with one of the backups, allowing the backup to become the new primary. This approach, however, has several drawbacks. First, update propagation to the backups typically generates a large amount of network traffic, consuming significant network resources. Second, regardless of the precise manner by which updates are propagated, the backups will always lag the primary. Transactions, moreover, are typically executed serially at the backup sites to avoid data inconsistencies resulting from possibly different execution orders at the primary and backup sites. Hence, in high volume applications, backup sites can lag the primary by tens if not hundreds of transactions. This has serious data consistency consequences both during normal processing and, in particular, after failures.
During normal processing, applications typically access the backups for read-only purposes to improve processing capabilities. Nonetheless, as mentioned above, data at the backup sites may be stale, causing potential problems depending on application requirements. Furthermore, after a primary site failure, both database and real world inconsistencies are likely to arise due to update decisions at the new primary based on stale data. For example, if the sale of the last widget in stock was recorded at the primary site but not propagated to any of the backup sites by the time of a primary failure, then the last widget may be sold a second time by a transaction executing at the new primary.
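The widget example above can be reproduced with a small sketch. The PrimaryBackup class and its methods are hypothetical names used only for illustration; the sketch assumes a single backup and a queue of updates that have committed at the primary but have not yet been propagated.

```python
from collections import deque


class PrimaryBackup:
    """One primary and one backup; updates reach the backup lazily."""

    def __init__(self, stock):
        self.primary = dict(stock)
        self.backup = dict(stock)
        self.pending = deque()      # committed at the primary, not yet propagated

    def sell(self, item):
        """Commit a sale at the current primary if the item is in stock."""
        if self.primary.get(item, 0) <= 0:
            return False
        self.primary[item] -= 1
        self.pending.append(item)
        return True

    def propagate(self):
        """Apply pending updates serially, in commit order, at the backup."""
        while self.pending:
            self.backup[self.pending.popleft()] -= 1

    def fail_over(self):
        """The primary fails: unpropagated updates are lost, the backup takes over."""
        self.pending.clear()
        self.primary = dict(self.backup)


system = PrimaryBackup({"widget": 1})
first_sale = system.sell("widget")     # recorded at the primary only
system.fail_over()                     # primary fails before propagation
second_sale = system.sell("widget")    # new primary still sees one widget
print(first_sale, second_sale)         # True True
```

Both sales succeed even though only one widget was ever in stock. Calling propagate() before fail_over() would let the new primary reject the second sale, which is why the lag between primary and backup matters.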
In addition to being prone to data inconsistencies, the primary-backup approach does not automatically allow for transparent failover to a backup site after a primary failure. First, after a primary failure, application clients must be switched over to a new primary. This process involves significant time during which the system is unavailable. Second, since the backup sites are not always consistent with each other, difficulties arise in choosing a new primary from among the various backups. Moreover, failures that result in network partitions may result in more than one backup declaring itself the new primary.
In addition to the distributed transaction and primary-backup approaches to database replication, at least one attempt has been made to utilize state machines as a basis for replicating data at different sites. This system, however, requires all transactions to be executed serially at all replicas and thus does not support the concurrent execution of transactions. Basically, a state machine is an entity containing a set of states and a set of commands which transform those states such that all of the new states are also contained within the machine. The state of the art of the state machine approach to replication management is described in F. Schneider, “Implementing Fault-tolerant Services using the State-Machine Approach: A Tutorial,” ACM Computing Surveys 22 (December 1990). The basic idea of the state machine approach is to start with some number of state machines, and arrange for commands to be sent to all state machines where they may concurrently and independently execute. In order to achieve consistent data replication, however, the commands must be deterministic. That is, the commands must produce identical results when operating on identical states. The requirement that commands be deterministic presents a significant problem in applying this approach to database systems (or, more generally, to transaction-based systems).
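The determinism requirement can be illustrated with a brief sketch. All names here (credit, stamp, run_replicas, consistent) are hypothetical, and a shared counter is used as an assumed stand-in for a value, such as a local clock reading, that differs each time a command executes.

```python
import itertools

# Hypothetical stand-in for a local clock: every read yields a new value.
_local_clock = itertools.count()


def credit(amount):
    """Build a deterministic command: same state in, same state out."""
    def command(state):
        return {**state, "balance": state["balance"] + amount}
    return command


def stamp(state):
    """A non-deterministic command: its result depends on the local clock."""
    return {**state, "updated_at": next(_local_clock)}


def apply_commands(state, commands):
    """Execute the commands serially, in order, on one state machine."""
    for command in commands:
        state = command(state)
    return state


def run_replicas(commands, replica_count=3):
    """Send the same commands to identical, independent state machines."""
    initial = {"balance": 0}
    return [apply_commands(dict(initial), commands) for _ in range(replica_count)]


def consistent(replicas):
    return all(replica == replicas[0] for replica in replicas)


print(consistent(run_replicas([credit(10), credit(5)])))   # True
print(consistent(run_replicas([credit(10), stamp])))       # False
```

The replicas start from identical states and receive identical commands in the same order, yet they diverge as soon as one command depends on anything other than the state it operates on.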