The present invention relates to distributed computer systems, and more specifically, to reliable message propagation in distributed computer systems.
One of the long-standing challenges in distributed computing has been the propagation of messages from one system to another. In many distributed computing systems, to maintain data consistency it is critical that each message be delivered exactly once to its intended destination site. For example, in a distributed database system, messages that are propagated to a destination site often specify updates that must be made to data that reside at the destination site. The updates are performed as a "transaction" at the destination site. Frequently, such transactions are part of larger distributed transactions that involve many sites. For the purpose of explanation, a message that specifies one or more operations that are to be performed as part of a transaction is referred to herein as a "transaction message".
If a transaction message is propagated multiple times to a particular destination site, the updates from the transaction may be incorrectly applied multiple times. For example, if a transaction message that debits an account "X" one-hundred dollars is sent twice to a destination site in which the account is maintained, the account "X" may be incorrectly debited two-hundred dollars instead of just one-hundred dollars.
In addition, to maintain data consistency, distributed database systems require that (1) all changes made by a distributed transaction must either be "committed" or, in the event of an error, "rolled back"; and (2) transaction messages are to be processed in the order in which they are received. When a transaction is committed, all of the changes to data specified by the transaction are made permanent. On the other hand, when a transaction is rolled back, all of the changes to data specified by the transaction already made are retracted or undone, as if the changes to the data were never made.
One approach for ensuring data consistency in a distributed computer system is to use a "two-phase commit" sequence to propagate messages between the distributed computer systems. According to the two-phase commit approach, a coordinating system (the source site) is responsible for coordinating the propagation of messages to the participating system (the destination site). For explanation purposes, the dequeue from the propagation queue is the transaction at the source site and the enqueue at the destination queue is the transaction at the destination site. However, in general, the operation at the destination site can be any arbitrary transaction.
The two-phase commit sequence involves two phases, the "prepare phase" and the "commit phase". In the prepare phase, the transaction is prepared at the destination site. When a transaction is prepared at a destination site, the database is put into such a state that it is guaranteed that modifications specified by the transaction to the database data can be committed. Once the destination site is prepared it is said to be in an "in-doubt" state. In this context, an in-doubt state is a state in which the destination site has obtained the necessary resources to commit the changes for a particular transaction but has not done so because a commit request has not been received from the source site. Thus, the destination site is in-doubt as to whether the changes for the particular transaction will go forward and be committed or instead, be required to be rolled back. After the destination site is prepared, the destination site sends a prepared message to the source site so that the commit phase may begin.
In the commit phase, the source site communicates with the destination site to coordinate either the committing or rollback of the transaction. Specifically, the source site either receives prepared messages from all of the participants in the distributed transaction, or determines that at least one of the participants has failed to prepare. The source site then sends a message to the destination site to indicate whether the modifications made at the destination site as part of the distributed transaction should be committed or rolled back. If the source site sends a commit message to the destination site, the destination site commits the changes specified by the transaction and returns a message to the source site to acknowledge the committing of the transaction. Alternatively, if the source site sends a rollback message to the destination site, the destination site rolls back all of the changes specified by the distributed transaction and returns a message to the source site to acknowledge the rolling back of the transaction. Thus, the two-phase commit sequence can be used to ensure that the messages are propagated exactly once and in order.
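The prepare and commit phases described above can be sketched in a few lines of Python. This is only an illustrative, in-memory model: the names `Participant`, `State`, and `two_phase_commit` are hypothetical, messages are replaced by direct method calls, and no logging to nonvolatile memory or failure handling is modeled.

```python
from enum import Enum


class State(Enum):
    ACTIVE = "active"
    IN_DOUBT = "in-doubt"        # prepared, waiting for the coordinator's decision
    COMMITTED = "committed"
    ROLLED_BACK = "rolled back"


class Participant:
    """Hypothetical destination site holding one pending transaction."""

    def __init__(self, can_prepare=True):
        self.can_prepare = can_prepare
        self.state = State.ACTIVE

    def prepare(self):
        # Prepare phase: reach a state from which a commit is guaranteed.
        if self.can_prepare:
            self.state = State.IN_DOUBT
            return True          # stands in for the "prepared" message
        self.state = State.ROLLED_BACK
        return False             # the participant failed to prepare

    def commit(self):
        self.state = State.COMMITTED
        return "ack"

    def rollback(self):
        self.state = State.ROLLED_BACK
        return "ack"


def two_phase_commit(participants):
    """Coordinator (source site) logic; returns True if the transaction committed."""
    # Phase 1: ask every participant to prepare and collect the answers.
    votes = [p.prepare() for p in participants]
    decision = all(votes)
    # Phase 2: tell every participant to commit or roll back.
    for p in participants:
        if decision:
            p.commit()
        else:
            p.rollback()
    return decision
```

In this sketch a participant that has returned `True` from `prepare()` stays in `State.IN_DOUBT` until the coordinator calls `commit()` or `rollback()`, which is the window in which a coordinator failure leaves the participant blocked.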
For example, FIG. 1 illustrates a conventional two-phase commit sequence for propagating messages from a source site 102 to a destination site 104. Source site 102 includes a server process 106 and a database 110. Server process 106 includes a transmit queue 114 that is used to store messages that need to be transmitted to destination site 104. In this example, transmit queue 114 currently contains a message ("TX_A") that needs to be enqueued at destination site 104. Similarly, destination site 104 includes a server process 108 and a database 112. Server process 108 includes a receive queue 116 that stores messages that are received from different sites.
In this example, a two-phase commit is performed to propagate TX_A from source site 102 to destination site 104. To perform the two-phase commit, at state "1", source site 102 begins a propagation transaction TX_1 to propagate a message that includes TX_A to destination site 104. Upon receiving the message, destination site 104 begins a transaction TX_2 to enqueue message TX_A. In this example, it shall be assumed that the enqueue of TX_A will require that certain information be updated within data block 114 in database 112. At state "2", the source site 102 sends a "prepare" message to the destination site 104. After preparing the enqueue transaction, destination site 104 must retain the lock on some or all of the data that is contained in data block 114 until it receives a message from source site 102 to commit or abort the enqueue transaction.
Once destination site 104 is prepared, destination site 104 sends a prepared message (state 3) to source site 102 to indicate that it is prepared to commit transaction TX_2. The destination site 104 then waits in an in-doubt state for a message from the source site 102 that indicates whether the transaction TX_2 (enqueue of message TX_A) should be either committed or rolled back. Thus, the destination site 104 cannot release the locks acquired as part of the enqueue transaction until source site 102 responds with a message that indicates whether or not the enqueue of message TX_A is to be committed or rolled back. This may cause other transactions requiring access to data block 114 to be blocked while the enqueue transaction is in an in-doubt state. In certain cases, as when source site 102 fails, destination site 104 may be forced to remain in an in-doubt state for a significant amount of time. Thus, for some systems, such as banking database systems, the delays that can result from failures after a prepared phase in the two-phase commit protocol to propagate messages are unacceptable.
Upon receiving the prepared message, the source site 102 commits transaction TX_1 (the dequeue of message TX_A from the transaction queue). By committing propagation transaction TX_1, a record is stored in nonvolatile memory in database 110 that indicates that transaction TX_2 in destination site 104 must be committed.
At state "4", as part of propagation transaction TX_1, source site 102 sends a request message to the destination site 104 that indicates whether or not the enqueue of message TX_A should be committed or aborted. Upon receiving the request message, the destination site 104 either commits or aborts the enqueue of message TX_A. At state "5", the destination site 104 returns an acknowledge message to source site 102 to indicate that the request message was processed.
Upon receiving the acknowledge message, the source site 102 forgets (removes) the two-phase commit records related to transaction TX_1 and TX_2.
A significant drawback with using a two-phase commit sequence is that once the destination site 104 returns a prepared message to the source site 102 (state 3), until a request message is received from source site 102 (state 4), the destination site 104 must delay the processing of all subsequent messages that are received from other sites and need access to block 114. Since messages are to be enqueued in order, this is likely to occur. Thus, if a failure occurs at source site 102 after destination site 104 has prepared and is in the in-doubt state, destination site 104 will not be able to process any subsequent transaction messages that are received from other sites until source site 102 recovers. This delay seriously degrades the throughput of a distributed system, as other sites may also be forced to wait for the source site to recover so that their messages can be processed at the destination site.
For example, as previously indicated, once destination site 104 has prepared, destination site 104 waits in an in-doubt state until a message is received from source site 102 that indicates whether the changes for TX_A should be either committed or rolled back. However, if source site 102 fails prior to notifying the destination site 104 as to whether the changes for TX_A should be either committed or rolled back (between states 3 and 4), destination site 104 will remain in-doubt until source site 102 recovers. Thus, if the destination site 104 receives a transaction message from another site after source site 102 fails, the destination site 104 will be required to delay the processing of the subsequent message until source site 102 recovers.
One method to eliminate the use of the two-phase commit protocol while still guaranteeing that messages are delivered exactly once is to use a commit sequence number (SCN, or system commit number) to indicate which messages have been delivered to a destination site. A two-phase commit that uses commit sequence numbers for the delivery of messages is described in detail in U.S. Pat. No. 5,870,761, entitled "Parallel Queue Propagation", the contents of which are incorporated by reference in their entirety. In this scheme, each transaction that enqueues a message in transmit queue 114 stamps the message with a commit sequence number. Commit sequence numbers are monotonically increasing numbers. The propagator process dequeues all messages with a commit sequence number less than, say, SCN_A and propagates them to the destination site. The destination site stores the highest commit sequence number obtained from a given source site in nonvolatile memory as part of the same transaction that enqueues the message into receive queue 116. After a failure, the source site queries the destination site for the latest commit sequence number that it received and resends all messages that have a higher commit sequence number from the transmit queue 114. This scheme requires that once a message has been enqueued into transmit queue 114 with a commit sequence number, say, SCN_A, no other messages will be enqueued into the transmit queue with a sequence number less than SCN_A. If this happens, the propagator process will not send these messages, as the messages will not satisfy the criterion of all messages with commit sequence number greater than SCN_A. In most database systems it is impossible to generate a sequence number for the message atomically with the commit of the transaction. In other words, the sequence number that is stamped on the message is only "close" to the true commit sequence of the transaction itself.
This is because the commit sequence can be exact only if the redo-log can be forced at the commit SCN and the index maintained on the commit SCN can be updated as an atomic change. One technique to achieve the atomicity is to obtain a lock before stamping the message with a commit sequence and to release the lock after the commit. This guarantees that any other transaction that enqueues a message at the same time will need to wait for the lock and hence will acquire a higher commit sequence number. Clearly this scheme reduces system throughput, as only one process can commit enqueues into the transmit queue at any one time. A solution to increase throughput is to let the transactions that commit the enqueue into the transmit queue acquire a shared lock and the propagator process that dequeues from the transmit queue acquire an exclusive lock before incrementing the commit sequence number. This will guarantee that once the propagator process has encountered a commit sequence number, any messages that are to be propagated in the future will have a higher commit sequence number. However, even this improved scheme has three drawbacks.
1) When the propagator process acquires the exclusive lock no other enqueue transactions that insert messages into the transmit queue can be committed (since they need to acquire a shared lock). This reduces system throughput.
2) The enqueue process that acquires a shared lock must update at least one block for each queue in which it inserted a message with the commit sequence number, commit the transaction and release the lock. Hence the duration of the commit steps is increased and the propagator cannot start transmitting messages during this time (since it needs an exclusive lock). This problem is especially bad for real-time propagation where each propagation batch has few messages and hence many transactions will be needed to propagate the messages (as opposed to batch propagation where fewer transactions will be needed and hence fewer attempts to get the lock in exclusive mode are needed).
3) The scheme cannot support propagation in a priority order since it requires that all messages with sequence number less than the commit sequence number chosen when the exclusive lock was acquired must be propagated before any other messages with a higher commit sequence number (even though the message with a higher commit sequence number may have a higher priority).
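The commit-sequence-number scheme described above, and the hazard that motivates the locking, can be illustrated with a minimal in-memory sketch. The names `ScnDestination` and `propagate` are hypothetical, the dictionary merely stands in for nonvolatile memory, and no locking is modeled, so the last step deliberately shows a message being skipped.

```python
class ScnDestination:
    """Hypothetical destination site that remembers the highest SCN per source."""

    def __init__(self):
        self.receive_queue = []
        self.high_water = {}     # source id -> highest SCN received (stands in for nonvolatile memory)

    def enqueue_batch(self, source_id, batch):
        # In a real system the enqueue and the high-water update share one transaction.
        for _scn, payload in batch:
            self.receive_queue.append(payload)
        if batch:
            newest = max(scn for scn, _ in batch)
            self.high_water[source_id] = max(self.high_water.get(source_id, 0), newest)

    def last_scn(self, source_id):
        return self.high_water.get(source_id, 0)


def propagate(source_id, transmit_queue, dest):
    """Send every message whose SCN exceeds the destination's stored high-water mark."""
    last = dest.last_scn(source_id)
    batch = sorted((scn, p) for scn, p in transmit_queue if scn > last)
    dest.enqueue_batch(source_id, batch)
    return [p for _, p in batch]


dest = ScnDestination()
queue = [(1, "a"), (2, "b")]
first = propagate("site-102", queue, dest)    # both messages are sent
resend = propagate("site-102", queue, dest)   # a retry after a failure sends nothing twice
queue.append((3, "c"))
third = propagate("site-102", queue, dest)    # only the new message is sent
queue.append((2, "late"))                     # stamped with an SCN at or below the high-water mark
skipped = propagate("site-102", queue, dest)  # the late message is never propagated
```

The final call shows why the scheme depends on no message ever appearing in the transmit queue with a sequence number below one that has already been propagated.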
Based on the foregoing, there is a clear need to provide a mechanism that can reduce the problems that are associated with a two-phase commit sequence. In particular, there is a clear need to reduce or remove the in-doubt problem that occurs when using a two-phase commit sequence to propagate messages between a source site and a destination site.
There is also a clear need to provide a mechanism that can guarantee that a particular transaction message that is to be sent from a source site to a destination site will be processed once and only once at the destination site.
There is also a need for a mechanism that allows messages to be propagated in order of priority.
The foregoing needs, and other needs and objects that will become apparent from the following description, are achieved in the present invention, which comprises, in one aspect, a method for propagating messages from a source site to a destination site, the method comprising the computer-implemented steps of identifying message information that needs to be sent to and processed at the destination site. After identifying the message information, the message information is assigned a propagation sequence number that identifies when the message information is sent to the destination site relative to other message information sent from the source site to the destination site. A message that is based on the message information is then transmitted to the destination site. The transmitted message includes the sequence number value and a source ID that identifies the source site as transmitting the message to the destination site. After the message is received at the destination site, the propagation sequence number that was assigned to the message information is stored in nonvolatile memory at the destination site.
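A minimal sketch of this aspect follows. The class and field names (`Message`, `Source`, `Destination`, `last_seq`) are hypothetical, the dictionary stands in for nonvolatile memory, and the duplicate check in `receive` is an assumption about how the stored number might be used; the summary above states only that the number is stored at the destination.

```python
from dataclasses import dataclass
from itertools import count


@dataclass(frozen=True)
class Message:
    source_id: str   # identifies the transmitting source site
    seq: int         # propagation sequence number, assigned at send time
    body: str        # the message information


class Source:
    def __init__(self, source_id):
        self.source_id = source_id
        self._next_seq = count(1)
        self.propagation_queue = []

    def next_message(self):
        # Dequeue first, then assign the propagation sequence number.
        body = self.propagation_queue.pop(0)
        return Message(self.source_id, next(self._next_seq), body)


class Destination:
    def __init__(self):
        self.last_seq = {}       # source id -> last sequence number stored
        self.processed = []

    def receive(self, msg):
        # Assumption: a number at or below the stored one marks a duplicate.
        if msg.seq <= self.last_seq.get(msg.source_id, 0):
            return False
        self.processed.append(msg.body)
        self.last_seq[msg.source_id] = msg.seq
        return True
```

Under these assumptions, delivering the same `Message` twice changes the destination only once, because the second delivery carries a sequence number that is no longer greater than the stored one.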
According to another feature of the invention, in response to transmitting the message to the destination site, the source site stores in nonvolatile memory, propagation information that includes the sequence number, propagation state information and a unique ID which uniquely identifies the message information.
In yet another feature, after storing the propagation information in nonvolatile memory, the source site sends a commit request to the destination site. The source site then waits for a commit acknowledge message to be received from the destination site. In response to receiving the commit acknowledge message, the source site updates the propagation state information to indicate that changes that were included in the message have been committed at the destination site.
In still another feature, the message information is identified by identifying message information that has been inserted into a propagation queue. The message information is dequeued from the propagation queue prior to assigning the propagation sequence number to the message information.
In still another feature, after the message is received at the destination site the message is enqueued for processing. The destination site then waits for a commit request message to be received from the source site. In response to receiving the commit request message, the changes associated with the message are committed at the destination site and a commit acknowledge message is sent to the source site.
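The exchange between the two sites in the preceding features can be sketched as follows. This is a simplified illustration, not the claimed implementation: `SourceSite`, `DestinationSite`, the state strings "sent" and "committed", and the "commit-ack" return value are all hypothetical, messages are modeled as direct method calls, and dictionaries stand in for nonvolatile memory.

```python
class DestinationSite:
    def __init__(self):
        self.pending = {}        # seq -> message body, enqueued but not yet committed
        self.committed = []

    def receive(self, seq, body):
        # The message is enqueued for processing; the site then waits for a commit request.
        self.pending[seq] = body

    def commit_request(self, seq):
        # Commit the changes associated with the message, then acknowledge.
        self.committed.append(self.pending.pop(seq))
        return "commit-ack"


class SourceSite:
    def __init__(self, dest):
        self.dest = dest
        self.propagation_log = {}   # unique id -> propagation information

    def propagate(self, unique_id, seq, body):
        # Transmit the message, then record the propagation information.
        self.dest.receive(seq, body)
        self.propagation_log[unique_id] = {"seq": seq, "state": "sent"}
        # Send the commit request and wait for the acknowledgement.
        if self.dest.commit_request(seq) == "commit-ack":
            self.propagation_log[unique_id]["state"] = "committed"
```

For example, after `SourceSite(dest).propagate("msg-1", 1, "TX_A")` the destination's `committed` list holds `"TX_A"` and the source's log entry for `"msg-1"` records the committed state.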
The invention also encompasses a computer-readable medium, a computer system, and a computer data signal embodied in a carrier wave, configured to carry out the foregoing steps.