A data transaction, also termed herein a transaction, comprises a unit of work, initiated with a request and completed with a response, which in turn comprises one or more operations. Each operation may have associated data, and a typical transaction comprises an operation where data is read or modified. Combinations of transactions may or may not be ordered. For example, consider a host that intends to move data D from cell X to cell Y in a database of a primary device. This involves three database transactions, with associated modifications of data, in the device:
A: Generate a log indicating the intention to perform B and C.
B: Erase D from X.
C: Write D in Y.
It is assumed that transactions B and C will not be initiated until A has been completed and acknowledged. It is further assumed that it makes no difference which of the two, B or C, is completed first. Thus, data associated with transaction A (i.e., the log) must be processed before data of B or C are processed, whereas no such requirement applies between B and C.
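The ordering requirement in this example can be sketched as a small prerequisite table. This is a minimal illustration only: the names `MUST_FOLLOW` and `respects_order` are hypothetical and do not appear in the text, which defines only the transactions A, B, and C.

```python
# Hypothetical model of the example: each transaction maps to the set of
# transactions that must complete before it may be processed.
MUST_FOLLOW = {
    "A": set(),   # A: generate the log
    "B": {"A"},   # B: erase D from X
    "C": {"A"},   # C: write D in Y
}


def respects_order(completion_sequence):
    """Return True if every transaction completes after its prerequisites."""
    completed = set()
    for txn in completion_sequence:
        # The prerequisites of txn must be a subset of what has completed.
        if not MUST_FOLLOW[txn] <= completed:
            return False
        completed.add(txn)
    return True
```

Under these assumptions, the sequences A, B, C and A, C, B are both acceptable, while any sequence in which B or C completes before A is not.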
The above example is illustrative of a general property of any transactions M and N. M and N may be ordered with respect to each other, i.e., M must complete prior to N or N must complete prior to M. Alternatively, the transactions are not ordered, i.e., it is immaterial which of M and N completes first. An ordered tuple convention used herein for transactions is that if M, N, P are ordered transactions then they are written as (M,N,P)o, so that M must be completed before N, and N before P. In this case (M,N,P)o≠(N,M,P)o. If M, N, P are not ordered transactions, then they are written (M,N,P)no, and (M,N,P)no=(N,M,P)no.
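The equality rules of this convention can be illustrated with two built-in Python types. This is a sketch, assuming the transactions in a group are distinct; the helper names `ordered` and `unordered` are hypothetical.

```python
def ordered(*txns):
    """Model (M,N,P)o: position is significant, so a tuple is used."""
    return tuple(txns)


def unordered(*txns):
    """Model (M,N,P)no: position is immaterial, so a frozenset is used.

    Assumes the transactions are distinct; duplicates would collapse.
    """
    return frozenset(txns)
```

With these helpers, `ordered("M", "N", "P")` differs from `ordered("N", "M", "P")`, whereas `unordered("M", "N", "P")` equals `unordered("N", "M", "P")`.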
Returning to the database example, if transactions A, B, and C are also mirrored in a secondary device, then it is absolutely necessary that the required order be preserved in the secondary device, so that data associated with A must be stored before data of B or C. Once A has completed, then B or C may be committed. If the necessary order is not preserved in the secondary system, then inconsistent situations will occur in the case of a failure of either the primary or the secondary device. An order-preserving, redundant system will ensure that ordered transactions are committed in exactly the same order in both the primary and the secondary device.
For a system comprising a host coupled to a primary and a secondary device (or more generally, for committing transactions across components of a transaction processing system), there are two basic order-preserving methods known in the art: synchronous methods and asynchronous methods.
In the synchronous approach, the primary device receives a transaction from the host. The primary device gives no acknowledgment of the transaction to the host until the primary device has completed the transaction, the secondary device has also completed the transaction, and, finally, the primary device has received an acknowledgment of the completion from the secondary device. Only then is the primary device allowed to acknowledge completion of the transaction to the host. Synchronous methods are inherently order-preserving, regardless of the need for order in transactions being processed on the devices. Synchronous methods are also inherently scalable, because the system can process several non-ordered requests in parallel and therefore the overall throughput is not generally affected. However, synchronous methods known in the art impose heavy penalties of latency on any system using them, since the primary device must wait for the secondary device to process and acknowledge the transaction.
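The synchronous acknowledgment sequence described above may be sketched as follows. This is an in-memory illustration under stated assumptions, not an implementation of any particular system: the classes `Secondary` and `SynchronousPrimary` and their methods are hypothetical, and a blocking method call stands in for the round trip to the secondary device.

```python
class Secondary:
    """Hypothetical secondary device that records committed transactions."""

    def __init__(self):
        self.committed = []

    def commit(self, txn):
        self.committed.append(txn)
        return True  # acknowledgment returned to the primary device


class SynchronousPrimary:
    """Hypothetical primary device using the synchronous approach."""

    def __init__(self, secondary):
        self.secondary = secondary
        self.committed = []

    def handle(self, txn):
        # Step 1: the primary device completes the transaction.
        self.committed.append(txn)
        # Step 2: the secondary device completes it and acknowledges.
        # The primary device waits here, which is the source of latency.
        if not self.secondary.commit(txn):
            raise RuntimeError("no acknowledgment from secondary device")
        # Step 3: only now is the host acknowledged.
        return "ack"
```

In this sketch the host cannot observe an acknowledgment for a transaction that the secondary device has not yet committed, which is why the approach preserves order without any further mechanism.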
Asynchronous methods allow the primary device to acknowledge the transaction to the host independently of acknowledgment from the secondary device, and thus inherently solve the latency problem of synchronous methods. However, since asynchronous methods are not inherently order-preserving, an order-preserving mechanism must be introduced into systems using these methods.
One known order-preserving mechanism is for the primary device to process and acknowledge transactions as they are received from the host. After each transaction acknowledgment has been sent, the transaction is placed in a queue for transmission to the secondary device, and the secondary device processes the queued transactions strictly according to the queued order. While this approach solves latency problems of the primary device, it introduces latency problems in the secondary device, which decreases the overall performance of the system. Furthermore, since there is no parallel processing of transactions in the secondary device, the overall system is not scalable.
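The queue-based mechanism described above can be sketched with a first-in, first-out queue. The class `QueuedAsyncPrimary`, the function `drain`, and the `send_ack` callback are hypothetical names introduced for illustration; the secondary device is modeled as a plain list.

```python
from collections import deque


class QueuedAsyncPrimary:
    """Hypothetical primary device that acknowledges first, mirrors later."""

    def __init__(self):
        self.committed = []
        self.outbound = deque()  # transactions awaiting transmission

    def handle(self, txn, send_ack):
        self.committed.append(txn)  # process the transaction locally
        send_ack(txn)               # acknowledge to the host immediately
        self.outbound.append(txn)   # then queue it for the secondary device


def drain(primary, secondary_log):
    """Apply queued transactions to the secondary strictly in queue order."""
    while primary.outbound:
        # One transaction at a time: no parallelism on the secondary side.
        secondary_log.append(primary.outbound.popleft())
```

The serial loop in `drain` illustrates the scalability limitation noted above: even transactions with no ordering requirement between them are applied one after another on the secondary device.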
A second order-preserving mechanism uses a “point in time” copy system. At some time t0 a process for creating a copy of a volume V0 of the primary device is initiated, and it is completed at time t1. The primary device then commits the copy to the secondary device, and if the commitment completes, the secondary device has a coherent image of V0 as it existed at t0 in the primary device. Such a mechanism allows parallel processing of requests, and is consistent in the case of failure. However, the time lag between consecutive images at the secondary device may be relatively long, so that the amount of data lost on failure may be correspondingly large.
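The point-in-time mechanism can be illustrated with a volume modeled as a dictionary of cells. This is a simplified sketch: the functions `take_snapshot` and `commit_snapshot` are hypothetical, the copy is treated as instantaneous (so t0 and t1 coincide), and the transfer to the secondary device is modeled as replacing a dictionary.

```python
import copy


def take_snapshot(volume):
    """Return an independent copy of the volume as it exists now (time t0)."""
    return copy.deepcopy(volume)


def commit_snapshot(secondary, snapshot):
    """Replace the secondary image with the snapshot contents."""
    secondary.clear()
    secondary.update(snapshot)


# Usage: data D is moved from X to Y after the snapshot is taken.
volume = {"X": "D", "Y": None}
snapshot = take_snapshot(volume)   # image of V0 at t0
volume["X"] = None                 # later transactions on the primary
volume["Y"] = "D"
secondary = {}
commit_snapshot(secondary, snapshot)
```

After this runs, the secondary holds the coherent image from t0, and the changes made on the primary after t0 are absent from it. Those changes correspond to the data that would be lost if the primary device failed before the next image was committed.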
An article titled “Seneca: Remote Mirroring Done Write” by Minwen Ji et al., in Proceedings of USENIX Technical Conference (San Antonio, Tex.), pages 253-268, published in June 2003, which is incorporated herein by reference, describes a taxonomy for remote mirroring.
U.S. Pat. No. 5,222,219 to Stumpf et al., whose disclosure is incorporated herein by reference, describes a method for preserving the order of data that is transferred from a first device to a second device. During a first cycle, a first block of data is transferred from the first to the second device, and is simultaneously stored in the first device. During a second cycle, a second block of data is transferred, and a signal is issued indicating success or failure of the first block transfer. In the event of failure, the first cycle repeats.
U.S. Pat. Nos. 5,742,792 and 6,502,205, both to Yanai et al., whose disclosures are incorporated herein by reference, describe a system which stores data received from a host to a primary data storage system, and additionally controls the copying of the data to a secondary data storage system. One or both of the primary and secondary data storage systems coordinate the copying of the data to the secondary data storage system, and maintain a list of the data which is to be copied to the secondary data storage device.
U.S. Pat. No. 5,900,020 to Safranek et al., whose disclosure is incorporated herein by reference, describes how a write operation begins with a request by a processor to invalidate copies of data stored in other nodes. The request is queued while acknowledging to the processor that the request is complete, even though it actually is not. The processor proceeds to complete the write operation by changing the data. The queued request, however, is not transmitted to other nodes until all previous invalidate requests by the processor are complete. The invalidate requests are added and removed from a processor's outstanding invalidate list as they arise and are completed.
U.S. Pat. No. 6,493,809 to Safranek et al., whose disclosure is incorporated herein by reference, describes a method for invalidating shared cache lines by issuing an invalidate acknowledgment before actually invalidating a cache line. An invalidate request is sent from a head node on a sharing list to a succeeding node on the list. In response to the request, the succeeding node issues an invalidate acknowledgment before the cache line is actually invalidated. After issuing the invalidate acknowledgment, the succeeding node initiates invalidation of the cache line.