The present invention relates to the field of data replication.
Significant reliability advantages for a large data processing system can be achieved if the system is broken into a network of smaller systems, each carrying a part of the processing load. This requires that the database copies across the network be kept in synchronization. One method for database synchronization makes use of asynchronous replication of database modifications from the database being changed (the source database) to the other copies of that database in the network (the target databases). If database changes are not applied to the target databases in the same order as they were applied to the source database, serious database inconsistency and corruption can occur. This is especially a problem if the data replication engines are multi-threaded so that database modifications flow from the source databases to the target databases over a multiplicity of independent paths. Various methods are described in the present invention which ensure that data replicated via multi-threaded replication paths do not result in replication-induced database corruption or inconsistency.
A. System Replication
As enterprises become more and more dependent upon their computing infrastructure, the continuous availability of these systems assumes an ever increasing importance. A powerful way in which to significantly increase a system's availability is to replicate it. That is, if an independent backup system is provided, one that can continue to provide data processing services in the event of a failure of the primary system.
There are many ways to replicate a system, as shown in FIG. 1:
    a) Cold Standby: The backup system is not involved in the primary system's application unless the primary system fails. If this happens, the failed applications are loaded into the backup system so that processing can resume under the control of the backup system. Cold standbys are usually used with tape backup. Periodic snapshots (and perhaps more frequent updates) of the primary system's database are written to magnetic tapes which are held in safe storage. If the primary system fails, the backup tapes are retrieved from storage and loaded onto the backup system so that the backup system can resume processing. In some cases, the backup tapes are preloaded onto the backup system to shorten the recovery time from a primary failure.    b) Warm/Hot Standby: As is the case with a cold standby, a backup system is provided, one which is not normally involved in the primary system's application. However, all of the primary system's applications are preloaded in the backup system so that it is prepared to quickly take over in the event of a primary failure. In some cases, the applications may have the files open only for read access. In these cases, following a primary failure, the applications must reopen their files for full access before they can take over the processing functions of the primary system. This is called a warm standby. A hot standby has all of its files open for full access and is ready to take over immediately following a primary system failure. In either case, the standby system may be used for read-only activities such as query or reporting, but it is not updating the database.    c) Active/Active: In this architecture, both systems are actively running the same application and are sharing the processing load. If one node fails, application services continue to be provided by the surviving system without interruption except that total capacity has been cut in half.    d) Multi-Node: The active/active architecture can be extended to allow the application to run simultaneously on many processing nodes. If there are k nodes, then the failure of any one node will result in the loss of only 1/k of the total system capacity. For instance, the loss of one node of a four-node system will result in a 25% reduction in processing capacity.
The “availability” of a system is the proportion of time that it will be operational. Typical systems today have availabilities of 99% to 99.99%. It is common to measure availability in terms of “9s.” A system with 99% availability has an availability of two 9s. A system with 99.99% availability has an availability of four 9s. In a 24-hour per day, 7-day per week operation, a system with two 9s availability will be non-operational more than eighty hours per year on the average. A system with four 9s availability will be down on the average less than one hour per year.
It can be shown that replicating a system (that is, adding a backup system) will double the 9s of the system. (See, W. H. Highleyman, B. D. Holenstein, P. J. Holenstein, Breaking the Availability Barrier, Chapter 1, The 9s Game, 1stBooks; 2003.) If k systems are provided, then the resulting application network will have an availability that is k times the number of 9s as a single system. For instance, if a single system has an availability of three 9s, providing a replicate system will increase its availability to six 9s.
There are many reasons to replicate a system, including:
1. to significantly increase the system's availability for its community of users.
2. to provide tolerance to natural or man-made disasters by separating the nodes by large distances so that such a disaster will affect only one node.
3. to allow maintenance or updating of the system one node at a time so that the application does not have to be taken down.
4. to provide locality of operations and data to user groups in the field.
B. Data Replication
Providing replicated processing capacity is only part of the requirement to replicate a system. In order for an application to function properly, it must generally have access to the database that represents the current state of the system (such as part levels for an inventory system, account balances for a banking system, and so forth). Therefore, the applications running on each node in a replicated system should have access to a current copy of the database, as shown in FIG. 1. Not only should the processing capacity be replicated, but the database should be replicated as well.
There are many ways to replicate a database, and each has its own characteristics with regard to recovery time following a failure and the amount of data that may be lost as a consequence of the failure. The means of data replication is chosen in part to conform to a company's tolerance for down time and data loss as determined by company objectives. These objectives are known as the Recovery Time Objective (RTO) and the Recovery Point Objective (RPO). See, W. H. Highleyman, B. D. Holenstein, P. J. Holenstein, Breaking the Availability Barrier, Chapter 6, RPO and RTO, 1stBooks; 2003.
1. Early Systems
In early replicated systems, a cold standby architecture was used; and data replication was predominantly via magnetic tape. Periodic copies of the primary database were written to tape, and that tape was transported to the backup system and used to update that system's database. Backup might typically have been undertaken daily or weekly.
A significant problem with this procedure was that a great deal of data was lost whenever the primary failed. The backup system had available to it a database that was perhaps hours or even days old.
This problem was alleviated by writing to magnetic tape the changes that were made to the database as the changes occurred. These changes could then be applied to the backup database by transporting the change tape to the backup system and by applying the changes to the backup database prior to activating the backup application.
Although less data now was lost, the recovery time for the application could be measured in hours or even days while the change tapes were transported to the backup site and loaded. Furthermore, if the change tapes were destroyed in the disaster, or if they proved to be unreadable, then all those changes were lost.
2. Asynchronous Replication
To solve the problems inherent with tape backups, real-time data replication engines were developed. These engines replace the magnetic tape with a communication network. As changes are made at the primary database, they are communicated in near-real-time to the backup system, where they are immediately applied to the backup system's database. The backup system is typically run as a warm or a hot standby.
Since database modifications are being applied to the backup independently of the source database modifications (and at some time later than the source updates), this is called asynchronous data replication. See, W. H. Highleyman, B. D. Holenstein, P. J. Holenstein, Breaking the Availability Barrier, Chapter 3, Asynchronous Replication, 1stBooks; 2003. Although this results in significantly less delay than data backup using magnetic tape, there is some delay from the time that a change is made to the primary database and the time that it is applied to the backup database. This delay is called “replication latency.” Replication latency can be considered the time that it takes a change to propagate through the replication pipeline and to be applied to the target database. Changes in the replication pipeline at the time of a primary system failure will most likely be lost. However, this generally represents at most a few seconds of data; and recovery at a hot standby can be very fast.
3. Active/Active Applications
The fact that the backup system is now in near synchronism with the primary system allows active/active architectures to be implemented. That is, applications may be active in all nodes of the replicated system, with each application having available to it a nearly up-to-date copy of the database.
The terms “primary” and “backup” databases do not apply to active/active systems since, in effect, every database copy in the network is backing up every other database copy. Rather, databases are referred to as “source” and “target” databases. A change is applied to a source database and is replicated to one or more target databases, which can themselves be source databases updating other target databases.
4. Synchronous Replication
If exact database copies are required or if no data loss is tolerable in the event of a node failure, there are techniques available that will guarantee that all database copies are identical and that no data will be lost. These techniques are generally known as synchronous replication since all changes are synchronized across the network. See, W. H. Highleyman, B. D. Holenstein, P. J. Holenstein, Breaking the Availability Barrier, Chapter 4, Synchronous Replication, 1stBooks; 2003.
Synchronous replication techniques may be dichotomized into weak synchronous replication and strong synchronous replication. Strong synchronous replication guarantees that the modifications are in fact applied to the target database when they are applied to the source database.
Weak synchronous replication guarantees only that the changes included in a transaction have been received by the target site, but it does not guarantee that they have been applied to the target database. At this point, the source system is free to commit its transaction. However, there is no guarantee that the changes that are safe-stored by the target system can in fact be subsequently applied to the target database. If pending transactions at the target system are aborted after the source system has committed them, then the databases are no longer in synchronism.
There exist methods for coordinating commits across a network by using Ready to Commit (RTC) or Ready to Synchronize (RTS) tokens returned from the target node to the source node in response to a source node query indicating that the target node is ready to commit. (See, U.S. Published Patent Application No. 2002/0133507 (Holenstein et al.)). These tokens can also be used to implement weak synchronization by using them to respond to a source system query to inform the source node that all of the modifications required by a transaction have been received and/or have been safely stored but have not necessarily been applied to the target database. Whether using strong or weak synchronous replication, the source node's transaction can be allowed to complete once it has received this response from the target system.
Guaranteeing that database updates will be applied to the target database is also a problem with asynchronous data replication. In effect, weak synchronous data replication is a halfway step between strong synchronous data replication and asynchronous data replication.
Both strong and weak synchronous replication have performance issues which asynchronous replication does not have. Because each update must be coordinated among all copies of the database across the network, the application is slowed. The increased time that an application must take to provide this coordination is known as “application latency” and adds to the transaction response time.
Asynchronous replication does not affect transaction response time since the remote database updates are made without any application involvement. However, both asynchronous replication and weak synchronous replication may be subject to data collisions—described later—which strong synchronous replication avoids. Only the effects of asynchronous replication and weak synchronous replication are considered herein since there are generally no database consistency problems introduced by strong synchronous replication.
5. Physical Replication
There is another form of data replication which is mentioned here for completeness, and that is physical replication. Some systems will replicate data at the physical level. That is, whenever a disk block has changed or a time interval has expired on a changed block, that block is queued for transmission to the target system, where it is applied over the existing block at the target system. Physical replication has several limitations. They include:
a. If there is data corruption due to some fault at the source node, then the corruption will be replicated to the target node.
b. Physical replication ignores event and transaction ordering since blocks may not be sent in event order but according to some other algorithm. Similarly, the event order of changes within a block is not preserved.
c. Physical replication ignores transaction boundaries, so inconsistent views of the database may be quite common.
d. There is no guarantee that indices will match the base table data at any given point in time.
e. Physical replication does not support heterogeneous replication—both the source and target databases and systems must be the same or very similar.
f. A great deal of data that has not changed is sent over the communication line as part of each physical block.
On the other hand, physical replication can be much faster. It can play a role in unidirectional replication. However, since it provides no semblance of data consistency at the target database, it is hardly suitable for active/active applications and is considered no further herein.
6. Asynchronous Replication Issues
There are several issues associated with asynchronous data replication. First, as noted above, changes in the replication pipeline may be lost in the event of a source system or network failure. This data loss can be minimized by minimizing replication latency, and lost data is generally recoverable once the failure has been corrected (assuming that the source system has been recovered).
Second, the ping-ponging of changes in active/active applications must be avoided. Ping-ponging is the replication of a change received by a target system back to the source system. It is a form of data oscillation or data looping. There are techniques available for ping-pong avoidance, such as described in U.S. Pat. No. 6,122,630 (Strickler et al.). (Certain synchronous replication techniques are also subject to ping-ponging.)
Third, asynchronous data replication is subject to data collisions. Data collisions are caused by replication latency. Because there is a delay from the time that a change is made to the source system and the time that it is applied to the target system, there is a time interval during which the value of that data item at the target system is stale. That is, the value at the target system is incorrect. If an application at the target system change a stale data item, then it will replicate the resulting value to all other systems in the network at the same time that the original source item is being replicated. Each of these replicated values will be different, and each probably will be wrong. This is known as a data collision, and the database is now in an inconsistent state. Data collisions must either be detected and corrected, or they must be avoided. (See, W. H. Highleyman, B. D. Holenstein, P. J. Holenstein, Breaking the Availability Barrier, Chapter 3, Asynchronous Replication, 1stBooks; 2003, for a discussion of data collision detection and correction.)
Data collisions can be avoided in one of several ways. For example, the database can be partitioned such that each node owns a partition; and only the owner of a partition can change that partition. If this is done, data collisions will not happen. If, however, all nodes must be able to change all data items, then synchronous replication can be used to guarantee that all copies of a data item are changed before any copy can be further changed.
If data collisions cannot be avoided, then they can cause the database to be placed into an inconsistent state. A fourth problem with asynchronous data replication is that data collisions and other situations or configurations may cause the database to become inconsistent. These situations are said to violate the referential integrity of the database. Referential integrity and its violation situations are described in more detail later.
C. Definitions
The following definitions are provided to promote understanding of the invention. For clarity, the definitions are phrased with respect to a scheme that replicates only two databases. However, the scope of the invention includes schemes where replication occurs between more than two databases.
transaction—A transaction is a unit of work consisting of one or more individual steps and/or operations, collectively called the transaction data, to be applied to one or more local and/or remote databases as a single atomic unit of work. A characteristic of transactions is the requirement that either all steps and/or operations are applied or all are rolled back in the case of a problem so that the database(s) is always left in a consistent state. Transactions are often identified by a number or name called the transaction identifier. The transaction identifier is often, though not necessarily, unique. An example of an “individual step” would be to insert a record (row) into the database. An example of an “operation” would be the procedure which increases the price column of all rows in the database by 10%.
serialized data—in a replication system, this refers to the order the data changes are applied into the target database. The purpose of serializing the data is to avoid corrupting the target database (meaning the goal is to avoid ending up with the wrong records or values in the target data).
In its purest form, it is referring to preserving the order of the events both within a transaction, as well as the order of the events across all other simultaneous transactions. This “pure” form, sometimes referred to as the “natural flow of transactions”, may not be attainable on many systems as the order the events are extracted from the change queue (even if they need to be further sorted) may not always be able to be re-ordered into this pure form, e.g. perhaps because the intra or inter-transactional order is not fully preserved, important, or recreatable. What is required for data serialization is that the order the transactions start, and/or end, whether they be committed or aborted, is recreatable for replay into the target database for “related” events.
re-serializing data—refers to re-ordering the data changes into a serialized order after the data has become unserialized, for example if multiple paths are used to deliver change data to a target system the data may need to be re-serialized before it is replayed.
related events—refers to whether the database change events refer to, or are otherwise dependent on, other events. For related events, with some specific exceptions, the order of replay should be serialized (if not during the replay, then at transaction end time).
Data (or database) constraints are generally used for preserving and enforcing the relationship of the related data. The concept of related data is often referred to as preserving the referential integrity of the data.
Events can be unrelated (which means replaying them in any order does not cause any corruption or data referential integrity issues), loosely-, or weakly-, related (which means that replaying them out of order can cause a perhaps brief referential integrity issue, but no corruption issues), and tightly-, or strongly-, related (which means that replaying them out of order will cause target database corruption).
An example of unrelated events would be records from different files or tables for different applications that have no relationship to each other—these could be replayed in any order provided they are all unrelated to each other.
An example of loosely related events would be the relationship for a customer name/address record to always exist for the purchase order header record(s) that refer to it. When loosely related events are replayed out of order, there can be brief periods of referential integrity violations, but these will generally self-correct over time as all the events are eventually replayed (i.e., when inserting a new customer record and the first purchase order for that customer, one can replay either first—in the end, both exist). Note that loosely related events can be made to be tightly related by turning on strict referential integrity checking in the target database such that the transaction is rejected if the referential integrity relations are not satisfied, at a minimum by commit time (database “constraints” are generally used for this purpose).
An example of tightly related events is when two updates in a transaction update the same record. If these updates are overlaying absolute values (i.e., replacing the contents of the target record or the same selected fields with the incoming value), these events must be replayed in the correct order (or only the last needs to be applied). Replaying them out of order will result in database corruption. Note that tightly related events can often be made to be loosely related by changing the updates from absolute values to relative values (i.e., for a numeric field, applying the mathematical difference between the original value and the new value rather than the absolute value, for example adding the update difference to the target record/field instead of replacing the target record/field value).
Data Collection/Log Techniques—Definition of Change Queue
The present invention includes many data collection techniques, not just the classic transaction monitor and transaction “redo”/“journal” log, or change queue or audit trail, approach defined and used by many commercial databases, such as ORACLE from Oracle Corporation, Redwood City, Calif.
The primary features of these data collection techniques are that they collect, retain, and serialize the database change activity. Some even provide transactional integrity (i.e., a set of database updates is either fully applied or fully undone). The collected data is saved in a “log” that may be disk-based, memory-based, an application log file, or other queue structure. The data may be stored in transient or non-transient storage. The present invention includes the following data collection techniques:    (1) Reading database “redo” logs. These logs are typically maintained by a transaction processing (TP) subsystem provided by the O/S or database vendor. They typically contain database “before” and/or “after” record images, as these images can be used to restore the before state of a record update (abort situation) or to apply the after state of a database update (commit situation). These monitors typically provide transactional database integrity. Access to these logs, and understanding the format of the data in them, is required to use this technique. Salient features include:
a. The application and the replication engine are usually “loosely coupled” (i.e., they run independently, and can be independently tuned).
b. The TP subsystem usually provides automatic redo log retention/management. The Compaq Transaction Monitoring Facility (TMF) provides this functionality on an NSK system.    (2) Reading database “image” logs. These logs are typically maintained by the application itself, and they contain similar information to the redo logs. The application may or may not provide transactional database integrity. Access to these logs, and understanding the format of the data in them, is required to use this technique. Salient features include:
a. The application and the replication engine are usually “loosely coupled” (i.e., they run independently, and can be independently tuned).
b. The application subsystem usually provides automatic image log retention/management.
The Tenera Plant and Information Management System (PIMS) application, commercially available from Tenera Inc., San Francisco, Calif., provides this functionality in the database “cradle” and log files.    (3) Using database triggers and/or stored procedures to perform the collection of the database change data, and saving the gathered data into a collection log. This technique requires that the database provide a trigger/stored procedure mechanism (or something similar) that is available to the replication engine. Salient features include:
a. The application and the replication engine are usually “tightly coupled” (i.e., they run interdependently), at least for the data collection part.
b. The replication engine must provide its own log retention/management.
The Shadowbase “open” collection environment (e.g., Solaris/Oracle, Windows/MS SQL Server, etc) offers this approach for capturing database change data for the replication engine.    (4) Using an “intercept” library that intercepts application disk I/O calls and saves the database change data into a collection file. This technique requires that the O/S and/or application environment provide an intercept mechanism that is available to the replication engine. Salient features include:
a. The application and the replication engine are usually “tightly coupled” (i.e., they run interdependently), at least for the data collection part.
b. The replication engine must provide its own log retention/management.
c. The application must have the intercept library bound or linked into it (or similar technique). Typically, this requires no coding changes, as the intercept library intercepts the calls, saves the database change data, and executes the I/O (noting if it succeeded or failed).
The Golden Gate Extractor/Replicator product, commercially available from Golden Gate Software, Sausalito, Calif., provides this technique for event capture. The NTI DrNet product, commercially available from Network Technologies International, Inc., Westerville, Ohio, provides this technique for event capture.    (5) Using a “callable” library application-programming interface (API) that performs the application disk I/O on behalf of the application. The library performs the data collection similar to the method used by the intercept library. Salient features include:
a. The application and the replication engine are usually “tightly coupled” (i.e., they run interdependently), at least for the data collection part.
b. The replication engine must provide its own log retention/management.
c. The application must have the callable library bound or linked into it (or similar technique). Typically, this requires application coding changes, as the application must call the API for all disk I/O.
The NetWeave product, commercially available from Vertex Interactive, Clifton, N.J., is an example of this technique.
communication channel—When the source and target databases being replicated are co-located, the communications channel(s) is generally a physical data bus, fabric, or local area network (LAN). When the source and target databases being replicated are geographically disbursed, the communication channel(s) is generally a wire or wireless communication medium, often as part of a wide area network (WAN).
communication network (or network)—a means allowing two or more processes or systems to interoperate with each other.
D. Transactions
Fundamental to database consistency is the concept of a “transaction.” A transaction is a group of related changes that are managed in such a way as to maintain a database in a consistent state. That is, a view of the database at any time will always give a consistent result.
A simple view of the problem solved by the transaction concept can be obtained by considering a banking application that is managing a person's checking account and savings account. If $1,000 is to be transferred from his savings account to his checking account, then that amount must first be subtracted from the savings account and then added to the checking account (or vice versa). If the balances for these accounts are viewed in the middle of this process, then the savings account balance will be reduced by $1,000, but this money will not have shown up yet in the checking account. Even worse, if the system fails at this point, the $1,000 will be lost.
By using the transaction model, this problem is avoided. A transaction manager assures that either all changes within the scope of a transaction are made (the transaction is committed) or that none are made (the transaction is aborted, thus returning the database to its original state). Additionally, the transaction manager usually assures that intermediate states are not readable, generally by holding the updated records locked until the end of the transaction, so that any view of the database is always consistent.
Programmatically, the definition of a transaction's scope is often accomplished by framing the changes comprising a transaction by a Begin Transaction command and an End Transaction command (which may be either a commit or an abort command):
      Begin    ⁢                  ⁢    Transaction        Change    ⁢                  ⁢    1        Change    ⁢                  ⁢    2    ⋮      Change    ⁢                  ⁢    n        End    ⁢                  ⁢    Transaction  
The transaction model has certain properties that are extremely important to database consistency. They are known as the ACID properties (see, Gray, J. et al.; Transaction Processing: Concepts and Techniques, Morgan Kaufman; 1993.):    Atomic—The transaction is an atomic entity. Either all changes are completely executed, or none are executed.    Consistent—Any view of the database at any time is always consistent. If a transaction is in process, one may see the pre-transaction data or be required to wait until the transaction completes, depending upon the system. But one will never see an intra-transaction inconsistent state. There is one common exception to consistency. Some database managers allow so-called “dirty reads”—the reading of locked data that may be in the process of being modified. Such reads may yield inconsistent results.    Isolated—The effects of a transaction are unaffected by other transactions that are being simultaneously executed.    Durable—The effects of a transaction survive system faults. There are many levels of durability. At the very least, the changed data must be written to a persistent storage device such as disk. Furthermore, that data will survive a disk failure if the disk is replicated (mirrored disks or RAID—Redundant Arrays of Independent Disks). The data will survive a disk failure and a processor failure if the system is replicated.
Transactions can be effective across replicated systems with multiple copies of the database since the changes to each database copy may be included in the scope of the transaction. Thus, either all copies will be changed, or none will. Furthermore, the new data values at each copy will not be viewable until it is guaranteed that all copies have indeed been changed. This is synchronous replication, as discussed previously.
However, the transaction model loses some of its protective capabilities if asynchronous replication is used. This is because the replication process may spawn independent transactions at each database copy. There is no coordination between these transactions. They each guarantee the ACID properties within their own database, but there is no such guarantee across the network.
Specifically, the database may have periods of inconsistency due to data collisions described previously or due to referential integrity violations, as described later.
E. Simple Data Replication Model
The basic components of an asynchronous data replication engine are shown in FIG. 2.
There are three components in this simple model.
    1. An Extractor that is responsible for obtaining changes that have been made to the source database.    2. An Applier that is responsible for applying changes to the target database (the Applier can, in fact, be part of the same component that contains the Extractor).    3. A communication channel that is used by the Extractor to send database changes to the Applier and to receive acknowledgement, control, and status information from the Applier.
As shown in FIG. 2, the Extractor may obtain source database changes in any one of several ways. They include:
a) The application program generating the changes may queue change information for the Extractor to read. Though this queue may be memory-resident, it is usually written to a persistent store such as a disk-resident Change Log so that the changes are preserved in the event of a source system failure. (Note: The term “disk” will be used solely for convenience hereafter and is to be interpreted as “persistent storage.”)
b) The transaction manager may create an Audit Trail on disk of all changes made to the database for the Extractor to read. The Audit Trail may comprise several disk volumes, such as a Master Audit Trail and one or more Auxiliary Audit Trails.
c) A change to the database may activate a database trigger, stored procedure, publish/subscribe facility, or similar function that will queue the change to the Extractor. Although this queue may be memory-resident, it is usually written to a disk-resident Database of Change (DOC) file for durability purposes in the event of a source system or network failure.
This simple data replication model is easily extended to bi-directional and multi-node data replication by using multiple data replication engines, as shown in FIG. 3. It may also be extended to provide synchronous replication by adding a facility to coordinate the commits by the various replication engines across the network, for example using the well-known two-phase commit (2PC) protocol. Note that in a multi-node network, there need not be a connection between every node so long as each node has a path to all others. Also, there need not be a database resident at every node. Some nodes may have a partial database or no database at all; instead, they access data from other nodes. Data replication engines may only be needed at those nodes that have a partial or full database copy.
F. Natural Flow
In the simple data replication model shown in FIG. 2, the replicated entity may either be a single data item change or a transaction. If changes are being replicated, then as each change is received by the Extractor, it is sent to the Applier and applied to the target database. Of course, these changes may be buffered or blocked to improve efficiency. If transactions are being replicated, then in a similar manner each begin transaction, set of changes, and end transaction are given to the Extractor as they occur and are sent to the Applier for application to the target database as a transaction.
The form of changes sent down the replication pipeline can vary from one data replication engine to another. Some will send after images of a row or record. Others will send just field or column changes within a row or record. Still others will send the operations or functions to be applied to a specific set of rows, records, fields, or columns or will send only the relative difference represented by the changes. Row, record, field, or column before images are often sent as well. Before images are useful to determine if a data collision has occurred by comparing the source before image to the current target image. Before images can also be used to back out modifications made by a transaction that is subsequently aborted.
In the simple model of FIG. 2, the database change events at the source database are applied to the target database in precisely the same order. Thus, the target database represents an exact copy of the source database as it transitions through its various states, albeit delayed by the replication latency. Although the replication channel may be single-threaded, transactions may be interleaved as they were at the source system.
This simple model preserves the natural flow of changes or transactions occurring at the source database as they are being applied to the target database. (See, Knapp, H. W., The Natural Flow of Transactions, ITI White Paper; 1996.) The requirement to adhere to strict event sequence can be relaxed if events are not related. It can also be relaxed if database consistency is not checked until transaction commit time rather than upon each update. In this case, changes within a transaction can be applied out of order so long as database consistency is satisfied at commit time (of course, changes to the same record or field that occur in a transaction still need to be applied in the same order). These characteristics are taken advantage of in some of the architectures which follow.
If the natural flow of transactions is not preserved at the target database, then the database can become corrupted. This may occur for a variety of reasons. For instance, one transaction may set the value of a data item to 10. A subsequent transaction may set the value of that same data item to 20. If these transactions are applied in opposite order at the target database, then the data item will be left with a value of 10, which is wrong. The database is now corrupted.
A similar problem can happen even if the natural flow of transactions is preserved, but the change order within a transaction is not preserved. Corruption will occur if multiple operations within the transaction are not commutative—that is, the result of these operations depends upon the sequence in which they are executed. The simplest case is when the replication engine is sending changed images of rows, records, fields, or columns. In this case, if multiple changes have been made to a field or column or even to a row or record, an older image may overwrite a newer image if they are not applied in the correct order and will leave the database in an incorrect state.
A more subtle case occurs when only change operations are sent or are computed before the changes are applied. These operations are often commutative and can be applied in any order. But consider a data item A, which is to be modified by adding data item B to it and then multiplying the result by data item C:(A+B)C→A 
If natural flow is not preserved, and if the multiplication is made before the addition, the result isAC+B→A 
This is not the desired result. As will be demonstrated later, more complex data replication models may not inherently preserve natural flow. Such systems can cause database corruption unless special measures are taken to restore the natural flow prior to updating the target database. (There are some system implementations which do not care about database consistency unless there is a failover. In this case, the backup system will fix the database inconsistencies before taking over the processing load. These implementations do not support active/active applications.) The purpose of the discussion here is to describe various measures to maintain natural flow. The net result is concisely summarized by the following statement:    Database changes generally must be applied to the target database in natural flow order to prevent database corruption.G. Referential Integrity
Referential integrity is another property of a database that must be preserved in order for the database to be correct. A database often comprises a complex network of references between the various records or rows stored in its files or tables. It is generally of paramount importance that when a reference is made to a record or row (or a field thereof), the referenced record or row actually exists and that the referenced entities are logically consistent (e.g., an accumulator value in one record or row equals the sum of the individual values in its referenced or referencing rows or records). Otherwise, the application may not be able to function correctly or may not be able to function at all. These references are sometimes known as “foreign keys” or “foreign relationships.” All foreign keys must be resolved in order to satisfy referential integrity.
FIG. 4 gives an example of the requirement for referential integrity. It shows an order processing database. An order is represented in the database by an Order Header row and one or more Order Detail rows. The Order Header row contains the order number and customer id for the order.
Each Order Detail row contains the order number and the part number and quantity required by the order for a particular part. Each part is represented in the database by a Part row that contains the part number, its price, and the quantity in stock.
Finally, there is a row for each customer; this row contains the customer number, contact information, and so forth.
In order to create and process an order, each Order Detail row must have access to its Order Header, which identifies the customer, and must also have access to its Part row to obtain pricing and availability information. In addition, the Order Header must have access to its Customer row in order to confirm the order and to send an invoice.
If the target row of any of these references, or foreign keys, does not exist, then the order cannot be processed. For instance, FIG. 4 shows that the Part row for the second Order Detail row cannot be found. Therefore, the price and availability for this part cannot be determined; and either the order cannot be processed or can only be partially processed.
The existence of a target for every reference satisfies one of the conditions for referential integrity. If a reference is missing, the database has suffered a referential integrity violation. This is a foreign key that can not be resolved.
The record or row that is making the reference is known as the child. The record or row being referenced is the parent. A requirement for referential integrity, therefore, is that a parent must exist before its child is created; and a parent cannot be deleted until all of its children have been deleted.
There are also other application-specific constraints that may be violated if natural flow is not preserved. Ensuring that account balances do not become negative is one example. For instance, if one transaction adds an amount to an account, and if a second transaction debits that account, applying these transactions in the wrong order may result in a negative balance which may cause the debit transaction to be rejected at the target system even though it was accepted at the source system. This will not happen if natural flow is observed at the target system, nor will it happen if referential integrity is turned off at the target system.
Compressed audit trails may also cause referential integrity problems by interrupting natural flow. For instance, a change to a text field may just be recorded in the audit trail as the changed characters and a positional offset. In order to update the target row, the entire field or row must often be reconstructed. One way to do this is to apply the changed characters at the appropriate point to the target record or row. This works if the source and target databases have the same or similar structure but can be a difficult if not impossible mapping task if the database structures are different.
Alternatively, the current source row can be read and replicated in addition to, or instead of, the compressed data and then applied to the target database. However, the contents of the source row may have changed in the meantime. Although the target result should be correct after some later point in time, natural flow may be violated if the source row contents corresponding to the change are not replicated along with the change.
The above examples show the importance of maintaining inter-transaction dependencies. It may also be important to maintain intra-transaction dependencies if referential integrity is checked upon each data item modification. Otherwise, the reversal of a debit and a credit or the reversal of a parent/child creation may cause transaction rejection due to referential integrity checks.
In addition, for purposes of this discussion, referential integrity requires that the ACID properties of transactions be maintained.
Thus, for purposes of the discussion herein, referential integrity is taken to mean the strongest form of referential integrity and includes the requirements that all foreign keys are resolved, that intra- and inter-transaction dependencies are preserved, and that there is no violation of the ACID properties of transactions. In certain cases, as will be shown later, some of these restrictions can be relaxed if they do not affect database consistency or they can be violated during transaction replay as long as the referential integrity is ultimately preserved after the transaction(s) that violated them completely terminate.
In the simple data replication model of FIG. 2, the natural flow of changes and transactions is preserved. Therefore, the same level of referential integrity enforced at the source database will also be provided at the target database. If it is desired to have a higher level of referential integrity at the target database, then business rules can be incorporated into the Extractor, the Applier, or both to reorder transactions or events to provide this. This is especially important if the target system supports a stronger level of referential integrity than the source system, providing that that level of referential integrity is turned on at the target system.
Referential integrity may be enforced at the database level, depending upon the database being used. In fact, it may be enforced as each change is made; or it may be enforced only at transaction commit time. In the latter case, a change may violate referential integrity so long as the violation has been corrected by commit time. The enforcement of atomic changes (all or none) is usually an intra-transaction function, whereas the enforcement of other referential integrity criteria may be either an intra- or an inter-transaction function.
Thus, full referential integrity at the target database can only be guaranteed if natural flow is followed:
Related database changes must be applied to the target database in natural flow order to maintain referential integrity.
H. Current Data Replication Architectures (State of the Art)
There are several data replication architectures currently being used. Many of these are shown in FIGS. 5a, 5b, and 5c. To simplify the following descriptions, the Audit Trail, DOC, and/or Change Log will be referred to generically as the Change Queue.
1. Single-Threaded Replication Engine
FIG. 5a shows a basic single-threaded replication engine that was described earlier. Since all changes are sent to and applied to the target database in the same order as they were applied to the source database, then full natural flow order is maintained, at least so far as it was maintained in the Change Queue.
One complication with this model is that since all events are replicated, even those changes associated with transactions that are subsequently aborted are replicated and applied. The replication engine must have a means to remove these aborted transactions from the target database. This is typically done by the replication engine aborting the transaction at the target database, by applying the before images or undo events to the target database, or by the target system's transaction monitor performing similar tasks.
2. Single-Threaded Replication Engine with DOC
The abort complication with single-threaded replication engines is removed if a Database of Change (DOC) is used to buffer changes at the source or target system before they are applied to the target database, as shown in FIG. 5b. The replication engine can now select from the DOC only those transactions that have committed and can apply them to the target database. Aborted transactions are ignored. However, replication latency—the time that it takes for an update to the source database to propagate to the target database—has been lengthened by the extra storage step of transactions having to pass through the DOC intermediate storage.
Also, to preserve transaction natural flow, the replication engine will generally apply just one transaction at a time from the DOC. This can have a negative performance impact due to the lack of database update parallelism and is especially harmful if very long transactions can block shorter transactions which follow them. For instance, consider a large transaction that updates 10,000 records and a following short transaction with one update that commits just after the first transaction commits. It may take several minutes for the first transaction to be applied to the target database once it commits at the source database. The commit at the target of the following small transaction will be delayed these several minutes as it waits for the large transaction to complete. Furthermore, though processing load was typically level at the source system, the target system will suddenly experience a peak processing load when the large transaction is replayed. This makes tuning the target system extremely difficult.
As noted above, the DOC can either be on the source system or on the target system. The advantage of placing it on the source system is that it can provide transaction retention if the network or target system fails. However, in this case, once the problem is corrected, not only will the target system be heavily loaded during the large transaction replay, but so will the communication line. Furthermore, following transactions are now delayed not just by the processing times of the large transaction's updates, but also by the transmission time of these updates across the communication channel. Additionally, if the source system is lost in a disaster, all of the DOC data is lost as well. The net result of all of these delays is an increase in the amount of data that may be lost if the source system fails.
Note that with a DOC, pure natural flow is not followed since the modifications included within a transaction will not be applied to the target database until the transaction has committed. However, all change events within each transaction will be applied in natural flow order; and transactions themselves will be applied in the correct order with respect to each other. It is just that transactions are not necessarily interleaved as they are at the source system.
3. Multi-Threaded Replication Engine
Both of the above architectures have a performance limitation because they are single-threaded. That means that there is only one replication channel over which all database changes must flow or only one Applier that can be updating the target database at a time. In addition, if a DOC is used, the transactions themselves are generally applied one at a time. Without the DOC, the applying of transactions to the target database can be done in such a way that several transactions are active at a time, just as they were at the source system.
Some replication systems today use multiple threads to significantly enhance the capacity of the replication channel, such as the RDF product from HP of Cupertino, Calif. This is shown in FIG. 5c. In such an architecture, there are multiple replication channels over which changes can be sent from one or more Change Queues through (optionally) one or more DOCs or equivalent to the target database. Of course, if nothing is done to correct it, there is no semblance of natural flow at the target database because related I/O's, e.g., those for the same record and/or file or that are part of the same transaction, are not sent down the same replication path or otherwise coordinated at the apply side to guarantee they are applied in the correct order as part of the same transaction mix as was used at the source. In these state of the art systems today, changes across the replication paths are made in arbitrary order, and the database will typically not be in a consistent state.
Multi-threaded architectures are typically used for unidirectional data replication for system backup. If the source system fails, the target replication engine must first go through the database and make it consistent. This involves completing any transactions that it can and then aborting all transactions that have not completed. Since the unstructured update of the target database usually means that transaction boundaries are also not ordered properly, this can be a complex and timely process.
This architecture, of course, cannot be used in an active/active environment. Often, however, the target system is used for query and reporting purposes. It is important that the users understand that they are working with an inconsistent database and that the results of a query or report may be meaningless or nonsensical.
Multi-threaded replication engines can provide significantly improved performance in active/active applications, however, and will result in a consistent target database if care is taken to re-serialize change events into natural flow order prior to applying them to the target database. The remainder of this specification discusses techniques for doing just this.
I. Background Summary
Maintaining a consistent and uncorrupted database at all nodes in a replicated system network is of paramount importance if the distributed applications are to work properly. When using asynchronous replication, database consistency and integrity can be compromised in several ways:
a) Data collisions may give different views of the database at different nodes, thus corrupting the database.
b) If changes within a source transaction are not applied as an atomic entity, the database may not be in a consistent state until all of the changes have been applied.
c) If changes or transactions are made in a different order at different nodes, the value of a data item may vary at different nodes and may lead to a corrupted database.
d) If changes or transactions are made in a different order at different nodes, then referential integrity may be violated.
In the simple model of FIG. 2, the data replication engine preserves transaction boundaries and natural flow so that corruption types b), c), and d) will not occur. Data collisions may still occur if replication is asynchronous, although there are techniques to detect and resolve data collisions as described in W. H. Highleyman, B. D. Holenstein, P. J. Holenstein, Breaking the Availability Barrier, Chapter 3, Asynchronous Replication, 1stBooks; 2003.
However, more complex data replication models do not necessarily preserve either natural flow or transaction boundaries and may be subject to any of these corruption mechanisms. In those cases which are important to the application, there must be a mechanism to serialize events and/or transactions prior to their being applied to the target database, or at a minimum prior to any record locks being released for the target transactions. Much of the following discussion examines techniques for event and transaction serialization.
In the following discussion, it is assumed that the data replication engine is asynchronous or weakly synchronous. The focus is on the prevention of database corruption in complex data replication engines due to improper ordering of changes or transactions (corruption types b), c), and d)). These are the examples of the corruptions that can occur when natural flow is lost.