1. Field of the Invention
The present invention generally relates to data replication and more particularly to a method and system for lightweight asynchronous data replication that avoids the need for any persistent storage at the replication source as well as any persistent communication channels, and which is independent of the format of the underlying data sources.
2. Description of the Related Art
Enterprise applications increasingly access operational data from a variety of distributed data sources. Data replication involves duplicating such operational data at multiple locations. Replication is used for two reasons. First, it provides higher availability and disaster recovery because applications can be transferred to the replica nodes if the master node is unavailable. Second, it provides better performance and scalability. By maintaining multiple copies at distributed sites, applications can access their data locally, without going over a network and without burdening any single server machine.
The main task in replication is to update data copies located in multiple locations as the original data changes. This is done by continually sending change records (deltas or δs) to the replica locations. It is possible to keep the source and target databases synchronized with each other using a two-phase commit update protocol. However, this imposes a substantial burden on the replication source because every transaction must wait until its updates are received and acknowledged by all replica locations. The typical replication pattern is therefore to decouple the updates at the replication source from the distribution of the deltas.
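By way of illustration only, the decoupling of source updates from delta distribution may be sketched as follows. The record layout (`Delta`), the operation names, and the use of in-memory dictionaries as replicas are assumptions made for this sketch and are not taken from any particular replication product.

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass(frozen=True)
class Delta:
    """Hypothetical change record: one committed change to one keyed item."""
    seq: int                      # position of the change in commit order
    op: str                       # "upsert" or "delete" (assumed operation names)
    key: str
    value: Optional[Any] = None


def apply_delta(replica, delta):
    """Apply a single change record to one replica (a plain dict here)."""
    if delta.op == "upsert":
        replica[delta.key] = delta.value
    elif delta.op == "delete":
        replica.pop(delta.key, None)
    else:
        raise ValueError(f"unknown operation: {delta.op}")


def replicate(source_log, replicas):
    """Distribute deltas in commit order.

    This runs separately from the source transactions, so the source
    never waits for a replica to acknowledge a change.
    """
    for delta in sorted(source_log, key=lambda d: d.seq):
        for replica in replicas:
            apply_delta(replica, delta)
```

In this sketch the source only appends to `source_log`; how and when `replicate` is invoked is left open, which is the sense in which the distribution is asynchronous.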
There are two traditional methods used for such asynchronous replication. The first method, shown in FIG. 1, is to use a persistent change table at the replication source. A capture program running at the replication source continually scans the change log for new changes, and inserts them into the persistent change table as complete transactions. Concurrently, an apply program reads the change table and applies these transactions to the target database sites. The main drawbacks of this approach are reduced data throughput and reduced scalability. First, all changes must be inserted into the persistent change table, and then removed from it after being applied to the targets. This reduces replication throughput. Second, a single change table at the source must serve requests from all the replication targets. Thus, this approach does not scale well to large numbers of replication targets.
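The change-table pattern may be sketched roughly as follows, using an in-memory SQLite database as a stand-in for the persistent change table. The table name, column names, and function names are hypothetical, and the pruning step is one assumed way of removing applied changes.

```python
import sqlite3


def make_source():
    """Create a stand-in for the change table kept at the replication source."""
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE change_table ("
        " seq INTEGER PRIMARY KEY AUTOINCREMENT,"
        " item_key TEXT NOT NULL,"
        " item_value TEXT)"
    )
    return db


def capture(db, log_records):
    """Capture program: insert changes read from the log as one transaction."""
    with db:  # commits on success, rolls back on error
        db.executemany(
            "INSERT INTO change_table (item_key, item_value) VALUES (?, ?)",
            log_records,
        )


def apply_changes(db, target, last_seq):
    """Apply program: apply changes newer than last_seq to one target."""
    rows = db.execute(
        "SELECT seq, item_key, item_value FROM change_table"
        " WHERE seq > ? ORDER BY seq",
        (last_seq,),
    ).fetchall()
    for seq, key, value in rows:
        target[key] = value
        last_seq = seq
    return last_seq  # the target's new high-water mark


def prune(db, min_applied_seq):
    """Remove changes that every target has already applied."""
    with db:
        db.execute("DELETE FROM change_table WHERE seq <= ?", (min_applied_seq,))
```

Every change costs the source one insert and one delete, and every target issues its reads against the same table, which is where the throughput and scalability limits described above arise.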
To avoid this scalability problem, another replication solution uses persistent queues in the communication channel between source and target, as shown in FIG. 2. Changes read from the data log are entered directly into a persistent queue, and are picked up from the queue at the target site by the apply program. This solution also suffers from low throughput because of the inserts into and deletes from the persistent queue. In addition, the apply program must perform two operations atomically: delete a change from the persistent queue, and apply the change to the target database. This atomicity is typically obtained through a two-phase commit protocol, causing further loss in throughput.
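The atomic dequeue-and-apply step may be illustrated with a toy two-phase commit between two resource managers, one holding the queue and one holding the target data. This is a simplified sketch: the class and function names are hypothetical, everything is held in memory, and the durable logging and recovery handling that a real two-phase commit requires are omitted.

```python
import copy


class Participant:
    """Toy resource manager: committed state plus one staged update."""

    def __init__(self, state, fail_on_prepare=False):
        self.state = state
        self.staged = None
        self.fail_on_prepare = fail_on_prepare  # simulates a "no" vote

    def prepare(self, operation):
        """Phase 1: stage the operation on a copy and vote yes or no."""
        if self.fail_on_prepare:
            return False
        candidate = copy.deepcopy(self.state)
        operation(candidate)
        self.staged = candidate
        return True

    def commit(self):
        """Phase 2 (all voted yes): make the staged state the committed state."""
        self.state = self.staged
        self.staged = None

    def rollback(self):
        """Phase 2 (some vote was no): discard the staged state."""
        self.staged = None


def dequeue_and_apply(queue_rm, target_rm):
    """Remove the oldest change from the queue and apply it, or do neither."""
    if not queue_rm.state:
        return False
    key, value = queue_rm.state[0]
    work = [
        (queue_rm, lambda q: q.pop(0)),
        (target_rm, lambda t: t.__setitem__(key, value)),
    ]
    votes = [rm.prepare(op) for rm, op in work]
    if all(votes):
        for rm, _ in work:
            rm.commit()
        return True
    for rm, _ in work:
        rm.rollback()
    return False
```

Each change requires a prepare round and a commit round across both participants, which suggests why this coordination reduces throughput further.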
Besides the scalability and throughput problems, the above two solutions are also quite “heavyweight” because of the need for persistence at the source or in the queue. This persistence is typically obtained through the use of relational database management systems (DBMSs). Increasingly, however, distributed computing applications use a mix of data formats, including files, relational databases, document management systems, etc. Therefore, there is a need for asynchronous data replication that avoids the requirement for persistence at the source and in the communication channel, and that can handle a wider variety of data formats than relational databases alone.