Data replication is used by data storage systems to synchronize data between an original storage system and a replicated file system. A simple method of data replication is to periodically copy the entire contents of the original storage system to the replicated file system. This method, however, is inefficient because it duplicates all the data without regard to whether the data has been modified. Furthermore, it requires a large amount of bandwidth between the original and the replica. An alternative method is to reproduce the same operations on both the original and the replica. For example, when a file is created on the original, the same action is repeated on the replica and an identical file is created there. As the file is modified, data about the changes is sent to the replica and the duplicate file is updated accordingly using the data received. This method also requires a significant amount of bandwidth between the original and the replica, as well as a reliable connection between the two to keep the storage systems synchronized.
Some existing systems improve the operation reproduction method by using log records. Operations of the original system and relevant data associated with the operations are recorded and sent to the replica system at a later time. Based on the log record, the replica system executes the same operations to synchronize its file system with the original. For example, when a file is created on the original system, a log entry is created to record information such as the file name and permission levels. As the file is modified, one or more log entries are created to record the modification. At update time, the log entries are sent to the replica system, which carries out the operations of file creation and modification in the same order as the original. Although this method does not require a constant, reliable connection between the original system and the replica system, it still demands significant bandwidth since the logs can grow quite large. It is also inefficient, especially when multiple files share the same data.
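The log-record method described above can be illustrated with a minimal sketch. It is not taken from any particular system: the names `LogEntry`, `LoggedStore`, `drain_log`, and `replay` are hypothetical, and the file systems are modeled as in-memory dictionaries purely for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class LogEntry:
    """One recorded operation (hypothetical record layout)."""
    op: str            # "create", "write", or "delete"
    name: str          # file name the operation applies to
    data: bytes = b""  # payload for "write" operations
    mode: int = 0o644  # permission bits for "create" operations


@dataclass
class LoggedStore:
    """Toy original system: applies each operation and logs it."""
    files: dict = field(default_factory=dict)  # name -> (mode, contents)
    log: list = field(default_factory=list)

    def create(self, name, mode=0o644):
        self.files[name] = (mode, b"")
        self.log.append(LogEntry("create", name, mode=mode))

    def write(self, name, data):
        mode, _ = self.files[name]
        self.files[name] = (mode, data)
        self.log.append(LogEntry("write", name, data=data))

    def delete(self, name):
        del self.files[name]
        self.log.append(LogEntry("delete", name))

    def drain_log(self):
        """Return the pending entries and start a fresh log."""
        entries, self.log = self.log, []
        return entries


def replay(replica_files, entries):
    """Apply logged operations to the replica in their original order."""
    for entry in entries:
        if entry.op == "create":
            replica_files[entry.name] = (entry.mode, b"")
        elif entry.op == "write":
            mode, _ = replica_files[entry.name]
            replica_files[entry.name] = (mode, entry.data)
        elif entry.op == "delete":
            replica_files.pop(entry.name, None)
        else:
            raise ValueError(f"unknown operation: {entry.op}")
```

In this sketch, writing the same payload to two different files produces two log entries that each carry a full copy of that payload, which illustrates why the log can grow large when files share data.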
Another approach implemented by some existing systems uses file system snapshots. After the initial replication, a snapshot of the original file system is taken. At update time, another snapshot is taken of the original file system. The two snapshots are compared, and only files that differ are updated. For example, new files and modified files on the original are copied to the replica, and deleted files are removed from the replica. Although this approach eliminates the need for logs, its bandwidth requirement can still be large, especially when files are frequently modified.
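The snapshot comparison described above can be sketched as follows. This is an illustrative assumption rather than the mechanism of any particular system: a snapshot is modeled here as a mapping from file name to a SHA-256 digest of the file contents, and the helper names `take_snapshot`, `diff_snapshots`, and `apply_update` are hypothetical.

```python
import hashlib


def take_snapshot(files):
    """Map each file name to a SHA-256 digest of its contents.

    Assumption: `files` is a dict of name -> bytes standing in for a
    file system; a real snapshot would be taken by the file system itself.
    """
    return {name: hashlib.sha256(data).hexdigest()
            for name, data in files.items()}


def diff_snapshots(old, new):
    """Return (names to copy, names to delete) between two snapshots."""
    # New or modified files: absent from the old snapshot or digest changed.
    to_copy = sorted(n for n, digest in new.items() if old.get(n) != digest)
    # Deleted files: present in the old snapshot but not in the new one.
    to_delete = sorted(n for n in old if n not in new)
    return to_copy, to_delete


def apply_update(source, replica, to_copy, to_delete):
    """Copy changed files in full and remove deleted ones on the replica."""
    for name in to_copy:
        replica[name] = source[name]
    for name in to_delete:
        replica.pop(name, None)
```

Note that `apply_update` copies each changed file in its entirety, even when only a small part of it was modified. This is consistent with the observation above that the bandwidth requirement can remain large when files are frequently modified.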
It would be desirable to have a way of replicating data that does not consume too much network bandwidth. It would also be useful if the replication technique were flexible and efficient in selecting the data to be replicated.