This invention pertains to networking of computers and, more particularly, to file replication among computers in a network.
File replication is a convenient way to automatically distribute data stored in files to users, and is widely used in many environments, such as mobile computing, software distribution, and data sharing. A good replication method can improve reliability, availability, local autonomy, load balancing, and data access performance.
A simple example of file replication is shown in FIG. 1. In FIG. 1, the system 100 includes a volume of the same files on each of three networked systems S1 105, S2 110, and S3 115. For example, file 120 on system S1 105 is replicated as file 120A on system S2 110, and as file 120B on system S3 115. The major goal of replication is to keep the volumes consistent with each other. If someone modifies file 120 in S1 105, this modification should be reflected in the copies of that file, 120A and 120B, in S2 110 and S3 115. A replication module must detect the modification in S1 105, and modify the files in S2 110 and S3 115 accordingly.
There are many different ways of performing replication. For example, in peer-to-peer replication, a system may exchange replication information with any of the other systems directly. Another example is store-and-forward replication, where replication information is transmitted along pre-defined routes similar to emails. There are tight replication algorithms, in which any modification to a file will be seen at all locations immediately, and loose replication algorithms, in which modifications will be propagated to other locations periodically.
Currently, different systems offer different support modules for replication. A replication module designed to replicate files in one system usually cannot work in another system without heavy modification. In other words, replication modules inherently have poor portability.
A further problem is ordering the replication of data changes. When a number of pieces of data have been modified at one of the systems, a replication module may have to decide the order of processing the modified pieces. Because resources are limited, the order of processing may affect the overall performance substantially. If a replication module chooses to replicate the following three kinds of pieces before others, the overall performance will likely suffer:
Large pieces of data (which will increase delay time)
Pieces of data that are likely to be modified again (which might have to be replicated repeatedly)
Pieces of data that are less likely to be accessed at the destinations (which can waste needed resources at this point)
Existing replication modules have no strategy for making good choices in such situations without outside help, so nothing prevents them from selecting these three kinds of pieces first. Most existing replication modules process modified pieces of data on a first-come, first-served basis, even when information useful for making intelligent choices, such as data length, is conveniently available to them. In other words, replication modules deal with data all the time, yet they fail to take advantage of that experience in doing their jobs.
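The ordering problem described above can be illustrated with a minimal sketch. It is not the method claimed here. It only shows how information already available to a replication module could replace first-come, first-served ordering. The class name, the field names, and the lexicographic priority (modification frequency first, then access count at destinations, then size) are all assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class ModifiedPiece:
    """Hypothetical record for one modified piece of data awaiting replication."""
    name: str
    size_bytes: int            # data length
    recent_modifications: int  # assumed proxy for "likely to be modified again"
    remote_accesses: int       # assumed proxy for "likely to be accessed at destinations"


def replication_order(pieces):
    """Return pieces in an illustrative processing order.

    Pieces that are rarely modified, often accessed remotely, and small
    come first. The priority among the three criteria is an assumption.
    """
    def sort_key(piece):
        return (piece.recent_modifications, -piece.remote_accesses, piece.size_bytes)

    return sorted(pieces, key=sort_key)
```

A first-come, first-served module would process pieces in arrival order. Here, a small, stable, frequently read piece is replicated before a large one, and a piece that keeps changing is deferred so that it is less likely to be replicated repeatedly.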
U.S. Pat. No. 4,432,057 to Daniell et al., issued Feb. 14, 1984, titled “Method for the Dynamic Replication of Data Under Distributed System Control to Control Utilization of Resources in a Multiprocessing, Distributed Data Base System,” and U.S. Pat. No. 4,620,276 to Daniell et al., issued Oct. 28, 1986, titled “Method and Apparatus for Asynchronous Processing of Dynamic Replication Messages,” are remotely related patents. The Daniell patents primarily focus on how to process replication tasks based on status of system resources and pre-defined user preference. However, the Daniell patents require extensive overhead, are not transparent to administrators/users of data replications, and do not substantially improve overall performance.
Accordingly, needs remain for an infrastructure that supports various replication modules and implementations, and for a replication method that can utilize information about the data stream to transparently optimize file replication with little overhead.
An infrastructure on a file server can register one or more replication modules to support file replication across a plurality of file servers. Each replication module can support file replication for any number of volumes on the file server. Registering the replication module includes the replication module identifying itself to the replication infrastructure, identifying those activities the replication module wishes to be notified about, and identifying those volumes the replication module will support. The replication infrastructure is responsible for tracking activities on each volume and informing the supporting replication module when a watched activity occurs. The replication infrastructure is also responsible for receiving an update from the replication module, and applying the update to the supported volume.
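The registration and notification flow summarized above might be sketched as follows. This is an illustrative model only. The class names, method names, activity strings, and the in-memory dictionary standing in for volumes are all hypothetical, and the sketch assumes each volume is supported by a single replication module.

```python
from collections import defaultdict


class ReplicationInfrastructure:
    """Hypothetical in-memory model of the replication infrastructure."""

    def __init__(self):
        # volume name -> (supporting module, set of watched activities)
        self._registrations = {}
        # volume name -> {file path: contents}; stands in for real storage
        self.volumes = defaultdict(dict)

    def register(self, module, watched_activities, volumes):
        """Record which activities the module watches on which volumes."""
        for volume in volumes:
            self._registrations[volume] = (module, frozenset(watched_activities))

    def record_activity(self, volume, activity, path):
        """Track an activity and inform the supporting module if it is watched."""
        entry = self._registrations.get(volume)
        if entry is None:
            return  # no module supports this volume
        module, watched = entry
        if activity in watched:
            module.on_activity(volume, activity, path)

    def apply_update(self, module, volume, path, contents):
        """Apply an update received from the module to a volume it supports."""
        entry = self._registrations.get(volume)
        if entry is None or entry[0] is not module:
            raise PermissionError(f"{module.name} does not support {volume}")
        self.volumes[volume][path] = contents


class RecordingReplicationModule:
    """Hypothetical module that only records the notifications it receives."""

    def __init__(self, name):
        self.name = name
        self.seen = []

    def on_activity(self, volume, activity, path):
        self.seen.append((volume, activity, path))
```

In this sketch a module would call `register` once with its watched activities and supported volumes, receive `on_activity` callbacks only for watched activities on those volumes, and hand remote changes back through `apply_update`.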
The foregoing and other features, objects, and advantages of the invention will become more readily apparent from the following detailed description, which proceeds with reference to the accompanying drawings.