A distributed system typically consists of a number of data processing machines interconnected by a data communication network. An important class of distributed systems is that in which data at one site can be accessed transparently by data processing programs executing at another site. A general description of such distributed database systems is provided in the article `What is a Distributed Database System` parts 1 and 2 (C J Date, The Relational Journal, Nos 1 and 2, 1987).
In a distributed database system, data may be split up and stored at several sites with the objective of locating it near to the processes which access it in order to reduce the data traffic on the communication network. However, it is usually the case that some of the sites have to access data located at another site. This remote access increases the cost and delay involved in data processing operations, so that the processing performance of these sites may be significantly worse than that of an equivalent stand-alone system with its own data.
An additional problem is that failure of the communications links or of data processing machines at other network sites may prevent remote data from being accessed at certain times. The availability of the data is therefore worse than if each site were a stand-alone system. Although the purpose of a distributed system is to allow users to share data resources, these negative effects tend to deter users from relying on remote data access. This in turn detracts from the value of a distributed system compared with a simple centralized system.
A constant aim in the field of distributed systems, therefore, is to provide access to remote data with performance and availability which, as nearly as possible, match those obtainable for local data. One way of achieving this is to replicate data across the network, so that most data accesses can be satisfied by a local or nearby copy of the required data. This approach is described in an article by Sang Hyuk Son, SIGMOD Record, Vol 17 No 4 (1988). In this technique a balance must be struck between reductions in network traffic (and cost) for data accesses and the additional network traffic required to propagate updates to the multiple copies of the data.
Data replication is used in several types of distributed systems, ranging from local area network file servers using caching to distributed database management systems using replicated data. An important class of replicated data systems comprises those in which a primary copy of a data object is held at a single data processor, with all other copies of that object being designated as secondary copies. Updates are applied to the primary copy first, in order to ensure that the time sequence of the updates is correct. Revisions to the secondary copies are then made based on the revised primary copy.
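The primary-copy arrangement described above can be illustrated with a minimal in-memory sketch. The class and method names (`PrimaryCopy`, `SecondaryCopy`, `attach`, `refresh`) and the integer version counter are assumptions made for illustration only; they are not taken from any of the cited systems, and no real network is involved.

```python
class SecondaryCopy:
    """A replica that is only ever revised from the primary copy (hypothetical)."""

    def __init__(self):
        self.value = None
        self.version = -1

    def refresh(self, value, version):
        # Accept only revisions newer than the one already held, so that
        # the update order fixed by the primary is preserved.
        if version > self.version:
            self.value = value
            self.version = version

    def read(self):
        # A read is satisfied entirely from the local replica.
        return self.value


class PrimaryCopy:
    """The single authoritative copy; every update is applied here first."""

    def __init__(self, value):
        self.value = value
        self.version = 0
        self.secondaries = []

    def attach(self, secondary):
        # Register a secondary and bring it up to the current revision.
        self.secondaries.append(secondary)
        secondary.refresh(self.value, self.version)

    def update(self, new_value):
        # Apply the update to the primary first, which fixes its position
        # in the time sequence of updates, then revise each secondary.
        self.version += 1
        self.value = new_value
        for secondary in self.secondaries:
            secondary.refresh(self.value, self.version)
        return self.version


primary = PrimaryCopy("rate=1.0")
site_a, site_b = SecondaryCopy(), SecondaryCopy()
primary.attach(site_a)
primary.attach(site_b)
primary.update("rate=1.1")
print(site_a.read(), site_b.read())
```

In this sketch the revision of the secondaries happens synchronously inside `update`; a real system would transmit the revisions across the network, possibly with delay.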
Replicating a data object is most useful when that object has a high proportion of read accesses and a low proportion of write accesses. This is because a read access can be performed on a single secondary copy, whereas a write access must be performed on the primary copy and propagated to all of the secondary copies. The cost of a write access is therefore higher than the cost of a read access. In a distributed system, updating a data object results in remote secondary copies of that object being invalidated and replaced by new copies transmitted across the network, so that network costs must be added to the other costs of an update.
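The dependence on the read/write mix can be made concrete with a deliberately simplified message count. The function below is a rough sketch under stated assumptions: every access originates at a site remote from the primary, each remote operation costs one network message, and a replicated write costs one message to the primary plus one per secondary copy. The function name and the figures used are hypothetical.

```python
def network_messages(reads, writes, n_secondaries, replicated):
    """Count network messages under a simplified one-message-per-hop model."""
    if replicated:
        # Reads are satisfied by the local secondary copy (no messages).
        # Each write goes to the primary, then out to every secondary.
        return writes * (1 + n_secondaries)
    # Without replication, every read and every write is a remote request.
    return reads + writes


# Read-mostly workload: replication reduces traffic.
print(network_messages(900, 100, 5, replicated=True))   # 600
print(network_messages(900, 100, 5, replicated=False))  # 1000

# Write-mostly workload: propagation traffic outweighs the saving.
print(network_messages(100, 900, 5, replicated=True))   # 5400
print(network_messages(100, 900, 5, replicated=False))  # 1000
```

Under these assumptions replication pays off only while the number of writes multiplied by (1 + number of secondaries) stays below the total number of accesses.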
An extreme case of this approach is the use of "snapshots", which are intended as read-only replicas for use by decision support applications. IBM Research Report RJ4992, "Snapshot Differential Refresh Algorithm" (B. Lindsay et al, 1986), describes how snapshots may be periodically refreshed to keep them closer to the current state of the primary data. However, snapshots have no guaranteed integrity and may not be used for transactional data updates.
Where a large number of users update a shared file or database, secondary copies are quickly invalidated and a great deal of network traffic may be generated. This additional traffic may even exceed the reduction in network traffic which replication is supposed to bring about. The practical consequence, as discussed in the article `Structures for Systems of Networks` (A L Scherr, IBM Systems Journal Vol 25, No 1, 1987), has been that replication methods have been held not to be useful for large shared files and databases, which are almost always centralized.
A significant problem in the prior art, therefore, is that although data replication is desirable, it has been very difficult to achieve in the important case where the data is capable of being updated by users at multiple sites. In many practical situations, however, applications at distributed sites do not require access to the very latest data and may be able to function satisfactorily with data which is out of date by a known and controlled amount. Examples of this are applications which use rate tables which are updated periodically, and automated teller systems which use an out-of-date account balance when authorizing cash withdrawals.
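The idea of data that is out of date by a known and controlled amount can be sketched as a simple age check on a local copy. The names `CachedCopy`, `refreshed_at` and `max_age`, and the use of seconds as the time unit, are assumptions for illustration; the source does not specify how the permitted staleness is represented.

```python
from dataclasses import dataclass


@dataclass
class CachedCopy:
    """A local copy that records when it was last refreshed (hypothetical)."""

    value: object
    refreshed_at: float  # time of last refresh, in seconds on some clock

    def is_usable(self, now, max_age):
        # The copy may be used as long as its age does not exceed the
        # staleness that the application has declared acceptable.
        return (now - self.refreshed_at) <= max_age


rate_table = CachedCopy(value={"USD": 1.0}, refreshed_at=100.0)
print(rate_table.is_usable(now=160.0, max_age=3600.0))   # True: 60 s old
print(rate_table.is_usable(now=5000.0, max_age=3600.0))  # False: 4900 s old
```

The current time is passed in explicitly so that the check is deterministic; an application would normally supply a reading from its own clock.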
A problem can occur if a particular site is arranged to store the primary copy of a data item but a remote site in fact requires significantly more update access to that item than the site holding the primary copy. The remote site would then have to transmit each of its update requests to the site holding the primary copy and wait for the resulting revisions to be transmitted back to its own secondary copy.
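The size of this penalty can be illustrated by counting how many updates must make a round trip to a remote primary for a given placement. This is only a sketch: the function name, the site labels and the update counts are hypothetical, and each remote update is assumed to cost exactly one round trip.

```python
def remote_update_round_trips(updates_by_site, primary_site):
    """Count updates that must be sent to a remote primary and awaited."""
    return sum(
        count
        for site, count in updates_by_site.items()
        if site != primary_site  # updates at the primary's own site are local
    )


updates = {"A": 10, "B": 500}
print(remote_update_round_trips(updates, primary_site="A"))  # 500
print(remote_update_round_trips(updates, primary_site="B"))  # 10
```

With the primary copy held at site A, the heavily updating site B incurs 500 round trips; the same workload would incur only 10 if the primary copy were held at site B.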