The present invention is directed to applications and/or systems which require weakly-consistent, replicated data storage in order to enable continued operation of component subsets in the face of faults and network partitioning events. More particularly, the invention is directed to propagating updates from an originating replica to all other replicas in an efficient manner.
In the field of distributed computing, questions of how best to allow multiple users access to a distributed database naturally arise. The answers change depending upon the design goals of the system and the expectations of its users.
In some systems that are designed to accommodate the shared access to data among a number of users, some form of data "replication" is typically implemented. To achieve data replication, systems are commonly partitioned into a number of data "servers," each server having a complete replica of the system database. These servers process requests from a number of "clients." Typically, client requests come in the form of Reads and Writes to the system database. Replication of data allows flexibility and availability to clients. If data is completely (and perfectly) replicated across all servers, then a client will always have access to the same system data.
However, the cost of "perfect" replication may be prohibitive, especially in systems that have large numbers of clients and servers. Any change (due to a Write) effected by a client necessarily needs to be immediately propagated to all system servers to ensure perfect replication. If the updates are not "immediately" propagated throughout the system, there exists the possibility that some client will access a data item at a server that has not as yet been updated, thus violating the perfection of the replication.
As a practical matter, perfection, as a system goal, is impossible. Different schemes for the relaxation of this condition have emerged. At one end of the spectrum, a "strongly consistent" distributed database provides that changes are either applied atomically throughout the system or trigger some type of server-initiated callback when inconsistencies occur. Although strongly consistent systems provide a high degree of data consistency, the price often paid is availability. For example, a client of a strongly consistent database requesting data that has recently been updated may have access to that data blocked until it has been updated across all servers.
At the other end, "weakly consistent" databases allow access to servers and data that may not contain all system updates. Such systems typically are characterized by the "lazy" propagation of updates between servers (i.e., updated servers propagate updates to other servers over time); thus the possibility exists for clients to see inconsistent values when reading data from different replicas. The use of weakly consistent replicated data has been driven by the needs existing in distributed environments. Another force driving use of weakly consistent systems has to do with improving performance, so that connecting to a nearby server for a read or write will be faster and smoother than connecting to another server. Clients in contact with a server can gain useful information even when both client and server are partitioned from the other replicas. Weakly consistent replication systems attempt to ensure that these reads are usually reasonably up-to-date, and that eventually a full state of synchronization will be reached. Thus, a primary motivation for using weakly consistent systems is to provide distributed replication for reliable, although only weakly consistent, access to shared information even in the face of faults and network partitioning or wide-spread network overload, so that long-term synchronization is achieved among replicas.
Unfortunately, the lack of guarantees concerning the ordering of read and write operations in weakly consistent systems can confuse users and applications. A user may read some value for a data item and then later read an older value. Similarly, a user may update some data item based on reading some other data, while others read the updated item without seeing the data on which it is based. A serious problem with weakly consistent systems is that inconsistencies can appear even when only a single user or application is making data modifications. For example, a mobile client of a distributed database system could issue a write at one server, and later issue a read at a different server. The client would see inconsistent results unless the two servers had synchronized with one another sometime between the two operations.
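The mobile-client scenario just described can be illustrated with a minimal sketch. The `Replica` class, its per-key version counter, and the `sync_with` method are hypothetical names chosen for illustration only; they are not taken from any particular system, and a real replica would need a more careful conflict-resolution rule.

```python
class Replica:
    """A hypothetical server holding a full copy of the database."""

    def __init__(self, name):
        self.name = name
        self.data = {}  # key -> (version, value)

    def write(self, key, value):
        # Bump the local version for this key and store the new value.
        version = self.data.get(key, (0, None))[0] + 1
        self.data[key] = (version, value)

    def read(self, key):
        entry = self.data.get(key)
        return entry[1] if entry else None

    def sync_with(self, other):
        """Pair-wise exchange: both sides keep the higher-versioned entry."""
        for key in set(self.data) | set(other.data):
            newest = max(
                self.data.get(key, (0, None)),
                other.data.get(key, (0, None)),
                key=lambda entry: entry[0],
            )
            self.data[key] = newest
            other.data[key] = newest


a, b = Replica("A"), Replica("B")
a.write("phone", "555-0100")   # the client writes at server A
stale = b.read("phone")        # then reads at server B before any sync
a.sync_with(b)                 # the servers synchronize later
fresh = b.read("phone")        # the same read now reflects the write
```

In this sketch the first read at B misses the client's own write, and only the read issued after the two servers synchronize returns it.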
Despite the above-noted disadvantages, weakly consistent systems are popular in situations that do not require total consistency in order to be valuable, due to their high availability, good scalability, and simplicity of design. These advantages arise from the ability to allow reads and writes to be performed with little or no real-time synchronization among replicas.
To implement weakly consistent designs, systems and algorithms which offer quick propagation of changes have been developed. However, these quick propagation designs, such as multicasting, provide low levels of fault tolerance and have only a "best efforts" standard of reliability.
Other designs are pre-configured to periodically send updates by means of pair-wise interactions to neighboring replicas that have not yet received updates which may have taken place. Such an approach ensures that eventually all updates will reach all replicas, assuming that all replicas engage in sufficient pair-wise update exchanges so that a transitive closure of propagation is achieved. However, it may take a considerable amount of time for any given update to reach all the replicas whose local clients might be interested in its value.
Thus, a mechanism is needed for replicated, weakly consistent databases that combines the best of a fast replication system, which has only a "best efforts" standard of reliability, with the best of a highly reliable replication system, which has high latency.
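The latency of purely pair-wise propagation can be made concrete with a small sketch. It assumes synchronous rounds in which every replica that already holds the update passes it to each of its neighbors, so an update travels one hop per round; the function name `anti_entropy_rounds` and the chain topology are illustrative assumptions, not details of any specific prior design.

```python
def anti_entropy_rounds(neighbors, origin):
    """Count synchronous rounds until every replica holds the update.

    neighbors: dict mapping replica id -> list of neighboring replica ids.
    origin: id of the replica where the update was first applied.
    """
    updated = {origin}
    rounds = 0
    while len(updated) < len(neighbors):
        # Work from the set as it stood at the start of the round, so the
        # update advances exactly one hop per round.
        newly = {n for r in updated for n in neighbors[r]} - updated
        if not newly:
            raise ValueError("topology is not connected")
        updated |= newly
        rounds += 1
    return rounds


# Eight replicas in a chain: 0 - 1 - 2 - ... - 7 (a hypothetical topology).
chain = {i: [j for j in (i - 1, i + 1) if 0 <= j < 8] for i in range(8)}
rounds_from_end = anti_entropy_rounds(chain, origin=0)
rounds_from_middle = anti_entropy_rounds(chain, origin=3)
```

Under these assumptions the number of rounds equals the largest hop distance from the originating replica, which is why delivery is reliable but slow when exchanges are infrequent or the topology is long.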
A multi-node computer network is provided having a weakly consistent replicated data storage system with an enhanced update mechanism which enables continued operation of network subsets in the face of faults and network partitioning events. Provided is a multicast communication update facility configured to propagate updates from an originating replica source in the computer network to all replicas of the computer network at a single time using a best-efforts design. An epidemic update communication facility is provided to send updates by pair-wise interaction to non-updated neighboring replicas, wherein the epidemic update communication facility and the multicast communication update facility are employed together in at least some of the replicas.
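As a rough illustration of how the two facilities might complement each other, the sketch below sends an update once by best-effort multicast and then lets pair-wise exchanges repair whatever the multicast missed. Everything here is an assumption made for illustration: the function names, the `dropped` set that stands in for lost multicast packets, the six-replica ring, and the synchronous-round model.

```python
def multicast(origin, replicas, update, dropped):
    """Best-effort one-shot send; replicas in `dropped` never see the packet."""
    for r in replicas:
        if r == origin or r not in dropped:
            replicas[r].add(update)


def epidemic_round(replicas, neighbors):
    """One synchronous round of pair-wise exchange with neighboring replicas."""
    # Snapshot first so an update moves at most one hop in a single round.
    snapshot = {r: set(updates) for r, updates in replicas.items()}
    for r, peers in neighbors.items():
        for p in peers:
            replicas[p] |= snapshot[r]


# Six replicas arranged in a ring (hypothetical topology).
replicas = {i: set() for i in range(6)}
ring = {i: [(i - 1) % 6, (i + 1) % 6] for i in range(6)}

multicast(0, replicas, "u1", dropped={2, 3})
missing_after_multicast = [r for r in sorted(replicas) if "u1" not in replicas[r]]

repair_rounds = 0
while any("u1" not in updates for updates in replicas.values()):
    epidemic_round(replicas, ring)
    repair_rounds += 1
```

In this example the multicast reaches four of the six replicas immediately, and a single pair-wise round fills in the two that were missed, rather than the several rounds a pure pair-wise scheme would need from a single origin.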
With attention to another aspect of the present invention, the multicast update communication facility and the epidemic update communication facility are configured in accordance with parameters which allow for the facilities to operate in a parallel non-overlapping manner with respect to each other.
With attention to yet another aspect of the present invention, the multicast update communication facility and the epidemic update communication facility are configured in accordance with parameters which cause the multicast update communication facility to influence a rate at which the epidemic update communication facility operates.
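One way such an influence could be expressed is sketched below. This is purely a hypothetical policy and is not stated in the text above: it assumes the system can estimate how many replicas a multicast reached, and it shortens the interval between pair-wise exchanges when that estimate is low. The function name, its parameters, and the `floor` value are all illustrative assumptions.

```python
def epidemic_interval(base_seconds, delivered, expected, floor=0.1):
    """Return the wait between pair-wise exchanges (hypothetical policy).

    base_seconds: interval used when the multicast reached every replica.
    delivered:    estimated number of replicas the multicast reached.
    expected:     number of replicas the multicast was meant to reach.
    floor:        smallest fraction of base_seconds ever returned.
    """
    if expected <= 0:
        raise ValueError("expected must be positive")
    # Clamp the delivery estimate into [0, 1] before scaling.
    fraction = min(max(delivered / expected, 0.0), 1.0)
    # Poor multicast delivery -> shorter interval -> more frequent exchanges.
    return base_seconds * max(floor, fraction)


full_delivery = epidemic_interval(60, delivered=10, expected=10)
half_delivery = epidemic_interval(60, delivered=5, expected=10)
no_delivery = epidemic_interval(60, delivered=0, expected=10)
```

Under this assumed policy, pair-wise exchanges run at their relaxed base rate when multicast is working well and speed up as multicast delivery degrades.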