In the past, distributed systems have consisted of a number of software objects that reside in a number of data centers. The software objects can be replicated databases or other types of systems. A local-area network, such as an Ethernet, mediates communication between objects in the same data center. Communication between objects that reside in different data centers takes place via a wide-area network, such as a leased phone line. The dispersion of objects across multiple data centers allows a system to be resilient to disasters that cause a data center to go down. The multiplicity of objects within a data center makes each data center fault-tolerant: a data center can continue to deliver its intended function even if some of its objects fail.
The scenario is the following: a given object, called the initiator, wants to invoke a given method in all objects. It is necessary that objects be invoked reliably: informally, the failure of an object should not prevent other (correct) objects from being invoked. The invocation protocol should be efficient: since data centers are connected to each other via wide-area networks, and since such networks are slow and unpredictable, it is desirable to minimize the communication between data centers without compromising the reliability of the system.
There are existing solutions for so-called reliable broadcast. One common way to implement reliable broadcast is message diffusion. With message diffusion, the basic idea is that any receiver of a broadcast message relays the message to all other objects in the system. With this scheme, all correct objects eventually receive the broadcast message. The problem with message diffusion is that every correct object propagates each message to all other objects, which means that the number of messages communicated across wide-area links is proportional to the square of the number of objects.
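The quadratic cost of message diffusion can be illustrated with a minimal simulation sketch. The function name `diffuse`, the `site_of` list (mapping each object to its data center), and the `crashed` set are hypothetical names introduced only for this illustration. The sketch assumes that a crashed object is down from the start and never relays, and that messages sent to a crashed object are still counted.

```python
from collections import deque


def diffuse(site_of, initiator=0, crashed=frozenset()):
    """Simulate message diffusion among len(site_of) objects.

    site_of[i] is the data center of object i (illustrative assumption).
    Returns (delivered objects, total messages, wide-area messages).
    """
    delivered = set()
    total = wan = 0
    pending = deque([initiator])  # objects with a copy waiting to be handled
    while pending:
        obj = pending.popleft()
        # Crashed objects do nothing; duplicates are ignored.
        if obj in crashed or obj in delivered:
            continue
        delivered.add(obj)
        # On first receipt, relay the message to every other object.
        for peer in range(len(site_of)):
            if peer == obj:
                continue
            total += 1
            if site_of[peer] != site_of[obj]:
                wan += 1  # the message crosses a wide-area link
            pending.append(peer)
    return delivered, total, wan


if __name__ == "__main__":
    # Two data centers with two objects each.
    print(diffuse([0, 0, 1, 1]))
```

With two data centers of two objects each, every one of the four objects relays to its three peers, giving 12 messages in total, 8 of which cross the wide-area link.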
Another way to implement reliable broadcast is to use failure detection. When a first object receives a message from a second object, the following takes place. If the first object does not suspect the second object to have failed, it does nothing. If the first object suspects the second object to have failed, it relays the message to the other objects in the system. The number of messages communicated across wide-area links is here proportional to the number of objects.
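The failure-detection scheme described above can be sketched in the same style. This is an illustration under stated assumptions, not a definitive implementation: the name `fd_broadcast` and its parameters are hypothetical, failure detectors are assumed to be accurate (an object is suspected exactly when it is in `crashed`), and `reached` lists the objects the initiator managed to send to before possibly crashing.

```python
from collections import deque


def fd_broadcast(num_objects, initiator, reached, crashed=frozenset()):
    """Simulate reliable broadcast based on failure detection.

    Assumption: an object is suspected if and only if it is in `crashed`.
    Returns (correct objects that delivered, total messages sent).
    """
    delivered = set()
    if initiator not in crashed:
        delivered.add(initiator)
    relayed = set()  # objects that have already relayed once
    sent = 0
    pending = deque()
    # The initiator sends to as many objects as it reaches before failing.
    for peer in reached:
        sent += 1
        pending.append((initiator, peer))
    while pending:
        sender, obj = pending.popleft()
        if obj in crashed:
            continue  # a crashed receiver does nothing
        delivered.add(obj)
        # Relay only when the sender is suspected to have failed.
        if sender in crashed and obj not in relayed:
            relayed.add(obj)
            for peer in range(num_objects):
                if peer != obj:
                    sent += 1
                    pending.append((obj, peer))
    return delivered, sent


if __name__ == "__main__":
    # Failure-free run: only the initiator's own messages are sent.
    print(fd_broadcast(4, 0, reached=[1, 2, 3]))
    # Initiator crashes after reaching only object 1, which then relays.
    print(fd_broadcast(4, 0, reached=[1], crashed=frozenset({0})))
```

In the failure-free run, only the initiator's three messages are sent, so the count grows linearly with the number of objects. In the second run, object 1 suspects the failed initiator and relays, so the remaining correct objects still receive the message.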
A protocol (a systematic exchange of messages) has long been sought that would allow invocation of the global set of objects in a fault-tolerant but still efficient manner. In such a protocol, the number of messages would not be proportional to the number of objects or, even worse, to the square of the number of objects. Those skilled in the art have heretofore been unsuccessful in creating such a protocol.