The invention relates generally to distributed communications systems and, more particularly, to a method providing for the atomic transmission of messages in a network within a virtual synchrony environment, to thereby enhance the fault tolerance of the system.
A distributed system utilizing a protocol referred to as virtual synchrony (i.e., operating in a virtual synchrony environment) comprises a plurality of process groups, each of which process groups comprises a plurality of processes. Processes are typically distributed among two or more computers so that if one computer fails, the entire process group does not fail. Processes and process groups are configured for managing and executing application programs, and for transmitting messages between the process groups and processes.
Virtual synchrony ensures that a message transmitted to a plurality of destination processes is received by either all or none of the destination processes. Virtual synchrony furthermore ensures that messages transmitted in a specific order from one process of the system are delivered to destination processes in the order in which they were initially transmitted. This order is maintained even when messages destined for different processes are interspersed with one another. When such interspersed messages are received by the respective destination processes, virtual synchrony ensures that the receiving processes see them in the original message order.
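The per-sender ordering guarantee described above can be illustrated with a minimal sketch. It assumes, purely for illustration, that each sender stamps its messages with a consecutive sequence number starting at zero; the class name `FifoReceiver` and its fields are hypothetical and are not taken from any particular virtual synchrony implementation.

```python
from collections import defaultdict


class FifoReceiver:
    """Hypothetical receiver that delivers each sender's messages in send order."""

    def __init__(self):
        # Next sequence number expected from each sender (assumed to start at 0).
        self.next_seq = defaultdict(int)
        # Messages that arrived ahead of their turn, keyed by sender then sequence number.
        self.pending = defaultdict(dict)
        # Messages handed to the application, in delivery order.
        self.delivered = []

    def receive(self, sender, seq, payload):
        # Buffer the message, then deliver as many in-order messages as possible.
        self.pending[sender][seq] = payload
        while self.next_seq[sender] in self.pending[sender]:
            n = self.next_seq[sender]
            self.delivered.append((sender, self.pending[sender].pop(n)))
            self.next_seq[sender] = n + 1
```

In this sketch a message that arrives early is held back until every earlier message from the same sender has been delivered, while messages from other senders are unaffected.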
A drawback with conventional virtual synchrony is that if a device in a distributed system fails (i.e., a "fault" occurs) during the transfer of a sequence of related messages resulting from a common event, a destination process is unable to determine that some of those messages have not been delivered, and will thus not recover from the fault. Such a fault may result in the propagation of further faults if the process receiving the messages subsequently generates actions or messages which depend on conditions or states which may have resulted but for the fault. What is needed, therefore, is a system and method which would enable a distributed system to identify and recover from such faults.
The present invention provides a method for ensuring that all or none of the messages generated by a process in response to an event or incoming message in a virtual synchrony environment are delivered to all of the destinations of every individual message. This is accomplished by assembling into an atomic message multiple individual messages generated by a process in response to an event. The atomic message is transmitted through a system in a virtual synchrony environment and all or none of the messages are delivered to all of the destination addresses of each of the individual messages. A destination process does not respond to any of the individual messages until the entire atomic message has been received. Individual messages not intended for a particular process may be removed by a computer or process at the destination.
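The assembly and destination-side filtering described above can be sketched as follows. This is an illustrative model only, not the claimed implementation: the names `Message`, `AtomicMessage`, `all_destinations`, and `deliver` are hypothetical, and the all-or-none transport itself is assumed to be supplied by the underlying virtual synchrony layer rather than shown here.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Message:
    """One individual message and the set of processes it is addressed to."""
    destinations: frozenset
    payload: str


@dataclass
class AtomicMessage:
    """Bundle of the individual messages a process generated for one event."""
    messages: list = field(default_factory=list)

    def add(self, destinations, payload):
        self.messages.append(Message(frozenset(destinations), payload))

    def all_destinations(self):
        # The bundle is sent to the union of every individual message's destinations.
        result = set()
        for m in self.messages:
            result |= m.destinations
        return result


def deliver(atomic, process_name):
    # Called only once the entire atomic message has arrived at the destination;
    # individual messages not intended for this process are removed.
    return [m.payload for m in atomic.messages if process_name in m.destinations]
```

Because a destination acts only on a complete bundle, it never responds to a subset of the messages that a single event produced.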
By the use of the present invention, the occurrence of faults which result from the partial delivery of messages is minimized. As a consequence, if a failure occurs in the transmission of a message, the propagation of errors is minimized. Thus, both error recovery and fault tolerance are enhanced.