Message retransmission refers to the resending of messages that have not been successfully delivered. In a distributed system in which multiple host servers communicate with each other, such as one in which a master node distributes messages to agents (which may themselves relay the messages to other agents), message retransmission techniques are often employed in an attempt to guarantee successful message delivery. For example, in a distributed storage area network, messages pertaining to the network's cluster directory, which stores information on the state of the cluster and the health of disks and nodes, may be sent between cluster nodes and retransmitted, as appropriate, so that the nodes have a consistent view of the cluster directory information.
Traditional message retransmission techniques can lack robustness, because the retransmission itself may fail. For example, the master node may be out of memory and unable to process a retransmission request. As a result, some messages may never be received by the agents, causing those agents to fall out of sync with the master and leaving the distributed system unstable. To restore consistency, an affected agent may be removed from and rejoined to the cluster, with a snapshot of, e.g., the current cluster directory being copied to the rejoined server. However, sending such a snapshot may require a large amount of network traffic, and the cluster service may also need to be temporarily halted.
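The failure mode described above can be illustrated with a minimal Python sketch. It is not taken from any particular system: the `Master` and `Agent` classes, the use of per-message sequence numbers to detect gaps, and the `out_of_memory` flag are all assumptions introduced here for illustration. The agent asks the master to retransmit any missing message and, if that request cannot be served, falls back to rejoining with a full snapshot.

```python
from dataclasses import dataclass, field


@dataclass
class Master:
    """Hypothetical master node that retains sent messages for retransmission."""
    log: dict = field(default_factory=dict)  # sequence number -> payload
    out_of_memory: bool = False              # simulates the failure in the text

    def publish(self, seq, payload):
        self.log[seq] = payload

    def retransmit(self, seq):
        # Returns None when the retransmission request cannot be served.
        if self.out_of_memory:
            return None
        return self.log.get(seq)

    def snapshot(self):
        # Full copy of the state; stands in for a cluster directory snapshot.
        return dict(self.log)


@dataclass
class Agent:
    """Hypothetical agent that applies messages in sequence-number order."""
    state: dict = field(default_factory=dict)
    next_seq: int = 1
    rejoined: bool = False

    def receive(self, seq, payload, master):
        # Fill any gap via retransmission before applying the new message.
        while self.next_seq < seq:
            missing = master.retransmit(self.next_seq)
            if missing is None:
                # Retransmission failed: fall back to the costly rejoin.
                self.rejoin(master)
                return
            self.state[self.next_seq] = missing
            self.next_seq += 1
        if seq == self.next_seq:  # messages already applied are ignored
            self.state[seq] = payload
            self.next_seq += 1

    def rejoin(self, master):
        # Copies the whole snapshot, which is the expensive path.
        self.state = master.snapshot()
        self.next_seq = max(self.state, default=0) + 1
        self.rejoined = True
```

In this sketch, an agent that misses message 2 but later receives message 3 recovers with a single retransmission when the master is healthy. When `out_of_memory` is set, the same gap forces a transfer of the entire snapshot, which corresponds to the network cost noted above.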