Many applications executing in computing clusters, including cloud computing clusters, require a high level of availability, redundancy, or other measures of robustness. In such applications, state data is typically propagated throughout the computing cluster so that no single node becomes a point of failure. For example, business-critical applications such as sales and customer billing systems typically must tolerate the failure of any single component. A node in a computing cluster may be brought down by any combination of hardware failure, software failure, network failure, power failure, or other unplanned outage. However, software failures (including software bugs, software misconfigurations, crashes due to transient hardware errors, and all other software failures) are typically more common than any other failure source.
In some systems, application state may be propagated through a computing cluster by synchronous update messages sent between all of the nodes of the cluster. Additionally or alternatively, in some systems the application state may be logged synchronously to global storage such as a storage area network or network attached storage volume. In such systems, synchronization between nodes and/or with global storage is a major source of communication overhead, equipment and facility costs, and complexity, and it also reduces the flexibility of the system. Additionally, such systems fail to scale to large numbers of nodes, such as those needed for cloud computing applications.
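The scaling concern can be illustrated with a rough message count. The following is a minimal sketch, not a description of any particular system: it assumes that in each update round every node sends one state update to every other node and, optionally, waits for one acknowledgement per update. The function name `messages_per_round` and the acknowledgement model are hypothetical and are introduced only for illustration.

```python
def messages_per_round(num_nodes: int, with_acks: bool = True) -> int:
    """Count messages in one all-to-all synchronous update round.

    Assumption (illustrative only): each node sends one update to each
    of the other nodes, and each update is answered by one
    acknowledgement when with_acks is True.
    """
    if num_nodes < 1:
        raise ValueError("num_nodes must be at least 1")
    updates = num_nodes * (num_nodes - 1)  # every ordered pair of distinct nodes
    return 2 * updates if with_acks else updates


if __name__ == "__main__":
    # The count grows quadratically with cluster size.
    for n in (2, 8, 64, 1024):
        print(n, messages_per_round(n))
```

Under these assumptions the message count grows with the square of the number of nodes, which is one way to see why all-to-all synchronous propagation becomes costly at cloud scale.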