As the number of nodes within a distributed system of computers increases, so too increases the likelihood of failure of one or more of those nodes. Failures can wreak havoc on a distributed system. For example, failures can cause data to be lost and/or delays in processing. Recovery from such failures is often time consuming and resource intensive, requiring, for example, halting processing across all system nodes, rolling back the nodes to a previous state, and then restarting the system. While this is occurring, tasks and processes may be delayed, thereby interfering with and disrupting workflow.
In the drawings, like reference numbers generally indicate identical or similar elements. Additionally, generally, the left-most digit(s) of a reference number identifies the drawing in which the reference number first appears.