In a distributed service (e.g., a distributed database), when an event (e.g., a database query) causes a change in the service that depends on a state of the service, it can matter whether the event occurred before or after the state was modified by another event. During normal operations, a variety of tactics for handling these types of race conditions may be employed. However, when there has been a disruption in the distributed service (e.g., a node failure), it can be difficult to ensure that the state of the service when restored is the same as, or at least consistent with, the service state before the disruption. One way to recover the original state may involve rolling back nodes in the distributed service to a point prior to the disruption, rapidly re-performing events, and possibly undoing incomplete events. To ensure consistency in the state of the distributed system before and after the recovery, the ordering of the events may be determined so that the events can be re-performed in substantially the same order as they originally occurred, or in an order consistent with an originally intended ordering that was partially completed.
For this type of ordering, conventional clocks are often considered unreliable because conventional clocks will diverge across node boundaries, despite occasional synchronization. Instead, some conventional systems use Lamport clocks, which increase their clock values stepwise as events occur on individual nodes. As nodes interact, Lamport clock values may be passed with signals between the nodes. Upon receiving a signal, a node may increase its Lamport clock to a received Lamport clock value when the received value is higher than the local node's Lamport clock value. During a system recovery, events may be replayed in an order determined by their shared Lamport clock values. However, these systems may have high network overhead because Lamport clock values are rapidly passed with signals between nodes as events occur. Other conventional systems may require that each event causing a state change be recorded in one or more centralized nodes. These systems may also have high network overhead because every event is reported to the centralized nodes.
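The Lamport clock behavior described above may be illustrated with a minimal sketch. The class name `Node`, its methods, and the helper `replay_order` are hypothetical names chosen for illustration and do not describe any particular conventional system. The sketch assumes the commonly used update rule in which a receiving node first raises its clock to the received value when that value is higher, and then steps the clock once for the receive event itself. Ties between equal clock values are broken by node name, which is also an assumption.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical node that keeps a Lamport clock and a local event log."""
    name: str
    clock: int = 0
    log: list = field(default_factory=list)

    def local_event(self, label):
        # The clock increases stepwise as each event occurs on this node.
        self.clock += 1
        self.log.append((self.clock, self.name, label))
        return self.clock

    def send(self, label):
        # Sending is itself an event; the returned clock value
        # travels with the signal to the receiving node.
        return self.local_event(f"send {label}")

    def receive(self, label, received_clock):
        # Raise the local clock to the received value only when the
        # received value is higher, then record the receive event.
        self.clock = max(self.clock, received_clock)
        return self.local_event(f"recv {label}")


def replay_order(*nodes):
    """Merge the node logs into one replay order (assumed tie-break: node name)."""
    events = [event for node in nodes for event in node.log]
    return sorted(events, key=lambda event: (event[0], event[1]))


a, b = Node("a"), Node("b")
a.local_event("write x")
stamp = a.send("m1")          # clock value passed with the signal
b.local_event("read y")
b.receive("m1", stamp)

for event in replay_order(a, b):
    print(event)
```

In this sketch the send on node `a` is always ordered before the matching receive on node `b`, which is the property a recovery procedure would rely on when replaying events. Note that a clock value accompanies every signal, which reflects the network overhead noted above.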