1. Background and Relevant Art
Computer systems and related technology affect many aspects of society. Indeed, the computer system's ability to process information has transformed the way we live and work. Computer systems now commonly perform a host of tasks (e.g., word processing, scheduling, accounting, etc.) that prior to the advent of the computer system were performed manually. More recently, computer systems have been coupled to one another and to other electronic devices to form both wired and wireless computer networks over which the computer systems and other electronic devices can transfer electronic data. Accordingly, the performance of many computing tasks is distributed across a number of different computer systems and/or a number of different computing environments. For example, distributed applications can have components at a number of different computer systems.
In some environments, components of different distributed applications can interoperate with one another. It may also be that multiple copies of the same distributed application operate in parallel. In these environments, commands and/or state for the distributed applications have to be maintained in case a distributed application or component thereof fails. If a distributed application component malfunctions, commands and/or state for the distributed application can be lost.
Unfortunately, many distributed applications are designed with limited state management capabilities. These limited state management capabilities may be able to recover commands and/or state in standalone or other relatively simple operating environments. However, these limited state management capabilities are often not robust enough to recover commands and/or state in more complex operating environments (where the possibility of loss also increases).
As such, a separate centralized coordination service can be used for maintaining distributed application state (e.g., maintaining configuration information, naming, providing distributed synchronization, and providing group services). Coordination services can include multiple nodes and use replication to help ensure that distributed application state is not lost. Coordination services can use active replication or passive replication to replicate state among multiple nodes in accordance with a consensus protocol. In active replication, each coordination service node performs an application command and generates state updates. When consensus is reached on generated results, the generated results are stored at each node (including replication of generated results to any minority nodes). In passive replication, a primary node performs an application command and proposes state updates to other nodes. When consensus is reached on proposed results, the proposed results are replicated to and stored at the other nodes.
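The passive replication flow described above can be illustrated with a minimal sketch. The function name `passive_replicate`, the `state` dictionary, and the `acks_from` parameter are hypothetical names chosen for illustration only. The sketch assumes a simple majority vote and omits networking, durable storage, and failure handling that an actual consensus protocol would require.

```python
def passive_replicate(command, state, primary, backups, acks_from):
    """Hypothetical sketch: the primary performs the command, proposes the
    resulting state update, and the update is stored at every node only if
    a majority (primary included) agrees."""
    proposed = command(state[primary])        # only the primary executes
    total = len(backups) + 1
    votes = 1 + sum(1 for b in backups if b in acks_from)
    if votes <= total // 2:                   # no majority, nothing stored
        return False
    for node in [primary, *backups]:          # replicate the proposed result
        state[node] = proposed
    return True


state = {"A": 0, "B": 0, "C": 0}
committed = passive_replicate(lambda s: s + 1, state, "A", ["B", "C"], {"B"})
rejected = passive_replicate(lambda s: s + 1, state, "A", ["B", "C"], set())
```

In this sketch the first call commits because nodes A and B form a majority of three, while the second call stores nothing because only the primary agrees.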
Many consensus protocols rely on the notion of quorums. A commonly used quorum is a majority quorum. Using a majority quorum, when a majority of coordination service nodes agree on a proposed value, the value is declared successful and is locked for replication to all coordination service nodes. Requiring consensus enables progress even if a minority of coordination service nodes crash. Use of a majority quorum also increases consistency since any two majorities intersect on at least one coordination service node by definition. This intersection property guarantees that any majority has at least one coordination service node that has observed all successfully replicated state updates.
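The intersection property can be checked by brute force for small clusters. The following is a minimal sketch. The helper names `majority_size`, `majorities`, and `all_majorities_intersect` are hypothetical and are not part of any particular coordination service.

```python
from itertools import combinations


def majority_size(n):
    """Smallest number of nodes that forms a majority of n nodes."""
    return n // 2 + 1


def majorities(nodes):
    """Every subset of nodes that contains at least a majority."""
    k = majority_size(len(nodes))
    return [
        frozenset(group)
        for size in range(k, len(nodes) + 1)
        for group in combinations(nodes, size)
    ]


def all_majorities_intersect(nodes):
    """True if every pair of majorities shares at least one node."""
    return all(q1 & q2 for q1, q2 in combinations(majorities(nodes), 2))


three_ok = all_majorities_intersect(["A", "B", "C"])
five_ok = all_majorities_intersect(["A", "B", "C", "D", "E"])
```

For three nodes a majority is two nodes, and any two such pairs overlap. That overlap is what allows a later quorum to learn about an earlier successful replication.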
Consensus protocols typically assume that nodes can crash and recover. Consequently, the consensus protocol at the core of the service needs to behave correctly under this assumption. Replication is deemed successful when a node has persisted application state to a storage device (e.g., a mechanical disk, a Solid State Drive (SSD), etc.) providing more durable storage. Storing application state to more durable storage (versus more volatile memory, such as RAM) provides assurances that replicated application state is maintained at a majority of coordination service nodes after a node crash.
However, even more durable types of storage are not free from faults. A fault in durable storage can cause a coordination service node to lose replicated application state that the coordination service node previously acknowledged as being stored. Thus, when the coordination service node restarts, there may not be a majority of coordination service nodes in agreement with respect to application state. Further, in a virtualized environment, a durable storage failure can be caused by restarting a process on a different server. The restart is automated and there are no mechanisms in place to prevent data loss.
For example, a coordination service may include nodes A, B, and C. Nodes A and B may form a quorum and replicate state update x. Subsequently, node B may be subject to a durable storage failure resulting in the loss of state update x. Due to the durable storage failure, node B can be manually removed from and then manually reintroduced into the coordination service. After reintroduction, node B forms a quorum with node C prior to communication with node A. Nodes B and C decide that no state updates have been replicated. Nodes B and C then successfully replicate state update y. Thus, state update x is lost even though state update x had been successfully replicated.
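The scenario above can be reproduced with a small simulation. This is a simplified sketch under stated assumptions: the `Node` class, the `lose_storage` method, and the `replicate` function are hypothetical, each node's durable storage is modeled as an in-memory list, and a quorum is assumed to adopt the longest log held by any of its members before appending a new update.

```python
class Node:
    """Hypothetical coordination service node with a durable log."""

    def __init__(self, name):
        self.name = name
        self.log = []            # models state updates on durable storage

    def lose_storage(self):
        """Models a durable storage failure: acknowledged updates vanish."""
        self.log = []


def replicate(update, quorum):
    """Quorum members agree on the longest log any member holds, then each
    member stores that log with the new update appended."""
    agreed = max((node.log for node in quorum), key=len)
    new_log = list(agreed) + [update]
    for node in quorum:
        node.log = list(new_log)


a, b, c = Node("A"), Node("B"), Node("C")
replicate("x", [a, b])    # A and B form a quorum and replicate x
b.lose_storage()          # B loses x, then rejoins the service
replicate("y", [b, c])    # B and C see no prior updates and replicate y
```

After these steps, node A holds only x while nodes B and C, which together form a majority, hold only y. No error is raised at any point, which is why the loss of x goes unnoticed.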
Unfortunately, loss of state update x can be difficult to detect (and may be referred to as “silent data loss”). Applications using the coordination service may be unaware that state update x was lost. The applications may proceed on the knowledge that state update x was successfully replicated. Further, as long as the applications appear to be operating as intended, technical personnel administering the applications and/or coordination service may have no way to know that state update x was lost. Even after applications begin to malfunction due to the loss of state update x, investigating technical personnel have to consider many potential reasons and may not immediately identify loss of state update x as the reason. As applications operate for longer periods of time prior to exhibiting malfunctions, identifying loss of state update x as the reason can be even more difficult.