The present invention relates to techniques for performing lifecycle operations on a stateful clustered service in a manner that reduces downtime of the service and disruption to clients.
Stateful clustered services that provide strong consistency guarantees over the data stored with them, such as state machine replication systems (for example, APACHE ZOOKEEPER™), generally represent a critical service or dependency for their clients. Clients cannot necessarily proceed safely unless the stateful clustered service is available. For example, many such services require a majority quorum in order to continue to provide service. That is, as long as a majority of the n servers in the cluster (at least ⌊n/2⌋+1 servers, which equals (n+1)/2 when n is odd) are operational, the overall stateful service remains available to clients.
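By way of illustration only, the majority quorum arithmetic described above may be sketched as follows. The function names `quorum_size`, `tolerated_failures`, and `is_available` are hypothetical and are not taken from any particular system; the sketch assumes a simple strict-majority rule with equally weighted servers.

```python
def quorum_size(n: int) -> int:
    """Smallest number of servers that forms a strict majority of n servers."""
    if n < 1:
        raise ValueError("a cluster must contain at least one server")
    # floor(n / 2) + 1; for odd n this equals (n + 1) / 2.
    return n // 2 + 1


def tolerated_failures(n: int) -> int:
    """Number of servers that may be down while a majority remains."""
    return n - quorum_size(n)


def is_available(n: int, operational: int) -> bool:
    """True if the operational servers still form a majority quorum."""
    return operational >= quorum_size(n)
```

Under these assumptions, a five-server cluster has a quorum size of three and tolerates two unavailable servers, whereas a four-server cluster also requires three servers and therefore tolerates only one.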
When performing lifecycle operations on a stateful clustered service, such as upgrading the installation or migrating a server, it is important to do so in a manner that reduces downtime of the service and disruption to clients. Therefore, lifecycle operations are conventionally performed in a serial manner, in which one member of the stateful cluster is made unavailable to clients in order to perform the lifecycle operation on that member, while the remaining servers, or at least a majority of the servers, remain operational and continue to provide service. Once the lifecycle operation is complete, the server is brought back online and, after its successful reintegration into the cluster, the lifecycle operation proceeds with the next server, and so on until the operation has been applied to all servers in the cluster. This may be referred to as a rolling restart.
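The conventional serial procedure described above may be sketched, purely as an illustrative in-memory simulation, as follows. The `Server` class, the `has_quorum` and `rolling_restart` functions, and the `upgrade` operation are hypothetical names introduced for this example; setting the `online` flag stands in for taking a real server down and for its reintegration into the cluster, which an actual system would verify before proceeding.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Server:
    """Hypothetical stand-in for one member of the cluster."""
    name: str
    online: bool = True
    version: int = 1


def has_quorum(servers: List[Server]) -> bool:
    """True if the online servers form a strict majority of the cluster."""
    online = sum(1 for s in servers if s.online)
    return online >= len(servers) // 2 + 1


def rolling_restart(servers: List[Server],
                    operation: Callable[[Server], None]) -> List[str]:
    """Apply `operation` to one server at a time, keeping a majority online."""
    completed: List[str] = []
    for server in servers:
        server.online = False          # make this member unavailable
        if not has_quorum(servers):
            server.online = True       # undo, since the service would be lost
            raise RuntimeError(
                f"taking {server.name} offline would lose the majority quorum")
        operation(server)              # perform the lifecycle operation
        server.online = True           # assumed successful reintegration
        completed.append(server.name)
    return completed


def upgrade(server: Server) -> None:
    """Example lifecycle operation: bump a version number."""
    server.version += 1
```

For example, calling `rolling_restart` with three `Server` instances and `upgrade` processes the servers one after another in list order, while the same call on a two-server cluster raises an error because removing either member leaves no majority.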
While this approach affords the continued availability of the service, there are a number of problems with existing approaches to rolling restarts of stateful clustered services. First, such systems typically have the notion of a distinguished server that serves as the leader or primary node and is responsible for handling all write requests. If the leader process is taken down, a recovery protocol to elect a new leader needs to be initiated by the remaining members of the cluster. During that time, the entire service is unavailable to clients. Each time this protocol is executed, there is a risk that leader election takes longer than expected, for example, due to network or other issues, and exceeds session timeouts that the clients may have, which in turn can violate session guarantees. Second, for each server that is taken down for a lifecycle operation, its clients become disconnected and need to reconnect to another member of the cluster, which causes delays for those clients and shifts load onto the remaining servers in the cluster. Third, in the case of a migration, such as replacing a server with a new server or replacing a storage device, the member that comes back up needs to synchronize with a member of the majority quorum in order to transfer the state in its entirety. This causes network traffic and additional load on the cluster.
A need arises for techniques to reduce service unavailability, client disruption, network traffic and processing load due to state synchronization.