Field of the Invention
The present disclosure relates to distributed computing. More specifically, the present disclosure relates to a method and system for forcibly completing an upgrade of distributed software in the presence of the failure of one or more nodes.
Related Art
Clustering software brings together independent servers that cooperate as a single system. The clustering software may support rolling upgrades of distributed software, in which the nodes are upgraded one at a time. During a rolling upgrade, the cluster remains in operation and clients do not experience a cluster outage. For each node in turn, an administrator brings down the node, installs a new software version, and then activates the new software version on the node.
The nodes of a cluster operate at a common level called the acting version. The acting version of the distributed software is a version that can be supported by each node in the cluster. While performing the upgrade on the individual nodes, the nodes continue to operate and communicate under a previous acting version of the distributed software. After upgrading all the individual nodes, the entire cluster can be upgraded to operate according to a new acting version supported by the new version of the software. For example, network communication protocols or disk storage formats are not changed until the acting version for the entire cluster changes.
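The relationship between per-node software versions and the cluster-wide acting version can be illustrated with a minimal sketch. The `Node` class, the integer version numbers, and the rule that a node supports every version up to the one it has installed are assumptions made for illustration only; they are not taken from any particular clustering product.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    software_version: int  # hypothetical: installed version as a plain integer


def highest_common_version(nodes):
    # Assumption: a node supports every version up to the one it has installed,
    # so the oldest installed version is the highest level all nodes share.
    return min(node.software_version for node in nodes)


def rolling_upgrade(nodes, new_version):
    """Upgrade nodes one at a time; return the common level after each step."""
    trace = []
    for node in nodes:
        node.software_version = new_version  # bring down, install, activate
        trace.append(highest_common_version(nodes))
    return trace


cluster = [Node("a", 1), Node("b", 1), Node("c", 1)]
trace = rolling_upgrade(cluster, new_version=2)
print(trace)  # prints [1, 1, 2]
```

In this sketch the common level stays at the previous version until the last node has been upgraded, which is why the acting version for the entire cluster can change only after every node runs the new software.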
Often, the cluster being upgraded contains a large number of nodes. As the cluster size increases, so does the likelihood that a node fails and becomes inaccessible during an upgrade or patching process. The administrator may be able to upgrade a subset of the nodes while other nodes remain inaccessible. For example, a node may become inaccessible due to a fire, hardware or software issues, or a power disruption. When one or more nodes of the cluster are inaccessible, the administrator must terminate the cluster upgrade process and downgrade all the nodes. Administrators cannot remove the inaccessible nodes from the cluster in the middle of an upgrade.
Downgrading is a manual, non-rolling process: the administrator must take down the entire cluster, resulting in a full cluster outage. After completing the downgrade, the administrator can start up the older software version and remove the inaccessible nodes. After removing the inaccessible nodes, the upgrade process can be restarted with the reduced cluster size. For example, with one inaccessible node in a cluster of size n, the administrator potentially performs n−1 steps to downgrade the accessible nodes, one step to remove the inaccessible node, and another n−1 steps to redo the upgrade.
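The cost of this recovery path can be tallied with a short sketch. It assumes the worst case in which every accessible node had already been upgraded before the failure was discovered, and it extends the single-node example to several inaccessible nodes by counting one removal step per node; the function name `recovery_steps` and that extension are illustrative assumptions.

```python
def recovery_steps(cluster_size, inaccessible=1):
    """Worst-case manual steps to downgrade, remove failed nodes, and re-upgrade."""
    if not 0 < inaccessible < cluster_size:
        raise ValueError("inaccessible must be between 1 and cluster_size - 1")
    reachable = cluster_size - inaccessible
    return {
        "downgrade": reachable,   # roll back every node that was upgraded
        "remove": inaccessible,   # assumption: one removal step per failed node
        "upgrade": reachable,     # redo the upgrade on the reduced cluster
        "total": 2 * reachable + inaccessible,
    }


print(recovery_steps(10)["total"])  # prints 19
```

For one inaccessible node in a cluster of size n, the total is (n−1) + 1 + (n−1) = 2n−1 steps, so the recovery effort grows linearly with the cluster size even though only a single node failed.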
Unfortunately, downgrading the cluster and taking the entire cluster out of service can severely impact productivity and is unacceptable in most business-critical environments.