Large-scale networked systems are commonly employed in a variety of settings for running service applications and maintaining data for business and operational functions. For instance, a data center within a networked system may support operation of a variety of differing service applications (e.g., web applications, email services, search engine services, etc.). These networked systems typically include a large number of nodes distributed throughout one or more data centers, in which each node represents a physical machine or a virtual machine running on a physical host. Due partly to the large number of nodes that may be included within such large-scale systems, rolling out an update to program components of one or more service applications can be a time-consuming and costly process.
Like other software, the service applications running on these networked systems require updates to the program components installed on the nodes of the data centers. Therefore, it is necessary to implement a process that takes the program components within the nodes offline, installs new version(s) of software, and brings the program components back online. Generally, taking down a large number of program components simultaneously will create unavailability issues with the service application(s) experiencing the update.
Accordingly, highly available service applications, which have their underlying software updated at various times during their operational lifetime, require that a large portion of the service application remain online while the update is occurring. As the service applications grow in complexity and load, the process for conducting an update should include the ability to test features of new versions of software, to limit risk by phasing the rollout of an update, and to retract the rollout of the update (i.e., rollback) if failures are detected during testing. This is especially true if the service application shares state between program components (e.g., role instances) that are running different versions of software. Presently, conventional solutions are unable to achieve the three attributes above while, at the same time, maintaining the service application as highly available to clients. Instead, these conventional solutions are ad-hoc techniques that either permeate the service as a whole, creating loss of performance and limiting availability, or treat the mid-update state as an aberration with significant loss of capability.
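By way of illustration only, the three attributes named above (testing the new version, phasing the rollout, and rolling back upon failure) can be sketched as follows. This is a minimal sketch under stated assumptions, not a description of any particular system: the `Node` class, the `phased_rollout` function, the `batch_size` parameter, and the `health_check` callback are all hypothetical names introduced here for the example.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Node:
    """Hypothetical stand-in for a node hosting one program component."""
    name: str
    version: str
    online: bool = True


def phased_rollout(
    nodes: List[Node],
    new_version: str,
    batch_size: int,
    health_check: Callable[[Node], bool],
) -> bool:
    """Update nodes one batch at a time; roll back every updated node on failure.

    Only `batch_size` nodes are taken offline at once, so the remaining
    nodes stay available to clients during the update.
    Returns True if the whole rollout succeeded, False if it was rolled back.
    """
    previous: Dict[str, str] = {}  # node name -> version before the update

    for start in range(0, len(nodes), batch_size):
        batch = nodes[start:start + batch_size]

        # Phase the rollout: take down, install, and bring back one batch.
        for node in batch:
            node.online = False
            previous[node.name] = node.version
            node.version = new_version
            node.online = True

        # Test the new version on this batch before continuing.
        if not all(health_check(node) for node in batch):
            # Rollback: restore the recorded version on every updated node.
            for node in nodes:
                if node.name in previous:
                    node.online = False
                    node.version = previous[node.name]
                    node.online = True
            return False

    return True


if __name__ == "__main__":
    fleet = [Node(f"node-{i}", "v1") for i in range(6)]
    ok = phased_rollout(fleet, "v2", batch_size=2, health_check=lambda n: True)
    print(ok, [n.version for n in fleet])
```

Note that this sketch does not address the harder case raised above, in which program components at different versions share state while the update is in progress.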