Large-scale networked systems are commonplace and are employed in a variety of settings for running applications and maintaining data for business and operational functions. For instance, a data center may provide a variety of services (e.g., web applications, email services, search engine services, etc.). These large-scale networked systems typically include a large number of nodes distributed throughout the data center, in which each node represents a physical machine or a virtual machine running on a physical host. Due partly to the large number of nodes that may be included within such large-scale systems, deployment of software (both operating systems (OSs) and applications) to the various nodes and maintenance of the software on each node can be a time-consuming and costly process.
Traditionally, software is installed and upgraded locally, in place, on each node, such that installation and updates are specific to the individual nodes. Because the nodes install the software upgrades individually, there is a real likelihood of failure or variability in the outcome of the installation. Further, other node-specific operations, such as servicing or customization, may also be performed on the individual nodes. These operations can change the state of the operating system that is running on a node, and they often introduce indeterminism in the operating system state (as measured from node to node). Further, operations applied specifically to each individual node may cause reliability and repeatability issues because each operation is repeated many times, thus increasing the chance of failure.
Accordingly, when updating thousands of nodes, there is no guarantee that all of the nodes will be running software consistently or providing a similar operating system state. For instance, changes to a local software state (e.g., operating system configuration state) may occur due to human or software errors. Often, state changes cause the behavior of the node to become unpredictable. Also, there is no guarantee that each node will achieve a successful update.
By way of example, consider two machines that each receive a servicing package, which is installed on each machine individually. Upon installing the package on the two different machines, there is no real guarantee that, upon completion of the installation, both machines will reboot in exactly the same state. This is often caused by not knowing or accounting for a difference in the initial state of each machine, or by numerous other factors that can make the machines distinct. Thus, it is indeterminate what the final state of the machines will be. Because there is no guarantee of consistency between the machines, a service application running thereon may execute unpredictably and provide various users of the service application an inconsistent experience.
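The two-machine example above can be illustrated with a minimal sketch. Everything in it is hypothetical and chosen only for illustration: the function `install_package`, the state keys `libfoo` and `log_level`, and the version strings do not correspond to any real package or system. The sketch assumes a servicing step whose effect depends on what is already present on the machine, which is one way the same package may leave two machines in different final states.

```python
from copy import deepcopy


def install_package(state: dict) -> dict:
    """Hypothetical in-place servicing step whose effect depends on prior state."""
    new_state = deepcopy(state)
    # The package writes its default setting only if no value already exists.
    new_state.setdefault("log_level", "info")
    # The package upgrades a library only when the expected version is present.
    if new_state.get("libfoo") == "1.0":
        new_state["libfoo"] = "1.1"
    new_state["package_installed"] = True
    return new_state


def diff_states(a: dict, b: dict) -> dict:
    """Return the keys whose values differ, mapped to the (a, b) value pair."""
    keys = sorted(set(a) | set(b))
    return {k: (a.get(k), b.get(k)) for k in keys if a.get(k) != b.get(k)}


# Two machines that drifted apart before the servicing package arrived
# (assumed initial states, for illustration only).
machine_a = {"libfoo": "1.0"}
machine_b = {"libfoo": "0.9", "log_level": "debug"}

final_a = install_package(machine_a)
final_b = install_package(machine_b)

# The installation reports success on both machines, yet their states differ.
print(diff_states(final_a, final_b))
```

In this sketch both installations complete, but the machines end with different library versions and different settings, because the outcome of the in-place step depends on the unaccounted-for initial state of each machine.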
As such, the current solutions for installing software applications, which rely on curators of the data center to manually install the software applications individually, are ad hoc, labor-intensive, and error-prone. Further, these current solutions do not guarantee a reliable result that is consistent across the data center. These shortcomings of manual involvement are exacerbated when the data center is expansive in size, comprising a multitude of interconnected hardware components that support the operation of a multitude of software applications.