Modern distributed computing systems comprise components that are combined to achieve efficient scaling of distributed computing resources, distributed data storage resources, distributed networking resources, and/or other resources. Such distributed computing systems have evolved in such a way that incremental linear scaling can be accomplished in many dimensions. The resources in a given distributed computing system are often grouped into resource subsystems such as clusters, datacenters, or sites. The resource subsystems can be defined by physical and/or logical boundaries. For example, a cluster might comprise a logically bounded set of nodes associated with a certain department of an enterprise, while a datacenter might be associated with a particular physical geographical location. Modern clusters in a distributed computing system might support one hundred or more nodes that in turn support several thousand or more autonomous virtualized entities (VEs). The VEs in distributed computing systems might be virtual machines (VMs) in hypervisor-assisted virtualization environments and/or executable containers in operating system virtualization environments.
Components of the distributed computing systems (e.g., motherboards, motherboard integrated circuits, storage devices, network adapters, etc.) often employ firmware to facilitate operation of the components. For example, the motherboard, network interface card, hard disk drive (HDD), and/or other components associated with each of the hundreds of nodes in a cluster can each have its own respective set of firmware. The components, associated firmware images, and firmware management software tools can be delivered by multiple vendors, each vendor delivering firmware and tools pertaining to that vendor's component or components. The vendor-specific firmware tools and firmware management methods can vary greatly. Further, the firmware for a given component may undergo several updates or revisions over the life cycle of the component, some of which are deemed “critical” to proper operation of the component. For example, a critical update may address an issue pertaining to the proper operation and/or security of the component.
Unfortunately, use of vendor-specific techniques to manage firmware in a distributed computing system presents limitations at least as pertaining to efficiently updating component firmware from multiple vendors in the system. Specifically, use of vendor-provided tools relies on the system administrator to understand and use the vendor-specific tools for a given component to be upgraded. Implementing such an approach across a distributed computing system that has a large number of components from numerous vendors can consume significant human and computing resources and introduce availability, security, and/or other risks into the system. For example, running a particular vendor-specific firmware management tool for a given component in a node might require a system administrator to bring down the node in order to change its operating system environment to perform a firmware update. The node can then be brought back up by rebooting it in the prior operating system environment. All of the aforementioned approaches present challenges for managing the entire corpus of highly dynamic firmware updates.
Specifically, use of the aforementioned vendor-specific techniques often negatively impacts system resource performance and/or availability. With such techniques, for example, the VEs and associated workloads on the node or nodes that are being updated are rendered unavailable during the update process, thus reducing computing resource availability and possibly degrading the user experience. Also, running the vendor-specific tools on certain nodes selected to perform the firmware operations may result in a resource imbalance in the system. In some cases, the selected nodes might fail to complete certain operations due to, for example, insufficient memory and/or storage space. What is needed is a way to schedule resources for performing firmware updates.
Still further, nodes may utilize a variety of software components, including operating systems, hypervisors, and/or other software applications. These software components may also have updates from time to time, and it may be desirable to update software components in a computing system. In some examples, there may be dependencies between software and firmware updates. For example, it may be desirable to perform an operating system or hypervisor update prior to installation of a particular firmware update. Moreover, software components often have dependencies on other software packages. For example, an independently updatable software entity may have a dependency on a specific version of the underlying operating system. Managing updates of software and firmware components across a distributed computing system may accordingly be challenging.
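By way of illustration only, the ordering constraint implied by such dependencies can be sketched as a topological sort over a dependency graph. The following minimal Python sketch assumes that each update can be named by a string and that its prerequisites are known in advance; the update names, the `prerequisites` mapping, and the `plan_update_order` function are hypothetical and do not come from any particular vendor tool or from the systems described above.

```python
from graphlib import CycleError, TopologicalSorter  # standard library, Python 3.9+

# Hypothetical update names. Each key maps to the set of updates that
# must be installed before it.
prerequisites = {
    "nic-firmware-2.1": {"hypervisor-5.0"},      # firmware needs the newer hypervisor
    "hypervisor-5.0": {"host-os-8.4"},           # hypervisor needs the newer OS
    "monitoring-agent-3.2": {"host-os-8.4"},     # application needs the newer OS
    "host-os-8.4": set(),                        # no prerequisites
}


def plan_update_order(prereqs):
    """Return an install order in which every update follows its prerequisites."""
    try:
        return list(TopologicalSorter(prereqs).static_order())
    except CycleError as exc:
        # exc.args[1] holds the list of nodes that form the detected cycle.
        raise ValueError(f"circular update dependency: {exc.args[1]}") from exc


order = plan_update_order(prerequisites)
print(order)
```

In this sketch the operating system update is always placed before the hypervisor update, which in turn precedes the firmware update. The relative position of unrelated updates is not fixed, and a circular dependency is reported as an error rather than silently ordered.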