Many mission-critical computing applications rely on network-accessible services, e.g., using virtualized resources at cloud-based provider networks. In order to support such applications, the operators of the provider networks may utilize large fleets of hardware servers, which may sometimes comprise thousands of hosts spread over many data centers in many different geographical locations. At least some of the programs used to implement such services, including, for example, various administrative virtualization management-related programs run locally at virtualization hosts, may be expected to run continuously for long periods of time (e.g., weeks, months or even years) without restarts to support targeted availability levels for customer applications. If such administrative programs are restarted, the customer application programs that rely on the administrative programs may experience unacceptable service interruptions.
As with most programs, updates to the long-running programs may be required at various points in time, e.g., due to the identification of defects and corresponding fixes. Many new software builds of the long-running programs may be developed over time, representing functional enhancements, support for newer hardware, defect removals, and so forth. For a variety of reasons, including the longevity of the programs, the potentially large number of execution platforms in the fleets (at which different updates may have been applied at different times), and the possibility that “hot-patching” techniques, which allow in-memory versions of the programs to be modified, may have been used at some or all execution platforms, it may not be straightforward to determine exactly which versions of the long-running programs are running at any given platform within the fleet. Such situations may be especially likely in scenarios in which new versions of the long-running programs are produced and deployed at a high rate.
Even in scenarios in which records or logs of applied software changes at various execution platforms are maintained, some of the records may sometimes be lost or may contain errors. When deciding whether a given long-running program contains a particular defect, and whether the program may therefore require a specific version-dependent remedial action, administrators may thus be faced with a non-trivial challenge, especially when the defect could compromise the security of customer applications or infrastructure components.
While embodiments are described herein by way of example for several embodiments and illustrative drawings, those skilled in the art will recognize that embodiments are not limited to the embodiments or drawings described. It should be understood that the drawings and detailed description thereto are not intended to limit embodiments to the particular form disclosed, but on the contrary, the intention is to cover all modifications, equivalents and alternatives falling within the spirit and scope as defined by the appended claims. The headings used herein are for organizational purposes only and are not meant to be used to limit the scope of the description or the claims. As used throughout this application, the word “may” is used in a permissive sense (i.e., meaning having the potential to), rather than the mandatory sense (i.e., meaning must). Similarly, the words “include,” “including,” and “includes” mean including, but not limited to. When used in the claims, the term “or” is used as an inclusive or and not as an exclusive or. For example, the phrase “at least one of x, y, or z” means any one of x, y, and z, as well as any combination thereof.