Many companies and other organizations operate computer networks that interconnect numerous computing systems to support their operations, such as with the computing systems being co-located (e.g., as part of a local network) or instead located in multiple distinct geographical locations (e.g., connected via one or more private or public intermediate networks). For example, data centers housing significant numbers of interconnected computing systems have become commonplace, such as private data centers that are operated by and on behalf of a single organization, and public data centers that are operated by entities as businesses to provide computing resources to customers. Some public data center operators provide network access, power, and secure installation facilities for hardware owned by various customers, while other public data center operators provide “full service” facilities that also include hardware resources made available for use by their customers. However, as the scale and scope of typical data centers have increased, the tasks of provisioning, administering, and managing the physical computing resources have become increasingly complicated.
The advent of virtualization technologies for commodity hardware has provided benefits with respect to managing large-scale computing resources for many customers with diverse needs, allowing various computing resources to be efficiently and securely shared by multiple customers. For example, virtualization technologies may allow a single physical computing machine or host to be shared among multiple users by providing each user with one or more virtual machines hosted by the single physical computing machine. Such virtual machines may be considered the equivalent of software simulations of distinct logical computing systems, providing users with the illusion that they are the sole operators and administrators of a given hardware computing resource, while also providing application isolation and security among the various virtual machines. Furthermore, some virtualization technologies are capable of providing virtual resources that span two or more physical resources, such as a single virtual machine with multiple virtual processors that spans multiple distinct physical computing systems.
Many providers of cloud-based infrastructure have implemented very large data centers with thousands of physical hosts, typically using commodity hardware, with many or all of the hosts arranged or mounted in rack configurations. As the number of hosts and racks in a given provider's fleet grows, the absolute number of failures of various kinds that are encountered in a given interval, including software failures, hardware failures, power supply-related failures, and the like, may increase simply as a result of the larger total population of devices in the fleet. At the same time, users of such environments have come to expect very high availability levels for the applications built using the cloud-based infrastructure. The impact on the availability or uptime of the virtual machines may vary by the type of failure. Since a given rack may hold dozens of hosts or devices, a rack-level failure event (for example, a network switch failure or a failure of a power distribution unit) may result in correlated outages of large numbers of virtual machines at the hosts mounted on the rack. The negative consequences of such correlated failures may be exacerbated by the fact that it may in some cases take operator intervention to diagnose and fix the failure, and as a result the down time for the affected virtual machines may reach unacceptable levels for many applications and users.
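The scaling argument above can be illustrated with a minimal sketch. The fleet dimensions, the per-host failure probability, and the names `Fleet`, `expected_host_failures`, and `vms_down_on_rack_failure` are all hypothetical and chosen only for illustration; none of them comes from the description itself. The sketch assumes independent host failures with a uniform probability per interval, which is a simplification.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Fleet:
    """Hypothetical fleet layout; all values are illustrative assumptions."""
    racks: int
    hosts_per_rack: int
    vms_per_host: int

    @property
    def hosts(self) -> int:
        # Total number of physical hosts across all racks.
        return self.racks * self.hosts_per_rack


def expected_host_failures(fleet: Fleet, host_failure_prob: float) -> float:
    """Expected number of host failures in one interval.

    Assumes each host fails independently with the same probability,
    so the expectation grows linearly with the host population.
    """
    return fleet.hosts * host_failure_prob


def vms_down_on_rack_failure(fleet: Fleet) -> int:
    """Virtual machines affected when a single rack-level component
    (for example, a switch or power distribution unit) fails."""
    return fleet.hosts_per_rack * fleet.vms_per_host


small = Fleet(racks=10, hosts_per_rack=40, vms_per_host=10)
large = Fleet(racks=100, hosts_per_rack=40, vms_per_host=10)

# Same per-host probability, ten times the hosts: ten times the expected failures.
print(expected_host_failures(small, 0.01), expected_host_failures(large, 0.01))

# One rack-level event takes down every virtual machine on that rack at once.
print(vms_down_on_rack_failure(large))
```

Under these assumptions, the per-host reliability is unchanged between the two fleets, yet the larger fleet sees proportionally more failures per interval, and a single rack-level event produces a correlated outage of hundreds of virtual machines.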
While embodiments are described herein by way of example for several embodiments and illustrative drawings, those skilled in the art will recognize that embodiments are not limited to the embodiments or drawings described. It should be understood that the drawings and detailed description thereto are not intended to limit embodiments to the particular form disclosed, but on the contrary, the intention is to cover all modifications, equivalents and alternatives falling within the spirit and scope as defined by the appended claims. The headings used herein are for organizational purposes only and are not meant to be used to limit the scope of the description or the claims. As used throughout this application, the word “may” is used in a permissive sense (i.e., meaning having the potential to), rather than the mandatory sense (i.e., meaning must). Similarly, the words “include,” “including,” and “includes” mean including, but not limited to.