Existing systems use virtualization to share the resources of a modern datacenter. The datacenter may have a wide range of hardware components such as servers, storage devices, communication equipment, and the like, organized into clusters. Virtualization of the datacenter allows multiple guest operating systems to run in virtual machines (VMs) on a single host, sharing the underlying physical hardware of the host, as well as sharing access to a datastore accessible to the host.
Some existing systems include monitoring features that restart an individual VM if expected communications (e.g., “heartbeats”) are not received from the VM within a configurable time window. When heartbeats are missed, inputs and outputs (I/Os) of the VM are monitored for a further configurable time window to determine whether the VM is still in an operational state. If no I/Os are detected during that window, a failure is presumed and the VM is reset to remediate the failure.
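The two-stage check described above can be sketched as follows. This is a minimal illustration only, not the implementation of any particular product: the names `MonitorConfig`, `heartbeat_window`, `io_window`, and `should_reset` are hypothetical, and timestamps are assumed to be plain seconds on a shared clock.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class MonitorConfig:
    """Hypothetical configuration for the two monitoring windows (seconds)."""
    heartbeat_window: float  # time without a heartbeat before the VM is suspect
    io_window: float         # further time during which I/O activity is watched


def should_reset(last_heartbeat: float,
                 io_times: Sequence[float],
                 now: float,
                 cfg: MonitorConfig) -> bool:
    """Return True if a failure is presumed and the VM should be reset.

    Assumption: I/O only counts as evidence of an operational VM if it
    occurs inside the I/O window that follows the missed-heartbeat deadline.
    """
    heartbeat_deadline = last_heartbeat + cfg.heartbeat_window
    io_deadline = heartbeat_deadline + cfg.io_window

    # Both windows must fully elapse before a failure can be presumed.
    if now < io_deadline:
        return False

    # Any I/O observed during the I/O window means the VM is treated as alive.
    saw_io = any(heartbeat_deadline <= t <= io_deadline for t in io_times)
    return not saw_io
```

Under these assumptions, a reset can never be triggered sooner than `heartbeat_window + io_window` after the last heartbeat. For example, with illustrative values of 30 seconds and 120 seconds, at least 150 seconds pass before remediation begins.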
Because the monitoring features of these existing systems are often deeply integrated with VM heartbeating processes, there is no definitive way to determine whether the guest operating system (OS) has crashed or whether only the VM heartbeating process has crashed. As a result, decisions to remediate failures may be based on false positives, in which the guest OS is still operational but the VM heartbeating process has crashed. Additionally, remediating a failure through the heartbeating and I/O monitoring cycles takes a significant amount of time.