A cloud-computing environment can comprise a large number (e.g., hundreds of thousands) of servers (or “nodes”), each configured to execute several virtual machines. To deliver the uptime promised to customers in service level agreements (SLAs), the nodes are kept operational, with as little downtime as possible. Various factors can cause downtime, such as planned maintenance, system failures, and communication failures. The component that causes the downtime also determines the scope of its effect. For example, when a virtual machine (an emulation of a computer system) fails, only that one virtual machine is affected. When a node fails, however, all of the virtual machines hosted on that node are affected. To meet the SLAs contracted to customers, the downtime caused by higher-level components, such as nodes, should be kept to a minimum.
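The difference in scope between a virtual machine failure and a node failure can be illustrated with a minimal sketch. The `Node` class, the `affected_vms` function, and the node and virtual machine names below are hypothetical and serve only as an illustration; they are not drawn from any particular cloud platform.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """Hypothetical model of a server and the virtual machines it hosts."""
    name: str
    vms: List[str] = field(default_factory=list)


def affected_vms(nodes: List[Node], failed_component: str) -> List[str]:
    """Return the virtual machines affected when the named component goes down."""
    for node in nodes:
        if node.name == failed_component:
            # Node failure: every virtual machine hosted on the node is affected.
            return list(node.vms)
        if failed_component in node.vms:
            # Virtual machine failure: only that virtual machine is affected.
            return [failed_component]
    # Unknown component: nothing in this model is affected.
    return []


# Illustrative inventory (assumed names).
nodes = [
    Node("node-a", ["vm-1", "vm-2", "vm-3"]),
    Node("node-b", ["vm-4"]),
]

print(affected_vms(nodes, "vm-2"))    # ['vm-2']
print(affected_vms(nodes, "node-a"))  # ['vm-1', 'vm-2', 'vm-3']
```

In this sketch, the failure of a single node affects as many virtual machines as that node hosts, which is why downtime at the node level weighs more heavily against the SLAs than downtime of an individual virtual machine.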