The present invention relates generally to efficiently managing jobs that fail in a virtualized computing environment. In particular, the invention relates to determining whether a job should be restarted on the same active-mode virtual machine or on a different active-mode or DVFS-mode (Dynamic Voltage and Frequency Scaling mode) virtual machine.
Scheduling a software job to run on a virtual machine in a virtualized computing environment may comprise placing that job in a queue associated with that virtual machine. The virtual machine may, however, be unable to perform the queued job in a satisfactory way if the virtual machine subsequently suffers degraded performance or fails in some other way. When such a failure occurs, predefined “proactive scheduling” rules may determine whether to pause the job in the current queue until the virtual machine recovers or to transfer the job to a queue of a different virtual machine.
Many factors may affect the efficiency of such decisions. The amount of lead time or buffer time allowed for performance of a queued job, known repair rates or failure rates of particular virtual machines, constraints imposed by quality-of-service (QoS) commitments, and other factors can affect whether conventional proactive-scheduling rules determine the most desirable response when a failure of a virtual machine threatens the timely performance of a scheduled job.
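By way of illustration only, a simple proactive-scheduling rule of the kind described above might be sketched as follows. The sketch is not taken from any particular implementation: the names Job, VirtualMachine, and choose_action, the use of minutes as the time unit, and the "escalate" fallback are all assumptions made for this example. It weighs only two of the factors mentioned, the buffer time remaining before a deadline and the expected repair time of the failed virtual machine.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Job:
    name: str
    run_time: float   # estimated minutes needed to run the job (hypothetical field)
    deadline: float   # minutes remaining until the job must be complete


@dataclass
class VirtualMachine:
    name: str
    mean_repair_time: float  # expected minutes until a failed VM recovers
    queue_wait: float        # minutes of work already waiting in this VM's queue


def choose_action(
    job: Job,
    failed_vm: VirtualMachine,
    candidates: List[VirtualMachine],
) -> Tuple[str, Optional[str]]:
    """Illustrative rule: pause on the failed VM if it is expected to
    recover in time, otherwise transfer to the least-loaded VM that can
    still meet the deadline."""
    # Pausing works only if repair time plus run time fits within the deadline.
    if failed_vm.mean_repair_time + job.run_time <= job.deadline:
        return ("pause", failed_vm.name)

    # Otherwise, keep only the candidate VMs whose queues leave enough time.
    viable = [
        vm for vm in candidates
        if vm.queue_wait + job.run_time <= job.deadline
    ]
    if viable:
        best = min(viable, key=lambda vm: vm.queue_wait)
        return ("transfer", best.name)

    # Assumed fallback: no option meets the deadline, so flag the job.
    return ("escalate", None)
```

A rule this simple ignores failure rates, QoS commitments, and the other factors listed above, which is one way in which a fixed rule can produce a less desirable response than the circumstances warrant.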