Collections of powerful, multi-processor computing devices, typically organized into professionally managed data centers, can make available their processing capabilities to purchasers, thereby facilitating the performance of tasks that could not efficiently be performed otherwise. The more processing that can be accommodated by such a collection of computing devices, the more value can be realized therefrom. Consequently, it is desirable to utilize as much of the data processing capability of a collection of computing devices as possible without negatively impacting those purchasers of such processing capabilities, who seek to utilize such collection of computing devices to perform data processing.
To maximize the utilization of collections of computing devices, such as in a data center context, schedulers typically schedule processing to be performed, generally in the form of discrete applications or processes to be executed or tasks to be performed, on one or more such computing devices. For ease of reference, and in accordance with the terminology used by those of skill in the art, the term “processing jobs”, or, more simply, “jobs”, will be utilized herein to refer to discrete processing tasks that can be individually and independently scheduled and executed. Processing schedulers seek to ensure that processing capabilities of computing devices do not remain unused so long as processing jobs remain to be scheduled. In scheduling processing jobs, processing schedulers typically take into account factors directed to the priority of the job, such as whether a job can be delayed or must be executed immediately, or whether a job must be continuously available or is sufficiently robust to withstand downtime. Processing schedulers can also take into account the location of data that may be processed by, or otherwise consumed by, a job, to avoid inefficiencies associated with the copying of large volumes of data.
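By way of illustration only, the scheduling considerations described above, namely job priority and the location of the data consumed by a job, can be sketched as a simple greedy procedure. The sketch below is a minimal example under stated assumptions and is not drawn from any particular scheduler: the names Job, Node and schedule, the integer priority scale, and the notion of "free slots" are all hypothetical and introduced solely for this example.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Job:
    name: str
    priority: int        # hypothetical scale: larger value = more urgent
    data_location: str   # name of the node assumed to hold the job's data


@dataclass
class Node:
    name: str
    free_slots: int      # hypothetical measure of unused processing capability


def schedule(jobs: List[Job], nodes: List[Node]) -> Dict[str, str]:
    """Greedily map job names to node names.

    Jobs are considered in descending priority order. Each job is placed on
    the node holding its data when that node has a free slot; otherwise it is
    placed on the first node with a free slot. Jobs that cannot be placed are
    left out of the result. Note that this function decrements free_slots on
    the Node objects passed in.
    """
    by_name = {node.name: node for node in nodes}
    assignments: Dict[str, str] = {}
    for job in sorted(jobs, key=lambda j: -j.priority):
        local = by_name.get(job.data_location)
        if local is not None and local.free_slots > 0:
            target = local  # avoids copying the job's data elsewhere
        else:
            target = next((n for n in nodes if n.free_slots > 0), None)
        if target is None:
            continue  # no unused capacity remains for this job
        target.free_slots -= 1
        assignments[job.name] = target.name
    return assignments


if __name__ == "__main__":
    jobs = [
        Job("batch", priority=1, data_location="node-a"),
        Job("web", priority=5, data_location="node-a"),
        Job("etl", priority=3, data_location="node-b"),
    ]
    nodes = [Node("node-a", free_slots=1), Node("node-b", free_slots=2)]
    print(schedule(jobs, nodes))
```

In this sketch, the highest-priority job claims the single slot on the node holding its data, and a lower-priority job whose data resides on that same node is placed elsewhere, incurring the data-copying cost that a scheduler would otherwise seek to avoid. Like the schedulers described above, the sketch assumes that every node remains available for the duration of the job.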
Unfortunately, even professionally maintained and robustly designed computing devices experience failures that negatively impact their ability to perform processing. For example, hard disk drives that utilize spinning magnetic media can fail due to damage to the media itself, or damage to the mechanical mechanisms that facilitate the reading of data from such media or the writing of data to such media. As another example, solid-state storage media can become unusable due to electrical failures that can negatively impact the ability of such solid-state storage media to retain, and recall, digital data. Other aspects and components of computing devices can, likewise, experience failures. Typically, however, failures are only dealt with reactively, such as, for example, by maintaining redundancy such that the data and processing lost due to such failures are minimized. From a scheduling perspective, therefore, jobs are scheduled as if the computing devices will never fail, with the damage from the failures that inevitably occur simply being minimized by the above-referenced redundancies.