A cluster refers to a group of computing machines. For example, FIG. 1 illustrates a system 100 with a cluster 110 that includes a plurality of server racks 120. Each server rack 120 further includes a plurality of machines or servers 130. For example, one rack 120 may hold approximately 40-80 machines 130. The cluster 110 may therefore include thousands, tens of thousands, or more machines 130. A cluster switch 115 may be used for communication among machines 130 in the cluster 110. According to some examples, the cluster switch 115 may be a set of switches.
Clusters support jobs composed of many tasks. These jobs may have requirements, such as system resources (e.g., how much processing capability, memory, etc., is or should be used to perform the job or task) and system constraints (e.g., data objects, machine types or pools of machines, software to run, preferences, etc.). Job schedulers can schedule the jobs onto one or more clusters, each of which may include many computing machines.
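By way of illustration only, the distinction between resource requirements and system constraints might be sketched as follows. The names `Machine`, `Task`, `fits`, and `candidate_machines`, and the particular fields chosen, are hypothetical and are not taken from any particular scheduler; the sketch assumes that a task is placeable on a machine only when the machine has enough processing and memory capacity and also satisfies the task's machine-type and software constraints.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Machine:
    # Hypothetical description of one machine 130 in the cluster.
    name: str
    cpu_cores: float
    memory_gb: float
    machine_type: str
    software: frozenset = frozenset()


@dataclass(frozen=True)
class Task:
    # Resource requirements (how much the task needs).
    cpu_cores: float
    memory_gb: float
    # System constraints (where the task is allowed to run).
    # An empty set of allowed machine types is assumed to mean "any type".
    allowed_machine_types: frozenset = frozenset()
    required_software: frozenset = frozenset()


def fits(task: Task, machine: Machine) -> bool:
    """Return True if the machine meets the task's resources and constraints."""
    if task.cpu_cores > machine.cpu_cores or task.memory_gb > machine.memory_gb:
        return False
    if task.allowed_machine_types and machine.machine_type not in task.allowed_machine_types:
        return False
    # Every required software package must be present on the machine.
    return task.required_software <= machine.software


def candidate_machines(task: Task, machines: list) -> list:
    """Return the names of machines on which the task could be scheduled."""
    return [m.name for m in machines if fits(task, m)]
```

In this sketch, a job scheduler would call `candidate_machines` for each task of a job and then choose among the returned machines according to its own policy, which is left unspecified here.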
Typically, jobs can be disrupted (e.g., stopped, evicted, halted, etc.) either deliberately or as a result of unplanned events and failures. Different job schedulers may also compete for resources, and one scheduler can disrupt another scheduler's jobs, for example, by preempting that scheduler's tasks to make space for its own. When many of these job disruptions occur simultaneously, they can reduce the overall predictability of the system.
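As a minimal sketch of how such a preemption might disrupt another scheduler's tasks, consider a single machine with a fixed processing capacity. The names `RunningTask` and `preempt_for`, the use of integer priorities, and the lowest-priority-first eviction order are all assumptions made for illustration; real schedulers may apply different policies.

```python
from dataclasses import dataclass


@dataclass
class RunningTask:
    # Hypothetical record of a task placed on a machine.
    name: str
    scheduler: str   # which job scheduler owns the task
    priority: int    # assumed: larger value means more important
    cpu_cores: float


def preempt_for(new_task: RunningTask, running: list, capacity: float):
    """Try to place new_task on a machine with the given CPU capacity.

    Returns (tasks_after, evicted). Strictly lower-priority tasks are
    evicted, lowest priority first, until the new task fits. If it still
    cannot fit, nothing is evicted and the original task list is returned.
    """
    free = capacity - sum(t.cpu_cores for t in running)
    kept = list(running)
    evicted = []
    for victim in sorted(running, key=lambda t: t.priority):
        if free >= new_task.cpu_cores:
            break  # enough space has been freed
        if victim.priority >= new_task.priority:
            break  # remaining tasks are not preemptible by new_task
        kept.remove(victim)
        evicted.append(victim)
        free += victim.cpu_cores
    if free < new_task.cpu_cores:
        return list(running), []  # placement failed; no disruption occurs
    return kept + [new_task], evicted
```

Each entry in the returned `evicted` list corresponds to one disruption of a task, possibly one owned by a different scheduler. Counting such entries across machines over a period of time is one way the simultaneous disruptions described above could be observed.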