Large-scale distributed computing systems may be used to process data in efforts known as jobs. Each job may be made up of a plurality of tasks. A task may be considered a unit of computation and fault tolerance. For example, if a task is interrupted during computation, the task will have to be recomputed from the beginning or restarted from a previously check-pointed state.
Several factors may impact the amount of time needed to complete a job. The overall job running time may be the time from when the first task starts until the last required task finishes. An individual task's running time may depend on the availability of the task inputs, task computation time, and the time to produce/write output of the task.
Additionally, in a shared computing environment a task can be preempted or killed, for example by an unrelated machine failure, in which case it may need to be restarted. Other tasks may be slow, for example, because they are running on overloaded machines. Such failures and slow tasks can cause other tasks to become stragglers, meaning that the tasks are lagging behind the overall progress of the remaining tasks in the job.
MapReduce is an example of a high-level, large-scale data processing framework that uses jobs defined by a plurality of tasks to allow users to express their applications using map and reduce operators. The input data set may be divided into shards, each shard may be processed by a mapper that produces key-value pairs as map output, and the map output may then be sent to reducers, where all values with the same key are combined to produce a final value for each key. Each reducer may be responsible for a subset of the keys. The process by which data may be sent to reducers may be called shuffling, and results in each reducer getting, from every mapper, the key-value pairs for which that reducer is responsible. The MapReduce framework may be responsible for automatically partitioning the user-specified computation and executing it in parallel on a computer cluster. In this example, each task may be a map or reduce task. The instance of the MapReduce framework, together with the user-specified computation and the user-specified inputs and outputs, may be referred to as a job. A MapReduce job may be considered done when every reduce task finishes.
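The map/shuffle/reduce flow described above can be sketched in a few lines. This is a minimal, single-process illustration rather than an actual MapReduce implementation; the `run_mapreduce` helper, the shard contents, and the word-count map and reduce functions are all hypothetical.

```python
from collections import defaultdict

def run_mapreduce(shards, map_fn, reduce_fn):
    """Single-process sketch of the map, shuffle, and reduce phases."""
    # Map phase: each mapper processes one shard into key-value pairs.
    map_output = []
    for shard in shards:
        map_output.extend(map_fn(shard))

    # Shuffle phase: group values by key, as if each pair were routed
    # to the reducer responsible for that key.
    shuffled = defaultdict(list)
    for key, value in map_output:
        shuffled[key].append(value)

    # Reduce phase: combine all values for each key into a final value.
    return {key: reduce_fn(key, values) for key, values in shuffled.items()}

# Word-count example: the mapper emits (word, 1) pairs; the reducer sums them.
shards = ["the quick fox", "the lazy dog"]
result = run_mapreduce(
    shards,
    map_fn=lambda shard: [(word, 1) for word in shard.split()],
    reduce_fn=lambda key, values: sum(values),
)
```

In a real cluster, each map and reduce call would be a separate task running on its own machine, which is what makes individual tasks the unit of failure and restart.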
The running time of a job in a cluster, such as in the MapReduce example described above, may be dominated by the running time of straggler tasks. In some instances, more than half of the overall running time for a job may be spent processing the last 5% of tasks due to failures or other problems. To address such issues, failed tasks may be replaced and straggler tasks may be duplicated. In this regard, backup tasks may be run. For these backup tasks to be most effective, it is important to properly identify straggler and/or long-running tasks.
Previously, attempts have been made to select which backup tasks or backup candidates to run in a largely manual effort. For example, one would first define a small number of features relevant to the decision about whether to create a backup. These features may include the relative size of a task, the average processing rate for a task, whether the computer processing the task is slow, whether a computer is having issues reading the output of the task, whether the task is the last needed to complete the job, etc. The features are thus possibly dynamically changing properties of a given task that can be monitored and can be based on at least any statistics available during execution of a corresponding job. Features may also be continuous or discrete. Some features may be static (a property assigned at the beginning of a job that remains constant throughout the job), global (a property computed periodically that describes the state of the job as a whole and is the same for all backup tasks), or individual (a property computed continuously for each backup task). As an example, in the MapReduce context, a static feature may include “has-output-to-bigtable”=1, a global feature may include “map-phase-progress”=0.7, and an individual feature may include “number-of-shuffler-read-errors-for-output-from-this-task”=4.
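The three classes of features can be illustrated with a small sketch. The feature names are the examples given above; the `feature_vector` helper and the way the classes are merged are assumptions for illustration only.

```python
# Static feature: assigned at job start, constant for the whole job.
static_features = {"has-output-to-bigtable": 1}

# Global feature: recomputed periodically, identical for every task in the job.
global_features = {"map-phase-progress": 0.7}

# Individual feature: tracked continuously for this one task.
individual_features = {
    "number-of-shuffler-read-errors-for-output-from-this-task": 4,
}

def feature_vector(static, global_, individual):
    """Merge the three feature classes into one per-task feature vector."""
    merged = {}
    merged.update(static)
    merged.update(global_)
    merged.update(individual)
    return merged

vector = feature_vector(static_features, global_features, individual_features)
```

A vector like this, refreshed as the job runs, is the input to the backup decision described next.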
Each of these features is assigned a weighted value indicative of how important the feature is when determining whether to run backup tasks. These values have typically been assessed manually by referencing failed or straggler tasks in previous jobs. For example, while a job is being computed, the system may temporarily store information about the features for each task for display on a status page in a web browser. However, the information for each feature is typically not collected continuously. An operator reviewing the status page may notice that a job is stuck. In some systems, the computing system may send a message indicating that certain jobs are stuck. A human operator may then need to review the status information for the job in order to determine which tasks are problematic. The operator may then use this information to make an educated guess as to the importance of certain features to backup determinations. In this regard, an operator may manually select weights for different types of features. These weights can then be “hard coded”.
For example, when a new job is to be run, the computing system may calculate a weighted average of the feature scores for each task of the new job. The computer may then use the weighted average to rank the importance of running a backup for each task of that job. For example, if there are 10 available computers, and each is able to execute 2 tasks at a time, the 20 highest-ranked backups (according to the preselected, weighted sum of feature scores) can be automatically selected to run on those computers. In production, an operator may observe that a job is stuck with a given set of feature scores.
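The selection step described above can be sketched as follows. The feature names, hard-coded weights, task records, and helper functions are all hypothetical; the sketch only illustrates ranking tasks by a preselected weighted sum of feature scores and filling the available backup slots.

```python
# Hypothetical hard-coded feature weights, as an operator might set them.
WEIGHTS = {
    "relative-size": 0.5,
    "machine-is-slow": 2.0,
    "read-errors": 1.5,
}

def backup_score(features):
    """Weighted sum of feature scores for one task; unknown features count as 0."""
    return sum(WEIGHTS.get(name, 0.0) * value for name, value in features.items())

def select_backups(tasks, num_machines=10, slots_per_machine=2):
    """Rank tasks by score and pick as many backups as there are free slots."""
    capacity = num_machines * slots_per_machine  # e.g. 10 machines x 2 slots = 20
    ranked = sorted(tasks, key=lambda t: backup_score(t["features"]), reverse=True)
    return ranked[:capacity]

tasks = [
    {"id": 1, "features": {"machine-is-slow": 1}},   # score 2.0
    {"id": 2, "features": {"read-errors": 3}},       # score 4.5
    {"id": 3, "features": {}},                       # score 0.0
]
picked = select_backups(tasks, num_machines=1, slots_per_machine=2)
```

Because the weights are fixed in code, the only way to react to a stuck job is the manual adjustment described next.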
To address that problem, the operator may again manually adjust the feature weights and/or add new features to the code. Thus, the manually determined feature weights may be updated in the code over time. However, because the feature weights are fixed for a given job, or rather, set at the beginning of a job, they can only be changed for subsequent jobs.