Job processing frameworks that support concurrent computation are widely used because they can process large data sets in a timely and efficient manner. In such a framework, one job is split into a plurality of tasks executed in one or more stages. Tasks in each stage can be assigned to different computing nodes to run in parallel, thereby improving job processing efficiency. By way of example, MapReduce is a commonly used parallel job processing framework and has been applied in fields such as web access log analysis, document clustering, machine learning, data statistics, and statistics-based machine translation.
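By way of illustration only, the staged structure described above can be sketched as a word-count job with a map stage and a reduce stage. This is a minimal sketch, not the API of any particular framework: the names `map_task`, `reduce_task`, and `run_job` are hypothetical, and threads stand in for the separate computing nodes.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def map_task(split):
    """Map-stage task (hypothetical): count the words in one input split."""
    return Counter(split.split())


def reduce_task(partials):
    """Reduce-stage task (hypothetical): merge the map-stage partial counts."""
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total


def run_job(splits, workers=2):
    """Run one job as two stages; threads stand in for computing nodes."""
    # Stage 1: one map task per input split, executed in parallel.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        intermediate = list(pool.map(map_task, splits))
    # Stage 2: a single reduce task reads the intermediate results.
    return reduce_task(intermediate)
```

For example, `run_job(["a b a", "b c"])` returns a counter in which "a" and "b" each occur twice and "c" occurs once. The list `intermediate` is held in memory here for simplicity.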
When a job is processed in parallel, it usually needs support from computing or storage resources. For example, when a job is executed in multiple stages, the output of the previous stage (also called “intermediate results”) is generally written to the local disks of the computing nodes so that it can be read as input by tasks in the next stage. When job processing finishes, the output data of the job can be stored in a storage system, such as a distributed file system. For job processing that involves a large data set, whether the parallel processing system can support the resource overhead of the job is a concern for both the system administrator and the job developer. A job alert can be used to indicate problems with system resource overhead.
In known solutions for generating a job alert, when a job is submitted, the job processing system processes the job directly. During job processing, if the size of the intermediate results generated exceeds the available storage space of the local disk, or the size of the final output data of the job exceeds the available storage space of the storage system, the job processing system generates an alert. However, an alert generated only after a severe problem such as insufficient disk space or storage system space has already occurred cannot prevent the loss of or damage to intermediate tasks or output data of the job that the problem may cause. In such a case, the system administrator or the job developer cannot proactively avoid resource overhead problems such as insufficient storage space and instead has to passively repair the fault after learning of the problem from the alert, so the processing system fault cannot be dealt with in a timely manner. Furthermore, if the amount of intermediate results generated on a computing node is too large, it also places a heavy load on the I/O resources and computing resources (e.g., CPU processing) of the computing node.
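The reactive check performed by such known solutions can be sketched as follows. This is a minimal sketch under stated assumptions: the function name `reactive_alerts`, its parameters, and the alert messages are hypothetical, and all sizes are assumed to be byte counts measured while the job is already running.

```python
def reactive_alerts(intermediate_bytes, local_disk_free_bytes,
                    output_bytes, storage_free_bytes):
    """Return alerts for storage shortfalls that have already occurred.

    All names and messages are hypothetical. The sizes are observed during
    processing, so an alert here reports a problem rather than preventing it.
    """
    alerts = []
    # Intermediate results no longer fit on the node's local disk.
    if intermediate_bytes > local_disk_free_bytes:
        alerts.append("intermediate results exceed local disk space")
    # Final job output no longer fits in the storage system.
    if output_bytes > storage_free_bytes:
        alerts.append("job output exceeds storage system space")
    return alerts
```

Because both comparisons rely on sizes observed during processing, an alert from this check arrives only after the shortfall exists, which is the limitation described above.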