Workload schedulers are commonly used in computing systems to control execution of large numbers of work units (i.e., any activities suitable to be executed thereon, such as batch jobs). For this purpose, each workload scheduler arranges the work units in a workload plan. The workload plan defines a flow of execution of the work units according to corresponding constraints (for example, the expected execution times of the work units and dependencies on other work units).
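By way of illustration only, the flow of execution defined by a workload plan may be sketched as an ordering of the work units that respects their dependencies. The following minimal sketch assumes that the dependencies form a directed acyclic graph. The work unit names (such as "extract" and "print_report") and the function execution_order are hypothetical and are not taken from any specific workload scheduler.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical work units, each mapped to the set of work units it depends on.
dependencies = {
    "extract": set(),
    "transform": {"extract"},
    "load": {"transform"},
    "print_report": {"load"},
    "archive": {"extract"},
}


def execution_order(deps):
    """Return one execution order in which every work unit follows its dependencies."""
    return list(TopologicalSorter(deps).static_order())


order = execution_order(dependencies)
print(order)
```

Several valid orders may exist for the same dependencies; an actual workload scheduler would also take into account further constraints, such as the expected execution times of the work units.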
The workload plan is aimed at achieving one or more desired targets. For example, these targets include the completion of work units providing deliverable items (such as printed reports) within predefined workload deadlines. Meeting the targets of the workload plan is particularly important when the execution of the work units is required for other system/business activities, or when a Service Level Objective (SLO) has been negotiated between a service provider implementing the execution of the work units and its customers (wherein the service provider has committed to provide a corresponding service with a specific level of performance, especially in terms of reliability and responsiveness). Therefore, any problems in the execution of the work units that cause some targets of the workload plan to be missed may have quite serious consequences (for example, system/business outages or payment of penalties).
Critical path methods (CPMs) are available to facilitate the management of the workload plan. Generally, these critical path methods identify critical paths in the workload plan, each defined by the work units belonging to the longest path to one of the workload plan's targets (according to expected durations of the work units estimated from their previous executions). This information about the critical paths makes it possible to determine the impact on the whole workload plan of any problems that may be experienced in the execution of the work units (for example, a failure or a delay).
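By way of illustration only, the identification of a critical path may be sketched as a longest-path computation over the dependencies of the work units. The following minimal sketch assumes an acyclic plan and a single expected duration per work unit. The function critical_path, the work unit names and the durations (in minutes) are hypothetical, and the sketch does not represent the method of any specific workload scheduler.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def critical_path(durations, deps, target):
    """Return (total_duration, path) for the longest path ending at target."""
    finish = {}  # earliest finish time of each work unit
    prev = {}    # predecessor of each work unit on its longest path
    for unit in TopologicalSorter(deps).static_order():
        # The predecessor that finishes last determines when this unit can start.
        best = max(deps.get(unit, ()), key=lambda p: finish[p], default=None)
        prev[unit] = best
        finish[unit] = durations[unit] + (finish[best] if best is not None else 0)
    # Walk back from the target to rebuild the path.
    path = []
    node = target
    while node is not None:
        path.append(node)
        node = prev[node]
    return finish[target], path[::-1]


# Hypothetical expected durations (minutes) and dependencies.
durations = {"extract": 10, "transform": 25, "validate": 5, "print_report": 8}
deps = {
    "extract": set(),
    "transform": {"extract"},
    "validate": {"extract"},
    "print_report": {"transform", "validate"},
}

total, path = critical_path(durations, deps, "print_report")
print(total, path)  # 43 ['extract', 'transform', 'print_report']
```

In this example, a delay of the work unit "transform" directly delays the target, whereas the work unit "validate" may be delayed by up to 20 minutes without affecting it.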
The problems in the execution of the work units may have a number of causes (for example, errors or unavailable resources, either temporary or permanent). Whenever such problems occur, diagnostic activities are performed in an attempt to identify the cause of each problem and to either fix or bypass the problem. These diagnostic activities are quite time consuming, since they generally require deep investigations based mainly on manual activities. However, the human resources that are available for performing the diagnostic activities are generally limited and costly (due to the breadth and depth of the skills required). Therefore, solving the problems in a timely manner (especially, before the targets of the workload plan are missed) is quite challenging. Particularly, when multiple problems occur at the same time, it is very difficult to allocate the available human resources in the best way to limit the impacts of the problems on the workload plan.
In different contexts, several techniques have been proposed for managing errors. For example, a technique may be used for determining the impact of a failure of a component on one or more services that the component is supporting (according to real time data feeds that are received from processing nodes running the components and a corresponding mapping). A technique may be used for prioritizing error notifications based on a cost of each error type (depending on the importance of correcting the error type, the level of agreement between those who fix the errors and those who determine the importance of correcting the errors, and an estimate of other error types caused by the errors). A technique may be used for determining the impact of faults on network services (based on discovering devices in the network that are respectively connected to any specified device, to assist in performing an intended task, and then discovering each service that is configured to run on each of the devices).