Workload management is a common activity that is typically implemented by workload schedulers to manage the execution of large numbers of work units (for example, batch jobs) in computing systems. For this purpose, each workload scheduler arranges the work units into a workload plan. The workload plan defines a flow of execution of the work units according to corresponding constraints (for example, expected execution times and dependencies on other work units).
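Purely by way of illustration, a workload plan of this kind may be sketched as a directed acyclic graph of work units that is ordered according to its dependencies. The following is a minimal sketch under stated assumptions and not the behavior of any specific workload scheduler: the names `WorkUnit` and `build_plan`, the sample work units and their durations are all hypothetical, and execution resources are assumed to be unlimited so that only expected execution times and dependencies constrain the flow of execution.

```python
from dataclasses import dataclass
from graphlib import TopologicalSorter  # standard library, Python 3.9+


@dataclass(frozen=True)
class WorkUnit:
    """Hypothetical work unit: a name, an expected duration and its dependencies."""
    name: str
    expected_minutes: int
    depends_on: tuple = ()


def build_plan(work_units):
    """Return (order, finish) for the given work units.

    order  -- an execution order in which every work unit follows its dependencies
    finish -- earliest expected finish time of each work unit, in minutes from the
              start of the plan (assumption: unlimited execution resources)
    """
    by_name = {wu.name: wu for wu in work_units}
    # TopologicalSorter expects a mapping from each node to its predecessors.
    graph = {wu.name: set(wu.depends_on) for wu in work_units}
    order = list(TopologicalSorter(graph).static_order())

    finish = {}
    for name in order:
        wu = by_name[name]
        # A work unit may start only once all of its dependencies have finished.
        start = max((finish[dep] for dep in wu.depends_on), default=0)
        finish[name] = start + wu.expected_minutes
    return order, finish


# Hypothetical sample data.
units = [
    WorkUnit("extract", 10),
    WorkUnit("validate", 5, ("extract",)),
    WorkUnit("settle", 20, ("validate",)),
    WorkUnit("report", 15, ("extract",)),
]
order, finish = build_plan(units)
```

In this sketch, a cyclic dependency causes `TopologicalSorter` to raise `graphlib.CycleError`, which corresponds to a set of constraints that cannot be arranged into any flow of execution.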
The workload plan is aimed at achieving one or more desired targets. For example, these targets comprise the completion of specific work units within corresponding (completion) deadlines; typical examples are work units required for business activities or subject to Service Level Agreements (SLAs), such as those relating to the daily settlement of payments in financial applications. Therefore, any (execution) issue affecting the work units that causes some targets of the workload plan to be missed (for example, a significant delay in the execution of a work unit required to meet the targets) may have quite serious consequences (for example, business outages or payment of penalties).
However, the management of the workload plan (to monitor the execution of the work units and to intervene in an attempt to solve any execution issue relating thereto) is quite difficult. Indeed, the work units generally form complex workload networks defined by their dependencies, shared execution resources and/or common targets.
Therefore, manual analyses of the execution of the work units (for example, by a system administrator) may be ineffective in determining the actual impact of any execution issue on the whole workload plan and in determining possible solutions that might be applied to avoid missing the corresponding targets.
Statistical analyses may also be exploited to facilitate the management of the workload plan. Particularly, the statistical analyses are used to forecast the impact of each execution issue and of each solution on the whole workload plan according to corresponding historical information. However, the statistical analyses are detached from the real-time condition of the computing system in which the workload plan is running.
In any case, the execution issues are managed centrally by a scheduling server, which schedules the execution of the work units onto corresponding execution servers. For example, the scheduling server may provision additional execution servers (such as by allocating corresponding new virtual machines) when no execution server with the required characteristics is available to execute pending work units.
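Purely by way of illustration, the centralized provisioning described above may be sketched as follows. This is a minimal sketch under stated assumptions and not the implementation of any specific scheduling server: the names `ExecutionServer`, `SchedulingServer`, `dispatch` and `_provision` are hypothetical, the required characteristics are reduced to a number of CPUs, and the allocation of a new virtual machine is replaced by the creation of an in-memory object.

```python
from dataclasses import dataclass, field


@dataclass
class ExecutionServer:
    """Hypothetical execution server, characterized only by its CPU count."""
    name: str
    cpus: int
    assigned: list = field(default_factory=list)


class SchedulingServer:
    """Hypothetical central scheduler that dispatches work units to servers."""

    def __init__(self, servers):
        self.servers = list(servers)
        self.provisioned = 0  # number of servers added on demand

    def _provision(self, min_cpus):
        # Assumption: stands in for allocating a new virtual machine.
        self.provisioned += 1
        server = ExecutionServer(f"vm-{self.provisioned}", min_cpus)
        self.servers.append(server)
        return server

    def dispatch(self, work_unit, min_cpus):
        """Assign the work unit to the first server with enough CPUs,
        provisioning an additional server when none qualifies."""
        for server in self.servers:
            if server.cpus >= min_cpus:
                server.assigned.append(work_unit)
                return server
        server = self._provision(min_cpus)
        server.assigned.append(work_unit)
        return server


# Hypothetical usage: the second work unit needs more CPUs than any
# existing execution server offers, so a new server is provisioned for it.
scheduler = SchedulingServer([ExecutionServer("srv-a", 2)])
first = scheduler.dispatch("job-1", min_cpus=2)
second = scheduler.dispatch("job-2", min_cpus=8)
```

In this sketch every decision is taken by the single `SchedulingServer` object, which mirrors the centralized handling described above; the execution servers play no part in choosing where a work unit runs.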
At most, each execution server may intervene locally to address the execution issues of the corresponding work units. For example, each execution server may add execution resources (such as processing power) to, or reserve them for, one of its work units.
Therefore, the management of the workload plan is generally static and rigid. All of the above significantly reduces the resiliency of the workload management, with the risk of missing the targets of the workload plan (and thus of incurring the above-mentioned consequences).