Job scheduling systems typically provide centralized hardware and software facilities for processing large-scale, distributed tasks. Existing job scheduling systems tend to break a large-scale task down into several component jobs, which can then be executed individually, sequentially, in parallel, or otherwise, depending on a master schedule defined for the overall task. For example, a job scheduling system may be used to schedule a task that relates to booking a travel plan. In such an example, the job scheduling system may perform the task according to a schedule of component jobs, which may include booking a flight, booking a hotel, and booking a rental car, among other things. Depending on the nature of a given job scheduling system, each of the component jobs in a task schedule may be managed using different machines connected to the job scheduling system through one or more networks. Because a task schedule often includes many different component jobs, which may be managed using many different machines or systems distributed in any number of ways, external events can often cause one or more of the jobs in the schedule to fail, potentially with significant impact on the results of the entire task schedule. For example, after all of the bookings have been made for the travel plan described above, a change to just one of the bookings may require new bookings for some or all of the remaining components (e.g., a cancellation of the flight booking may require not only a new flight booking, but also new bookings for the hotel, rental car, or other components of the travel plan).
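The way a change to one booking can propagate to the others may be illustrated with a minimal sketch. The job names, the dependency map, and the helper `jobs_to_redo` are hypothetical and chosen only to mirror the travel plan example; they do not describe any particular job scheduling system.

```python
from collections import deque

# Hypothetical dependency map for the travel plan example. Each job lists
# the jobs whose results it relies on. Treating the hotel and rental car
# bookings as dependent on the flight booking is an illustrative assumption.
DEPENDS_ON = {
    "book_flight": [],
    "book_hotel": ["book_flight"],
    "book_rental_car": ["book_flight"],
}


def jobs_to_redo(changed_job, depends_on):
    """Return the changed job plus every job that directly or
    indirectly depends on it, in breadth-first order."""
    # Invert the map so each job points to the jobs that depend on it.
    dependents = {job: [] for job in depends_on}
    for job, prerequisites in depends_on.items():
        for prerequisite in prerequisites:
            dependents[prerequisite].append(job)

    affected = [changed_job]
    seen = {changed_job}
    queue = deque([changed_job])
    while queue:
        current = queue.popleft()
        for job in dependents[current]:
            if job not in seen:
                seen.add(job)
                affected.append(job)
                queue.append(job)
    return affected
```

Under these assumptions, `jobs_to_redo("book_flight", DEPENDS_ON)` returns all three jobs, whereas `jobs_to_redo("book_hotel", DEPENDS_ON)` returns only the hotel booking.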
However, existing job scheduling systems do not adequately process dynamic changes or handle error recovery in cases such as the exemplary scenario described above. Rather, existing job scheduling systems tend to have only rudimentary, hard-coded recovery techniques for dealing with such failures. For example, traditional event-driven job scheduling systems (e.g., enterprise job scheduling systems, business process execution systems, workflow execution systems, etc.) tend to control and monitor the execution of large-scale tasks by scheduling component jobs in response to the occurrence of various events (e.g., job completion events, changes in an immediate scheduling environment, system events, etc.). Error recovery routines in these types of systems are typically embedded within process descriptions at design time, and the routines are then invoked as necessary at execution time in order to automate error recovery processes. In many cases, however, it is extremely difficult to anticipate every possible source of failure at design time. Moreover, correctly identifying appropriate recovery measures in existing job scheduling systems requires detailed analysis of log information at execution time. Because existing job scheduling systems do not easily automate such analysis, error recovery may remain a task better suited to human experts.
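The limitation of recovery routines fixed at design time may be illustrated with a minimal sketch. The table `RECOVERY_ROUTINES`, the error labels, the routine names, and the function `recover` are all hypothetical; they stand in for whatever hard-coded mechanism a given system embeds in its process descriptions.

```python
# Hypothetical recovery table fixed at design time. Each entry maps a
# (job, error) pair that the designer anticipated to a named routine.
RECOVERY_ROUTINES = {
    ("book_flight", "timeout"): "retry_booking",
    ("book_hotel", "no_vacancy"): "try_alternate_hotel",
}


def recover(job, error):
    """Look up the design-time routine for a failure. Failures that were
    not anticipated have no entry and fall through to a human operator."""
    routine = RECOVERY_ROUTINES.get((job, error))
    if routine is None:
        # No automated measure exists for this failure.
        return "escalate_to_operator"
    return routine
```

In this sketch, `recover("book_flight", "timeout")` selects an automated routine, while an unanticipated failure such as `recover("book_flight", "flight_cancelled")` can only be escalated.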
Accordingly, for at least the reasons given above, existing job scheduling systems suffer from various problems and drawbacks, including the inability to automate recovery when one or more jobs in a task schedule fail.