Job scheduling systems provide a centralized hardware and software system for processing large-scale tasks. Typically, a large-scale task is broken down into several smaller tasks, which are executed individually, either sequentially or concurrently, according to a master task schedule. For example, a large company may use a job scheduling system to process its payroll. The payroll task may be broken down into the following processes: (i) access the company database for a list of the employees; (ii) execute a payroll program to identify the salary payments to be made; (iii) execute a deposit program to make electronic deposits of the salary payments to the bank accounts of the employees; and (iv) execute a report generator to print and send pay stubs to the employees. The payroll task may be automated to occur at regular intervals.
A job scheduling system utilizes a series of agents, generally operating on computers, to perform the smaller tasks. A workload manager controls each of the agents. The manager is connected to the agents by a communication network configured in a “star” pattern, with the workload manager at the center and each of the agents on a ray of the star.
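By way of illustration only, the payroll example and the “star” arrangement may be sketched as follows. The class names, function names, and sample data (Agent, WorkloadManager, run_schedule, the two employees, and the fixed salary) are hypothetical and are not taken from any particular job scheduling product. Every step is dispatched through the single manager object, which stands in for the center of the star.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Agent:
    """A worker that performs one of the smaller tasks (hypothetical model)."""
    name: str
    handler: Callable[[dict], dict]

    def run(self, context: dict) -> dict:
        return self.handler(context)


@dataclass
class WorkloadManager:
    """Center of the star: agents are reached only through the manager."""
    agents: Dict[str, Agent] = field(default_factory=dict)

    def register(self, agent: Agent) -> None:
        self.agents[agent.name] = agent

    def run_schedule(self, schedule: List[str], context: dict) -> dict:
        # Execute the steps in the order given by the master task schedule.
        for step in schedule:
            context = self.agents[step].run(context)
        return context


# The four payroll processes, with made-up data for illustration.
def list_employees(ctx: dict) -> dict:
    return {**ctx, "employees": ["alice", "bob"]}


def compute_pay(ctx: dict) -> dict:
    return {**ctx, "payments": {e: 1000 for e in ctx["employees"]}}


def deposit(ctx: dict) -> dict:
    return {**ctx, "deposited": sorted(ctx["payments"])}


def report(ctx: dict) -> dict:
    stubs = [f"{e}: {amt}" for e, amt in ctx["payments"].items()]
    return {**ctx, "stubs": stubs}


manager = WorkloadManager()
for fn in (list_employees, compute_pay, deposit, report):
    manager.register(Agent(fn.__name__, fn))

PAYROLL_SCHEDULE = ["list_employees", "compute_pay", "deposit", "report"]
result = manager.run_schedule(PAYROLL_SCHEDULE, {})
```

In this sketch the schedule is strictly sequential; a real system might also run independent steps concurrently. Because every call passes through the manager, the sketch also shows why the manager becomes a point of contention as the number of agents grows.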
In a network-based system, agents operate on separate computers, and each of the computers communicates with a central computer running the workload manager. The Internet Protocol (IP) is a commonly used communication protocol. The workload manager needs to track the status and job completion of each agent. When an agent has a fault, e.g. its communication link is broken, the workload manager must be able to recognize the fault and take corrective action, if possible. For example, upon detection of a fault in an agent, a backup agent on a different computer may be brought in to take the place of the faulty agent. In large systems having many tasks, it is a non-trivial exercise for the workload manager to track and manage the operation of all of the agents.
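One common way to recognize such a fault, offered here only as an illustrative sketch, is for each agent to report periodically to the manager and for the manager to treat a prolonged silence as a fault. The names AgentTracker, heartbeat, and resolve, and the 30-second timeout, are assumptions made for this example and do not describe any specific system. Time is passed in explicitly so that the behavior is deterministic.

```python
from dataclasses import dataclass
from typing import Dict, Optional

HEARTBEAT_TIMEOUT = 30.0  # seconds; hypothetical threshold


@dataclass
class AgentRecord:
    address: str                  # last known IP address of the agent
    last_heartbeat: float         # time of the most recent report
    backup: Optional[str] = None  # name of a standby agent, if any


class AgentTracker:
    """Manager-side bookkeeping of agent status (illustrative only)."""

    def __init__(self) -> None:
        self.records: Dict[str, AgentRecord] = {}

    def heartbeat(self, name: str, address: str, now: float,
                  backup: Optional[str] = None) -> None:
        rec = self.records.get(name)
        if rec is None:
            self.records[name] = AgentRecord(address, now, backup)
        else:
            # An agent that has moved reports its new address here.
            rec.address = address
            rec.last_heartbeat = now

    def is_faulty(self, name: str, now: float) -> bool:
        return now - self.records[name].last_heartbeat > HEARTBEAT_TIMEOUT

    def resolve(self, name: str, now: float) -> Optional[str]:
        """Return the agent that should receive work, or None if none can."""
        if not self.is_faulty(name, now):
            return name
        backup = self.records[name].backup
        if (backup is not None and backup in self.records
                and not self.is_faulty(backup, now)):
            return backup
        return None


tracker = AgentTracker()
tracker.heartbeat("primary", "10.0.0.5", now=0.0, backup="standby")
tracker.heartbeat("standby", "10.0.0.6", now=40.0)
```

With these reports, tracker.resolve("primary", now=10.0) returns "primary", while tracker.resolve("primary", now=50.0) returns "standby", since the primary agent has then been silent for longer than the timeout. The manager must repeat this check for every agent, which illustrates why the tracking burden grows with the size of the system.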
With a “star” network configuration, if the agents are allowed to move, accurate data about the IP addresses of the agents must be maintained. In practice, maintaining this data is a labor-intensive task.
Failover of agents (i.e. providing backup for agents) is difficult to achieve because communication is restricted to a link between the workload manager and one instance of an agent. Further, it is difficult to maintain “shadow” agents and to re-assign schedules to different agents after a schedule has been created.
Under heavy load conditions, the workload manager may be overloaded with events. If it cannot process job events as they arrive, the events are queued and their processing is delayed. This reduces overall productivity and the utilization of enterprise tools.
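The effect of such an overload can be illustrated with a simple model in which a fixed number of events arrive in each time interval and the manager can process only a fixed number per interval. The function name simulate_backlog and the figures used below are hypothetical and are chosen only to show the trend.

```python
from collections import deque
from typing import List


def simulate_backlog(arrivals_per_tick: int, capacity_per_tick: int,
                     ticks: int) -> List[int]:
    """Return the number of queued events remaining after each tick."""
    queue: deque = deque()
    backlog: List[int] = []
    next_id = 0
    for _ in range(ticks):
        # New job events arrive at the manager.
        for _ in range(arrivals_per_tick):
            queue.append(next_id)
            next_id += 1
        # The manager processes as many events as its capacity allows.
        for _ in range(min(capacity_per_tick, len(queue))):
            queue.popleft()
        backlog.append(len(queue))
    return backlog


overloaded = simulate_backlog(arrivals_per_tick=120, capacity_per_tick=100, ticks=5)
within_capacity = simulate_backlog(arrivals_per_tick=80, capacity_per_tick=100, ticks=3)
```

Under these assumed figures, the backlog of the overloaded manager grows by 20 events per interval (20, 40, 60, 80, 100), whereas the backlog stays at zero when arrivals remain within capacity. Each queued event in this model waits longer than the one before it, which corresponds to the processing delay described above.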
Also, the “star” architecture creates a performance bottleneck and a single point of failure. If the manager is down, no workload can be executed at all, and jobs scheduled to run at the time of the failure are delayed.
There is a need for a system and method which address these deficiencies in the prior art.