1. Field of the Invention
The present invention relates generally to intelligent software systems. More particularly, the present invention relates to computer systems for coordinating, distributing, and managing other software programs on a network.
2. Description of the Related Art
The computer and computer software fields are experiencing a great explosion in technological growth. The rapid generation of increasingly complex computer technology can be seen as both a boon and a bane. For instance, increasingly powerful computers and the highly complex computer programs that operate thereon provide benefits on a scale previously unseen. Computer operators are now provided with tools that achieve tasks in a fraction of the time previously required, if indeed those tasks could previously have been performed at all.
Nevertheless, this increasing sophistication comes at a price. For instance, the increasingly sophisticated computer programs now available require extensive specialized user training and familiarization before they provide productivity gains. Additionally, installing, maintaining, and using such programs is becoming an increasingly daunting task.
Prior art computer operating systems are provided with a job management capability that allows users to run more jobs in less time by matching the jobs' processing needs with the available resources. The prior art systems are configured to schedule jobs, and provide functions for building, submitting, and processing jobs quickly and efficiently in a dynamic environment.
A network job management and job scheduling system is a software program that schedules and manages jobs that a user submits to one or more machines under its control. Such a system accepts the jobs that users have submitted and reviews the job requirements. The machines under the control of the system are then evaluated, and the machine best suited to run each job is chosen. The system executes each step of a job on a machine that has enough resources to support executing and checkpointing that job step.
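By way of illustration only, the machine-selection step described above might be sketched as follows. The `Machine` and `Job` types, their fields, and the best-fit rule (choose the eligible machine with the least leftover capacity) are assumptions introduced for this example; the prior art systems discussed here are not stated to use this particular rule.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Machine:
    # Hypothetical description of a machine under the scheduler's control.
    name: str
    free_cpus: int
    free_mem_mb: int


@dataclass
class Job:
    # Hypothetical description of a submitted job's requirements.
    name: str
    cpus: int
    mem_mb: int


def select_machine(job: Job, machines: List[Machine]) -> Optional[Machine]:
    """Return the eligible machine with the least spare capacity, or None."""
    # A machine is eligible only if it can satisfy every stated requirement.
    eligible = [
        m for m in machines
        if m.free_cpus >= job.cpus and m.free_mem_mb >= job.mem_mb
    ]
    if not eligible:
        return None
    # Best fit (an assumed policy): minimize leftover CPUs, then leftover memory.
    return min(
        eligible,
        key=lambda m: (m.free_cpus - job.cpus, m.free_mem_mb - job.mem_mb),
    )


machines = [
    Machine("a", free_cpus=2, free_mem_mb=1024),
    Machine("b", free_cpus=8, free_mem_mb=16384),
    Machine("c", free_cpus=4, free_mem_mb=4096),
]
chosen = select_machine(Job("render", cpus=4, mem_mb=2048), machines)
print(chosen.name if chosen else "no suitable machine")  # prints "c"
```

Other policies, such as choosing the least loaded machine, would fit the same description equally well; best fit is used here only because it is simple to state.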
Prior art operating systems are also configured to accept submission of batch jobs for scheduling. Batch jobs run in the background and generally do not require any input from the user. These batch jobs are typically classified as either serial or parallel. A serial job runs on a single machine, while a parallel job is designed to execute as a number of individual, but related, processes on one or more of the system's nodes. When executed, these related processes can communicate with each other through message passing or shared memory to exchange data or synchronize their execution.
Once a machine with suitable resources has been selected, the job is dispatched to the appropriate machine. Prior art systems are configured with queues. In this description, a job queue refers to a list of jobs that are waiting to be processed. When a job is submitted by a user, the job is entered into an internal database, which resides on one of the machines, until it is ready to be dispatched to run on another machine.
Once a job has been dispatched to a machine, the job is executed on that machine. A job can be dispatched to either one machine or, in the case of parallel jobs, to multiple machines. In many prior art systems, jobs are not necessarily dispatched to machines on a first-come, first-served basis. Instead, the requirements of the job, the characteristics of the job, and the availability of machines are examined, and the system then determines the best time for the job to be dispatched.
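A minimal sketch of dispatching that departs from strict first-come, first-served order is given below. It assumes, purely for illustration, that each pending job states only a CPU requirement and that the dispatcher picks the earliest-submitted job that fits on some machine right now; the function name `next_dispatch` and the data layout are hypothetical.

```python
from typing import Dict, List, Optional, Tuple


def next_dispatch(
    pending: List[Tuple[str, int]],
    free_cpus_by_machine: Dict[str, int],
) -> Optional[Tuple[str, str]]:
    """Pick the next (job, machine) pair to dispatch, or None if nothing fits.

    `pending` lists (job_name, cpus_needed) in submission order.
    """
    for job_name, cpus_needed in pending:
        # Machines are scanned in name order so the result is deterministic.
        for machine, free_cpus in sorted(free_cpus_by_machine.items()):
            if free_cpus >= cpus_needed:
                return job_name, machine
    return None


pending = [("big-simulation", 16), ("small-report", 2)]
free = {"m1": 4, "m2": 8}
# The earlier job cannot run anywhere yet, so the later job goes first.
print(next_dispatch(pending, free))  # prints ('small-report', 'm1')
```

In this sketch the larger job simply stays in the queue until a machine with enough free capacity becomes available.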
Computer operating systems include several different operational modules. One such module is a software module responsible for coordinating, distributing, and managing job requests being run on a network. Although this type of module may have different names depending upon the operating system in which it is contained, the term “agent” shall be used herein to refer to such a job-dispatch system. However, problems arise if the computer station in which the agent resides requires maintenance or other downtime. Currently, job requests are either terminated or left incomplete if maintenance or other downtime occurs on the system containing the agent.
Often, a system administrator will submit highly complex computational tasks which require certain hardware to be present in order to be completed. If an agent does not possess the required hardware, the task may be terminated, completed incorrectly, or lost. This may also cause the machine hosting an agent to experience downtime and require maintenance. Any job requests currently located on an agent which experiences downtime or requires maintenance are either terminated or left incomplete as explained previously.
Computer systems remaining in the network which do not necessarily host an agent are considered clients. These clients are responsible for receiving the job requests from the agent, executing the job requests, and returning the result of the job request to the agent. Often, the client selected to complete the requested job is chosen automatically.
Consider the situation in which a system administrator submits a job request to be carried out by a client station within a network. The agent receives the job request and then selects the client to which the job request is submitted. The selected client may already have numerous job requests waiting in its queue. The newly submitted request will remain on the client's queue until the previously submitted requests are completed. This poses a serious problem if the newly submitted job request is of high importance and needs to be completed immediately.
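The delay described above can be illustrated with a short sketch of a strictly first-in, first-out client queue. The job names and run times are invented for the example, and the helper `wait_before_start` is hypothetical.

```python
from collections import deque
from typing import Deque, Tuple


def wait_before_start(queue: Deque[Tuple[str, int]], job_name: str) -> int:
    """Return the total run time (in minutes) of all jobs ahead of `job_name`."""
    waited = 0
    for name, run_minutes in queue:
        if name == job_name:
            return waited
        waited += run_minutes
    raise KeyError(job_name)


# Jobs already waiting on the selected client (illustrative values).
client_queue: Deque[Tuple[str, int]] = deque([("backup", 30), ("report", 45)])

# A high-importance request arrives and is appended behind the existing work.
client_queue.append(("urgent-fix", 5))

print(wait_before_start(client_queue, "urgent-fix"))  # prints 75
```

Although the urgent request needs only five minutes of processing in this example, it cannot begin until the seventy-five minutes of earlier work has finished.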
A similar problem may arise if the job request submitted to the selected client is too complex for the client to execute. For instance, a job may be too complex if it entails computation that would be prohibitively expensive or slow to finish, or if it requires resources, such as RAM or disk space, which exceed the resources installed on the client. This may cause a client to become overloaded and terminate and/or lose other job requests located in the client's queue.
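As a hedged illustration of the resource mismatch described above, the following sketch compares a job's stated requirements against what a client has installed and reports which resources fall short. The resource keys (`ram_mb`, `disk_mb`) and the function `exceeded_resources` are assumptions made for this example only.

```python
from typing import Dict, List


def exceeded_resources(
    required: Dict[str, int], installed: Dict[str, int]
) -> List[str]:
    """Return the names of resources the job needs more of than the client has."""
    # A resource missing from `installed` is treated as zero capacity.
    return sorted(
        name for name, need in required.items()
        if need > installed.get(name, 0)
    )


job_needs = {"ram_mb": 32768, "disk_mb": 1000}
client_has = {"ram_mb": 16384, "disk_mb": 500000}
print(exceeded_resources(job_needs, client_has))  # prints ['ram_mb']
```

A check of this kind, performed before dispatch, would allow a request to be rejected or redirected rather than overloading the client.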
A disadvantage of current software control agents is their generally limited capability for managing and monitoring the state of current job requests. Furthermore, current software control agents provide no mechanism for manual or automatic relocation of an entire agent and job request. This lack of mobility often results in incomplete and/or unsuccessful job request completion.