This invention broadly relates to data processing techniques, and more particularly, to a technique for job scheduling within a distributed processing system having one or more subsystems with scheduled or unscheduled downtime.
Downtime for maintenance of a subsystem is usually a necessary part of any computer system operation. A job scheduling system typically oversees and is an integral part of this downtime maintenance. Conventionally, shutting down the job scheduling system prevents further jobs from being started while the system is undergoing maintenance. One problem, however, which is particularly relevant in a distributed processing system, is that jobs might have already been started whose execution will extend into a scheduled downtime of a needed subsystem. These jobs might start, according to normal schedule or operation, hours before the subsystem's downtime begins. However, once the downtime begins, any remaining jobs would normally be canceled by a system administrator.
Thus, an automated processing technique is needed for a job scheduler to determine whether a job to execute within a parallel processing system should be commenced notwithstanding a scheduled or unscheduled downtime of one or more subsystems required by the job.
One approach to downtime job protection is to simply prevent jobs from starting as the jobs approach the downtime. This concept, however, has several disadvantages. First, certain users, i.e., those with application level checkpointing in their parallel job, will want to run their job even if the tail end of the code is scheduled to finish within the downtime. With application level checkpointing, running for a few hours before a scheduled downtime provides the user with an opportunity to obtain earlier results since the checkpointing allows restarting of the job essentially where it left off. Another disadvantage of simply preventing jobs from starting as they approach the downtime is that different subsystems, such as a batch scheduler, parallel file system and high performance storage system, within a distributed processing system may experience independent downtimes. Simply waiting for a downtime of one of these subsystems would not protect the jobs from downtimes in the other subsystems. In view of these disadvantages, a different approach to downtime protection than simply preventing all jobs from starting is needed and is provided by the present invention.
Briefly summarized, the invention comprises in one aspect a method for processing jobs within a distributed processing system. The method includes: determining that a subsystem of the distributed processing system has a downtime; determining at least one of a start time and an end time of a job to be executed using the subsystem; determining whether the start time or the end time of the job is within the scheduled downtime, and if not, placing the job in an eligible job list; and making a decision whether to start the job using the eligible job list.
In another aspect, the invention comprises a system for processing jobs within a distributed processing system. This system includes a scheduler module for controlling scheduling of a job for execution within the distributed processing system. The scheduler module includes computer code for: determining a downtime for a subsystem of the distributed processing system; determining at least one of a start time and an end time of a job to be executed using the subsystem; determining whether the start time or the end time of the job is within the downtime, and if not, placing the job in an eligible job list; and making a decision whether to start the job using the eligible job list.
In a further aspect, an article of manufacture is provided which includes a computer program product comprising a computer usable medium having computer readable program code means therein for use in processing jobs within a distributed processing system. The computer readable program code means in the computer program product includes: computer readable program code means for causing a computer to effect determining a downtime for a subsystem of the distributed processing system; computer readable program code means for causing a computer to effect determining at least one of a start time and an end time for a job to be executed using the subsystem; computer readable program code means for causing a computer to effect determining whether the start time or the end time of the job is within the downtime, and if not, for placing the job in an eligible job list; and computer readable program code means for causing a computer to effect making a decision whether to start the job using the eligible job list.
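By way of illustration only, the determining and placing steps recited in the aspects above can be sketched in Python as follows. The names Downtime, Job, endpoint_in_downtime and build_eligible_list are hypothetical and do not appear in this description, and the handling of the window boundaries (a job ending exactly when the downtime starts is treated as outside it) is an assumption. The sketch tests only whether the start time or the end time falls within the downtime, as recited; a job whose run would span the entire downtime is not caught by that test, and an interval overlap test would be one possible variation.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Downtime:
    """Hypothetical record of one downtime window for a subsystem."""
    start: datetime
    end: datetime


@dataclass(frozen=True)
class Job:
    """Hypothetical job record; the end time is derived from a run-time estimate."""
    name: str
    start: datetime
    duration: timedelta

    @property
    def end(self) -> datetime:
        return self.start + self.duration


def endpoint_in_downtime(job: Job, downtime: Downtime) -> bool:
    """Return True if the job's start time or end time is within the downtime.

    Assumption: a job that ends exactly when the downtime starts, or starts
    exactly when the downtime ends, is treated as outside the downtime.
    """
    start_inside = downtime.start <= job.start < downtime.end
    end_inside = downtime.start < job.end <= downtime.end
    return start_inside or end_inside


def build_eligible_list(jobs, downtime):
    """Place each job whose start and end both avoid the downtime in the eligible list."""
    return [job for job in jobs if not endpoint_in_downtime(job, downtime)]


# Example: a downtime from 12:00 to 16:00 on an arbitrary day.
day = datetime(2024, 1, 1)
maintenance = Downtime(day.replace(hour=12), day.replace(hour=16))
queue = [
    Job("finishes_early", day.replace(hour=8), timedelta(hours=2)),
    Job("runs_into_downtime", day.replace(hour=10), timedelta(hours=3)),
    Job("starts_in_downtime", day.replace(hour=13), timedelta(hours=1)),
    Job("after_downtime", day.replace(hour=17), timedelta(hours=1)),
]
eligible_names = [job.name for job in build_eligible_list(queue, maintenance)]
```

In this sketch the decision whether to start a job would then be made only from the returned list, so a job that is excluded is simply not considered until a later scheduling pass.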
To restate, using the components of the present invention, different downtimes can be designated for different subsystems of a distributed processing system, at the discretion of a system administrator. As downtimes approach, jobs begin to be excluded from scheduling consideration since, if started, they will run into the downtime of a needed subsystem. As the downtime starts, the only jobs still running on the system are those for which the downtime is unimportant, either because they are taking advantage of application level checkpointing or because they don't use the subsystem which is being stopped. Thus, jobs no longer have to be terminated once a subsystem downtime begins only to be restarted after the downtime ends. Users can elect whether to use the time prior to a scheduled downtime to initiate a job, and the existence of a downtime is no longer an all-or-nothing event. Individual subsystems can be selected for downtimes leaving the remaining system resources available for jobs which can put them to use. Site-specific downtime protection for local subsystems can also be implemented.
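As a further non-limiting illustration, the per-subsystem behavior restated above might be sketched in Python as follows. All names here (Window, JobRequest, is_eligible, the ignore_downtime flag, and the subsystem labels "parallel_fs" and "storage") are hypothetical and chosen only for the example. The sketch assumes that each job declares the subsystems it needs and that a user whose job employs application level checkpointing can elect to disregard downtimes.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Window:
    """Hypothetical downtime window for a single subsystem."""
    start: datetime
    end: datetime


@dataclass(frozen=True)
class JobRequest:
    """Hypothetical job request naming the subsystems the job needs."""
    name: str
    start: datetime
    duration: timedelta
    subsystems: frozenset
    # Assumption: set by a user whose job checkpoints at the application level.
    ignore_downtime: bool = False

    @property
    def end(self) -> datetime:
        return self.start + self.duration


def runs_into(job: JobRequest, window: Window) -> bool:
    """True if the job's start time or end time is within the window."""
    return (window.start <= job.start < window.end) or (
        window.start < job.end <= window.end
    )


def is_eligible(job: JobRequest, downtimes: dict) -> bool:
    """Check the job only against downtimes of the subsystems it needs.

    `downtimes` maps a subsystem name to a list of Window objects, so each
    subsystem can have its own independently designated downtimes.
    """
    if job.ignore_downtime:
        return True
    for subsystem in job.subsystems:
        for window in downtimes.get(subsystem, []):
            if runs_into(job, window):
                return False
    return True


# Example: only the (hypothetical) parallel file system is down from 12:00 to 16:00.
day = datetime(2024, 1, 1)
downtimes = {"parallel_fs": [Window(day.replace(hour=12), day.replace(hour=16))]}

needs_fs = JobRequest("needs_fs", day.replace(hour=10), timedelta(hours=4),
                      frozenset({"parallel_fs"}))
other_subsystem = JobRequest("other_subsystem", day.replace(hour=10), timedelta(hours=4),
                             frozenset({"storage"}))
checkpointed = JobRequest("checkpointed", day.replace(hour=10), timedelta(hours=4),
                          frozenset({"parallel_fs"}), ignore_downtime=True)

results = {job.name: is_eligible(job, downtimes)
           for job in (needs_fs, other_subsystem, checkpointed)}
```

Keeping the downtimes in a mapping keyed by subsystem is one simple way to let a site add its own local subsystems without changing the eligibility check itself.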