The present invention relates to the electrical, electronic and computer arts, and, more particularly, to improvements in scheduling high performance computing (HPC) jobs.
In the context of HPC workloads, user jobs frequently fail during execution, often after waiting a long time in queues. The reasons for failure include hardware failures, software errors, and human errors. Software and/or human errors can include a typographical error in input parameters, improperly installed libraries, and/or missing input files. Such software or human errors can contribute significantly to job failure.
It can be frustrating for a user to wait a long time in a queue for her job to start, only to have it fail. That user's time is wasted, as she needs to verify and/or correct the errors and then wait in the queue all over again to restart her job. Other users may have difficulty planning their jobs because the extra load in the cluster results in longer wait times for all users. The computing center owner incurs increased costs (e.g., energy and salary) and suffers reduced user productivity.
Sanity checks on user jobs may help avoid such errors and increase the throughput of successfully executed jobs. However, sanity checks can be expensive in HPC environments. Thus, running detailed sanity checks on all jobs is not cost effective in HPC settings.