It is fairly common for complex scientific applications, such as life science applications, to run jobs for very long periods on supercomputers or in clustered environments. A single job may take days or even weeks to execute fully. It is therefore very costly if such an application instance or job fails after running for days because the computing environment lacks adequate system resources. The shortfall may be in hard drive space, memory, CPU cycles, or a variety of other resources.
Jobs and application commands are commonly executed on high-performance computer systems on the assumption that the system is capable of fully processing the jobs and available to do so. Users and systems typically do not check, before executing a job, that the job will have adequate resources to finish. Further, once a job has started, there is no way to be alerted to decreasing system resources that will affect the job. Techniques are needed to predict job failure and thereby save valuable time, cost, and resources. Techniques are also needed to implement real-time changes in the computing environment to prevent long-running jobs from failing.
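As an illustration only, the kind of pre-execution check that is described above as typically missing might be sketched as follows. This is a minimal sketch in Python that uses only the standard library. The function name `preflight_check` and its parameters are hypothetical and are not drawn from any particular scheduler or system. It examines only free disk space and processor count, whereas a practical check would also need to cover memory and other resources.

```python
import os
import shutil


def preflight_check(work_dir, required_disk_bytes, required_cpus):
    """Return a list of resource shortfalls; an empty list means the check passed.

    Hypothetical helper: the name and parameters are assumptions for illustration.
    """
    shortfalls = []

    # Free space on the filesystem that holds work_dir.
    free_bytes = shutil.disk_usage(work_dir).free
    if free_bytes < required_disk_bytes:
        shortfalls.append(
            f"disk: need {required_disk_bytes} bytes, only {free_bytes} free"
        )

    # os.cpu_count() may return None when the count cannot be determined,
    # in which case this sketch conservatively assumes a single processor.
    available_cpus = os.cpu_count() or 1
    if available_cpus < required_cpus:
        shortfalls.append(
            f"cpu: need {required_cpus} cores, only {available_cpus} present"
        )

    return shortfalls
```

A caller might invoke `preflight_check("/scratch/job42", 500 * 10**9, 16)` before submitting a job and decline to start it if the returned list is not empty. The path and the figures here are made up for the example. A check of this kind covers only the moment before launch and does not address resources that decline while the job is running.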