1. Field of the Invention
The present invention relates generally to risk management in computer clusters. More particularly, aspects of the invention relate to containing job de-scheduling risks within a target bound by creating backup tasks for heterogeneous tasks with similar resource requirements.
2. Description of Related Art
The computing power required by applications has been increasing at a tremendous rate. By aggregating the power of widely distributed resources, computer clusters permit organizations to boost their processing power through linked computers and a collection of shared resources such as computing nodes, processors, memory, databases, network bandwidth, and I/O devices. Cluster management infrastructure, such as resource managers and job schedulers, allocates resources to heterogeneous application jobs/tasks and schedules them to run in series or in parallel on different machines.
In a distributed computing environment, jobs or tasks that have already been scheduled may fail to execute for various reasons, such as network failure, machine crash, power failure, overloaded resource conditions, distrusted security policy, competing tasks or jobs, or other incidents that make scheduled and/or required resources unavailable.