Executing critical or real-time scientific applications, such as seismic data processing, three-dimensional reservoir uncertainty modeling, and simulations, on distributed computing systems (e.g., homogeneous systems such as clusters and heterogeneous systems such as grids and clouds) with thousands of application processes requires high-end computing power and can take days or weeks of data processing to generate a desired solution. The success of a long-running job depends on the reliability of the system. Because most scientific applications deployed on supercomputers can fail if even one of their processes fails, fault tolerance is an important feature of distributed systems in complex computing environments. Reactively tolerating a computer processing failure typically involves choosing whether to periodically checkpoint the state of one or more processes, an effective technique that is widely applicable in high-performance computing environments. However, this technique carries overhead associated with selecting optimal checkpoint intervals and stable storage locations for checkpoint data. Additionally, current failure recovery models are typically limited to a few types of computing failures and must be invoked manually when a failure occurs, which limits their usefulness and efficiency.
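To make the checkpoint-interval trade-off concrete, the following is a minimal sketch, not a method described here. It uses Young's classic first-order approximation, in which the interval is the square root of twice the checkpoint cost times the mean time between failures (MTBF). The function names, the 60-second checkpoint cost, the per-process MTBF, and the assumption that processes fail independently at a constant rate are all illustrative assumptions.

```python
import math


def system_mtbf(process_mtbf_s: float, num_processes: int) -> float:
    """MTBF of a job that fails when any one of its processes fails.

    Assumption (illustrative): processes fail independently with a
    constant failure rate, so the rates add and the MTBF divides.
    """
    if process_mtbf_s <= 0 or num_processes <= 0:
        raise ValueError("MTBF and process count must be positive")
    return process_mtbf_s / num_processes


def young_interval(checkpoint_cost_s: float, mtbf_s: float) -> float:
    """Young's first-order approximation of the checkpoint interval.

    interval = sqrt(2 * checkpoint_cost * MTBF), reasonable when the
    checkpoint cost is small relative to the MTBF.
    """
    if checkpoint_cost_s <= 0 or mtbf_s <= 0:
        raise ValueError("checkpoint cost and MTBF must be positive")
    return math.sqrt(2.0 * checkpoint_cost_s * mtbf_s)


# Hypothetical numbers: 1000 processes, each with an MTBF of 1000 days,
# and a checkpoint that takes 60 seconds to write to stable storage.
SECONDS_PER_DAY = 86_400
job_mtbf = system_mtbf(1000 * SECONDS_PER_DAY, 1000)  # one day, in seconds
interval = young_interval(60.0, job_mtbf)
print(f"job MTBF: {job_mtbf:.0f} s, checkpoint every {interval:.0f} s")
```

With these hypothetical inputs, the job as a whole is expected to fail about once a day even though each process is individually very reliable, and the approximation suggests checkpointing a little less often than once an hour. Checkpointing more often wastes time writing state, while checkpointing less often loses more work per failure.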