As software applications run on ever larger computer systems and perform ever longer computations, it becomes increasingly likely that one or more computer components will fail during a run. Unfortunately, efficiently completing extremely large-scale computations despite component failures remains an unsolved problem. Even applications that include their own recovery mechanisms exhibit excessive failure-free overhead, as well as coordination times that can exceed the mean time to failure. For extremely large-scale computations, the overhead and coordination times associated with failure recovery can become so burdensome that executing the computation is infeasible.
Throughout the drawings, identical reference numbers designate similar, but not necessarily identical, elements.