The present disclosure generally relates to fault-tolerant computing, and more particularly relates to a method and system providing resilient computer programming frameworks that handle failures in the execution of parallel computer programs.
Failures in executing computer programs constitute a significant problem. The problem is compounded in multiprocessor environments, where failure of a single processor can cause an entire computation to fail, requiring the computation to be rerun from scratch.
In recent years, frameworks such as MapReduce (Hadoop is a well-known implementation, http://hadoop.apache.org/), Spark (https://spark.apache.org/) and Pregel (“Pregel: A System for Large-Scale Graph Processing”, Malewicz et al., Proceedings of SIGMOD 2010, http://kowshik.github.io/JPregel/pregel_paper.pdf) have been introduced which provide some degree of resilience to failures. A main drawback of these previous approaches is that they are applicable only to applications which follow certain regular patterns. There are many applications which do not fit within the paradigms of MapReduce or Pregel.
The Message Passing Interface (MPI, http://www.mcs.anl.gov/research/projects/mpi/) has often been used to program parallel computing systems. However, while MPI provides message-passing support, it does not provide a full-fledged programming environment. Instead, it was designed to be used in conjunction with existing programming languages such as C, C++, Fortran, Java, etc.
There is thus a need for more general frameworks which help programmers write resilient programs.