Since the past few decades the computational power of computing devices has increased exponentially. The advancement in hardware, whether in terms of faster, stronger processors or bigger memory, is equally matched by the amount of data to be processed. The amount of data to be processed is so huge at times that a single computer may take days or even years to finish a task.
One of the methods employed to overcome the aforementioned problem is to use parallel computers. Parallel computers are two or more serial computers connected to each other in a particular configuration which communicate with each other for input and output requirements of data to be exchanged between the processes running on them. Using parallel programming, a problem is divided into one or more tasks that can run concurrently. Thereafter these tasks are distributed over a cluster of parallel computers for execution. Once all the tasks have been executed, the results are collated and presented. The use of parallel computing with parallel computers provides a greater memory and Central Processing Unit (CPU) resources to execute a job, thereby reducing the turnaround time for its completion.
Parallel computers broadly follow two types of architectures, i.e., distributed memory and shared memory. In distributed memory architecture, the serial computers have access to their own memory and communicate with other nodes through a communication network.
In shared memory architecture, multiple processing units share a common memory space using a high-speed memory bus.
For a job (i.e., a problem) to run on a parallel computing architecture, it must first be broken down into smaller problems that can run on different processors simultaneously. The smaller problems are referred to as job units. Thus a job consists of two or more job units. A single job unit is a process, which executes concurrently with other job units.
The shared memory architecture does not scale-up well for large jobs i.e., those which comprise of a few tens of job units. On the other hand, the distributed memory architecture on the other hand, does scale well to allow similar large jobs to execute with good performance.
Distributed memory parallel computers use the message passing programming paradigm in which the programmer has to explicitly divide data and work across the processors as well as manage the communication between them.
Message Passing Interface (MPI) implements the “message passing” model of parallel computing. The message-passing model comprises of:                a number of processes running on local data. Each process has its own local variables and does not directly access memory of other processes,        sharing of data between processes takes place by passing messages, i.e. explicit sending and receiving of data between processes.        
The processes may or may not run on different processing machines. The advantage of the above process is that it provides more control over the flow of data, and can be implemented across a variety of platforms. MPI is usually implemented as a library of functions or subroutines that are inserted in the source code to perform data communication between the processes.
Once a job has been divided into two or more units and allocated some hardware resource from the job execution infrastructure, it changes several states till the time of its completion, These states may include for example, “initial”, “loaded”, “running”, “stopped”, “terminated”, etc., The transition between states is handled by Mpirun. Mpirun is a job control program which the user uses to launch their jobs on the execution platform. Each job has a corresponding Mpirun program that controls the state transitions of the job. As stated above, a job has two or more job units that run parallel to each other. At any given time, a job i.e. all the job units belonging to the job have the same state. All the job units for a particular job change the state together when instructed by the job control program.
The advantage of using Mpirun to control parallel jobs is that the mpirun can run and control the job from a remote location. It is possible for Mpirun and a parallel job to run on a physically different hardware. The parallel job usually runs on a much more robust hardware than the Mpirun. For instance, one can run a parallel job on the servers in a supercomputing facility, but the Mpirun that controls this job may run on a standard office computer. Not only is the office computer built of much less robust hardware, and is sensitive to various aspect of problems, such as power interruption, but one is also exposed to the network “hiccups” between the office and the servers on which the job is running on.
As stated before, Mpirun is responsible for the state transitions of its respective job. Consequently Mpirun also holds the present state of the job it controls.
If, for some reason, the connection to the Mpirun program is lost, for example, if Mpirun is killed, there is no way to reconnect to the job, and the infrastructure takes immediate action to terminate the job. This can be a problem for long executing jobs that are terminated after running for a significant amount of time, losing all their work and wasting resources and time, just because their connection to Mpirun is lost.
In addition, Mpirun needs to closely interact with the job execution infrastructure in order to perform job state transitions, which means that different infrastructures use different Mpirun programs customized for their own specific implementation. This tight coupling between Mpirun and the infrastructure does not allow Mpirun to be easily ported and used on other execution infrastructures.
The aforementioned problems make the parallel job execution architecture prone to faults, as an early-terminated job needs to be executed again from start. Various approaches have been taken in the past, in order to stabilize parallel job execution architecture.
It is known in the art of parallel job processing to store information related to the job execution state in a table. Before terminating a job, information regarding its state is stored. The information is used at a later stage to restart the job. However this approach relates to the problem of saving the ‘running’ state of jobs when the jobs are in the middle of execution. The approach does not take care of the logical state of the job.
Re-executing an early-terminated job results in wastage of all the computing resources that it had used until it was terminated. ‘Periodic Checkpointing’ is one solution to reduce such resource wastage. In ‘periodic checkpointing’ an ‘image’ of the job (i.e., snapshot of all job units) is saved to a disk at periodic intervals of time. These images can be later restored, in case the job gets terminated because Mpirun is killed, to restore its state to the last checkpoint, and allow it to continue executing from that point, reducing the waste of resources.
The problem with periodic checkpointing is that with the current capacity of today's parallel computers, a parallel job may be composed of a huge number of processes (job units). In such a scale, the time and amount or storage required to periodically checkpoint such a job makes periodic-checkpointing a practically infeasible option.
Therefore there is a need for a system for fault tolerant execution of parallel jobs, that allows the jobs to continue uninterrupted execution even if their control program i.e., Mpirun gets killed or the connection to Mpirun is lost, in order to minimize the waste of resources and time caused by the early termination of the jobs.