1. Technical Field of the Invention
The present invention generally relates to the field of computer filesystems. More particularly, the present invention is directed to efficiently checkpointing a filesystem on a distributed-memory parallel supercomputer, thereby facilitating faster execution of applications.
2. Description of the Prior Art
In large computing systems, such as a distributed-memory parallel supercomputer, it is standard to save the state of the system at regular intervals so that an application can be rolled back and rerun from the last saved state, thereby saving time and computing resources. This is necessary because large computing systems are less reliable than small computing systems, and the applications that utilize them often run for hours, days or weeks. More particularly, a large computing system may crash or be brought down for maintenance, exceptions may be encountered while an application executes, or programmer-defined conditions may be met that terminate the application. Since the application manipulates disk files, the rollback must restore the manipulated disk files to a previous clean state, i.e., to a previous checkpoint. Checkpointing therefore reduces the cost of rerunning the application, since the application need only be rolled back to the previous checkpoint rather than rerun from the start. Thus, checkpointing of the filesystem is a critical aspect of checkpointing in large computing systems, such as the distributed-memory parallel supercomputer.
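The checkpoint and rollback cycle described above can be illustrated with a minimal sketch. It assumes a naive approach in which a checkpoint is a full copy of the application's working directory; this is for illustration only and is not the method of the present invention. The names `checkpoint`, `rollback`, and `demo`, as well as the file names, are hypothetical.

```python
import shutil
import tempfile
from pathlib import Path


def checkpoint(work_dir: Path, ckpt_dir: Path) -> None:
    """Save a full copy of work_dir as the new checkpoint (naive approach)."""
    if ckpt_dir.exists():
        shutil.rmtree(ckpt_dir)          # discard the older checkpoint
    shutil.copytree(work_dir, ckpt_dir)


def rollback(work_dir: Path, ckpt_dir: Path) -> None:
    """Restore work_dir to the clean state captured by the last checkpoint."""
    shutil.rmtree(work_dir)              # drop all changes made since then
    shutil.copytree(ckpt_dir, work_dir)


def demo() -> str:
    """Write a file, checkpoint, simulate a failed step, then roll back."""
    with tempfile.TemporaryDirectory() as root:
        work = Path(root) / "work"
        ckpt = Path(root) / "ckpt"
        work.mkdir()

        data = work / "result.txt"
        data.write_text("step 1\n")
        checkpoint(work, ckpt)           # clean state saved here

        # Simulated failure: a partial write plus a stray scratch file.
        data.write_text("step 2 (partial)\n")
        (work / "scratch.tmp").write_text("junk")

        rollback(work, ckpt)
        assert not (work / "scratch.tmp").exists()
        return data.read_text()


if __name__ == "__main__":
    print(demo(), end="")
```

In this sketch the cost of each checkpoint grows with the total amount of data copied, and the application cannot make progress while the copy runs.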
A main motivation for a distributed-memory parallel supercomputer is fast execution of the application. Thus, during the execution of an application on the distributed-memory parallel supercomputer, there is a need for a checkpointing filesystem that is not significantly slower than a filesystem without checkpointing. Similarly, there is a need for a checkpointing filesystem in which the act of checkpointing is itself fast, since the application is not executing while the filesystem is being checkpointed. There is also a need for a checkpointing filesystem that appears to the application as a normal filesystem and does not complicate the implementation of the application.
Therefore, there is a need in the art for a checkpointing filesystem on the distributed-memory parallel supercomputer that facilitates faster execution of an application executing on the distributed-memory parallel supercomputer.