High performance computing (HPC) systems are typically used to perform complex mathematical and/or scientific calculations. Such calculations may include simulations of chemical interactions, signal analysis, simulations of structural analysis, etc. Because of their complexity, these calculations often take a significant amount of time (e.g., hours, days, weeks, etc.) to complete. Errors such as hardware failures, application bugs, memory corruption, system faults, etc. can occur during the calculations and leave computed data in a corrupted and/or inconsistent state. When such errors occur, HPC systems restart the calculations, which can significantly increase the processing time needed to complete them.
To reduce the processing time spent on recalculation, checkpoints are used to store versions of calculated data at various points during the calculations. When an error occurs, the computing system restores data from the most recent checkpoint and resumes the calculation from that point. In this manner, checkpoints can decrease the overall processing time of a calculation by keeping the system from having to restart the calculation from the beginning.
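The checkpoint-and-resume behavior described above can be illustrated with a minimal sketch. It is not taken from any particular HPC system: the function names (`save_checkpoint`, `load_checkpoint`, `run`), the use of a pickle file as the checkpoint store, the running sum standing in for a long calculation, and the `fail_at` parameter used to simulate a fault are all assumptions made for illustration.

```python
import os
import pickle
import tempfile


def save_checkpoint(path, state):
    """Store a version of the calculated data (hypothetical helper).

    The state is written to a temporary file and then renamed over the
    target, so a fault during the write does not corrupt the previous
    checkpoint.
    """
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "wb") as f:
        pickle.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp_path, path)


def load_checkpoint(path):
    """Return the most recent checkpoint, or None if none exists."""
    if not os.path.exists(path):
        return None
    with open(path, "rb") as f:
        return pickle.load(f)


def run(path, total_steps, interval, fail_at=None):
    """Sum the integers 0..total_steps-1, checkpointing every `interval` steps.

    `fail_at` is an illustrative parameter that raises an error at the
    given step to simulate a hardware or software fault.
    """
    # Resume from the most recent checkpoint if one is available.
    state = load_checkpoint(path) or {"step": 0, "acc": 0}
    while state["step"] < total_steps:
        if fail_at is not None and state["step"] == fail_at:
            raise RuntimeError("simulated fault")
        state["acc"] += state["step"]  # stand-in for one unit of real work
        state["step"] += 1
        if state["step"] % interval == 0:
            save_checkpoint(path, state)
    return state["acc"]
```

In this sketch, if a run fails partway through, calling `run` again with the same checkpoint path repeats only the steps performed after the last saved checkpoint rather than the whole calculation. The checkpoint interval is a trade-off: shorter intervals reduce the work lost to an error but increase the time spent writing checkpoints.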