Storage systems store large volumes of data. In a batch computing system, multiple jobs are submitted to a cluster for execution. Each job reads some number of files from a shared file system and writes other files back to the shared file system.
One problem occurs when multiple jobs accidentally read and write the same file without proper synchronization. This creates a race condition in the file system, and the results of the jobs become ambiguous because they depend on the order in which the reads and writes happen. Another problem occurs when users or applications change files while jobs are still executing. If a user changes or updates a file that a job has already read, the result of the completed job does not include that change.
One approach to these problems is for users to inspect the final output of the jobs. If some portion of a completed job does not look accurate, the affected jobs are manually resubmitted. Manual inspection relies on a person noticing that something about a completed job does not look right. The process is time-consuming, and users may not notice every inaccuracy.