Parallel storage systems are widely used in many computing environments. They provide a high degree of concurrency, allowing many distributed processes within a parallel application to access a shared file namespace simultaneously.
Parallel computing techniques are used in many industries and applications for implementing computationally intensive models or simulations. For example, the Department of Energy uses a large number of distributed compute nodes tightly coupled into a supercomputer to model physics experiments. In the oil and gas industry, parallel computing techniques are often used for computing geological models that help predict the location of natural resources. Generally, each parallel process generates a portion, referred to as a data chunk, of a shared data object.
Checksumming is a common technique to ensure data integrity. A checksum or hash sum is a fixed-size value computed from a block of digital data to detect errors that may have been introduced during transmission or storage. The integrity of the data can be checked at any later time by recomputing the checksum and comparing the recomputed checksum with the stored checksum. If the two checksum values match, then the data was likely not altered.
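The compute, store, and recompute cycle described above can be illustrated with a minimal sketch. The sketch assumes CRC-32 from the Python standard library as the checksum function; the source does not name a particular algorithm, and the function names compute_checksum and verify_checksum are hypothetical.

```python
import zlib


def compute_checksum(data: bytes) -> int:
    """Return a fixed-size (32-bit) checksum of the data block.

    CRC-32 is an assumption made for illustration only.
    """
    return zlib.crc32(data) & 0xFFFFFFFF


def verify_checksum(data: bytes, stored_checksum: int) -> bool:
    """Recompute the checksum and compare it with the stored value."""
    return compute_checksum(data) == stored_checksum


# Checksum computed when the data is first written.
original = b"example data chunk"
stored = compute_checksum(original)

# Later, the same bytes verify, while a copy with one flipped bit does not.
corrupted = b"example data chunK"
intact_ok = verify_checksum(original, stored)
corrupted_ok = verify_checksum(corrupted, stored)
```

A match only indicates that the data was likely not altered, because distinct inputs can in principle share a checksum; a mismatch, by contrast, always indicates that the data or the stored checksum changed.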
Existing approaches apply checksums on the shared data object after it has been sent to the storage system. The checksums are applied to offset ranges on the shared data object in sizes that are pre-defined by the file system.
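As a rough illustration of checksums applied to fixed-size offset ranges, the following sketch splits a stored object into ranges of a pre-defined size and records one checksum per range. The block size, the use of CRC-32, and the function names are all assumptions chosen for illustration; they do not describe any particular file system.

```python
import zlib
from typing import List, Tuple

# Hypothetical range size; real file systems define their own value.
BLOCK_SIZE = 4


def range_checksums(obj: bytes, block_size: int = BLOCK_SIZE) -> List[Tuple[int, int, int]]:
    """Return (offset, length, checksum) for each fixed-size range of obj."""
    result = []
    for offset in range(0, len(obj), block_size):
        block = obj[offset:offset + block_size]
        result.append((offset, len(block), zlib.crc32(block) & 0xFFFFFFFF))
    return result


def find_corrupt_ranges(obj: bytes, stored: List[Tuple[int, int, int]]) -> List[int]:
    """Return the offsets of ranges whose recomputed checksum differs."""
    corrupt = []
    for offset, length, checksum in stored:
        block = obj[offset:offset + length]
        if zlib.crc32(block) & 0xFFFFFFFF != checksum:
            corrupt.append(offset)
    return corrupt


# The shared object as it arrived at the storage system.
shared_object = b"0123456789"
stored = range_checksums(shared_object)

# Corrupt one byte in the second range (offsets 4 to 7).
damaged = b"01234X6789"
bad_offsets = find_corrupt_ranges(damaged, stored)
```

In this sketch the range boundaries depend only on the pre-defined block size, not on how the parallel processes divided the object into data chunks.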
In parallel computing environments, such as High Performance Computing (HPC) applications, inherently large and complex datasets increase the potential for data corruption and therefore the need for data integrity verification. A need therefore exists for parallel techniques for generating the checksum values and for verifying the integrity of the data.