In parallel computing systems, such as High Performance Computing (HPC) applications, data storage systems must handle ever-increasing amounts of data. The inherently complex and large datasets increase the potential for data corruption and therefore the need for data integrity. As HPC environments grow to exascale (and larger) by becoming more distributed, silent data corruption (i.e., corruption unknown to the user) becomes more likely. Such data corruption cannot be allowed to remain silent.
Checksumming is one common technique for ensuring data integrity. A checksum or hash sum is a value computed from a block of digital data to detect errors that may have been introduced during transmission and/or storage. The integrity of the data can be checked at a later time by recomputing the checksum and comparing the recomputed checksum with the stored checksum. If the two checksum values match, then with high probability the data was not corrupted.
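The compute-then-verify cycle described above can be sketched as follows. This is a minimal illustration, not part of the original disclosure; it assumes SHA-256 as the checksum function, though CRC32 or other hash functions would serve the same purpose.

```python
import hashlib


def compute_checksum(data: bytes) -> str:
    # SHA-256 chosen for illustration; any checksum or hash
    # function with adequate collision resistance could be used.
    return hashlib.sha256(data).hexdigest()


def verify(data: bytes, stored_checksum: str) -> bool:
    # Recompute the checksum over the current data and compare
    # it with the checksum stored at write time. A match means
    # the data was not corrupted, with high probability.
    return compute_checksum(data) == stored_checksum


block = b"example data block"
checksum = compute_checksum(block)  # computed at write time

assert verify(block, checksum)            # intact data passes
assert not verify(block + b"!", checksum)  # altered data is detected
```

A single flipped bit anywhere in the block changes the recomputed checksum, so the comparison fails and the corruption is no longer silent.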
A need exists for improved techniques for end-to-end integrity of the data in parallel computing systems, such as HPC environments.