Currently, enterprises, data centers, and other large-scale networks often employ storage devices implemented using flash drives with high storage density (also referred to as solid state drives (SSDs)). Such networks must support high throughput and high input/output operations per second (IOPS). Given the tremendous amount of data being processed, some data loss is inevitable. For example, both the active write process, in which data is written from the host to the drive, and the background write process, in which garbage collection takes place, can lead to data loss on the flash. To ensure data accuracy and quality, the infrastructure operator typically needs to support real-time data consistency detection.
A conventional technique for data consistency detection is cyclic redundancy check (CRC). To generate a CRC at the file level, all data in a file is treated as a single data stream. N zero bits are appended to the end of the user data of the file (N is an integer >0). A suitable generator polynomial of degree N (that is, a divisor with N+1 coefficient bits) is selected based on the length of the data stream. The data stream is divided by this polynomial from left to right using modulo-2 (XOR) division, resulting in an N-bit remainder (the CRC), which replaces the N appended zero bits. The data stream is then written into storage. When the file is read, its data stream is divided by the same polynomial, and an N-bit remainder is again generated. If the remainder is an all-zero sequence, the file is deemed to be correct; otherwise, the file is deemed to contain an error.
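The file-level CRC procedure described above can be illustrated with a minimal sketch. It assumes the data stream and the generator polynomial are given as strings of '0' and '1' characters, with the polynomial written as its N+1 coefficient bits. The function names (crc_remainder, encode, check) and the example polynomial 1011 are hypothetical choices made for illustration only; they are not taken from any particular storage system.

```python
def crc_remainder(bits: str, poly: str) -> str:
    """Return the N-bit remainder of modulo-2 division of bits by poly."""
    n = len(poly) - 1                 # N = number of CRC bits
    work = list(bits)
    # Long division from left to right: wherever the leading bit is 1,
    # XOR the divisor into the working stream at that position.
    for i in range(len(bits) - n):
        if work[i] == "1":
            for j, p in enumerate(poly):
                work[i + j] = str(int(work[i + j]) ^ int(p))
    return "".join(work[-n:])


def encode(data: str, poly: str) -> str:
    """Append N zero bits, divide, and replace the zeros with the remainder."""
    n = len(poly) - 1
    return data + crc_remainder(data + "0" * n, poly)


def check(stream: str, poly: str) -> bool:
    """A stored stream is deemed correct when its remainder is all zeros."""
    n = len(poly) - 1
    return crc_remainder(stream, poly) == "0" * n


if __name__ == "__main__":
    poly = "1011"                     # illustrative generator, N = 3
    stored = encode("11010011101100", poly)
    print(stored, check(stored, poly))
    # Flip one bit to simulate corruption on the flash.
    corrupted = stored[:5] + ("1" if stored[5] == "0" else "0") + stored[6:]
    print(corrupted, check(corrupted, poly))
```

Note that check must process the entire stream before it can report a result, and a failed check returns only a pass/fail indication, not the position of the corrupted bit.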
The CRC technique requires reading out all of a file's data and then calculating the remainder over the entire file. Because the CRC process is computationally intensive and consumes substantial server processor and memory resources, it is unsuitable by itself for large-scale storage systems. Further, while the CRC technique can detect errors, it does not pinpoint the location of the errors. A more efficient and more accurate data verification technique is needed.