Data storage utilization is continually increasing, causing the proliferation of storage systems in data centers. To reduce the storage space consumed by a storage system, deduplication techniques are utilized, in which data objects or files are segmented into chunks and only the unique chunks are stored in the storage system.
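By way of illustration only, chunk-level deduplication as described above can be sketched as follows. The sketch assumes fixed-size segmentation and SHA-256 fingerprints; the names `dedup_store`, `restore`, and `CHUNK_SIZE` are hypothetical and are not taken from any particular storage system, which may instead use variable-size or content-defined segmentation.

```python
import hashlib
from typing import Dict, List

CHUNK_SIZE = 4  # hypothetical value, kept tiny so the effect is easy to see


def dedup_store(data: bytes, store: Dict[bytes, bytes],
                chunk_size: int = CHUNK_SIZE) -> List[bytes]:
    """Segment data into fixed-size chunks and store each unique chunk once.

    Returns the ordered list of chunk fingerprints (a "recipe") needed to
    rebuild the original data from the store.
    """
    recipe = []
    for offset in range(0, len(data), chunk_size):
        chunk = data[offset:offset + chunk_size]
        fingerprint = hashlib.sha256(chunk).digest()
        # A chunk whose fingerprint is already present is not stored again.
        store.setdefault(fingerprint, chunk)
        recipe.append(fingerprint)
    return recipe


def restore(recipe: List[bytes], store: Dict[bytes, bytes]) -> bytes:
    """Rebuild the original data by concatenating the referenced chunks."""
    return b"".join(store[fingerprint] for fingerprint in recipe)
```

Under these assumptions, storing `b"ABCDABCDXYZW"` produces a recipe of three fingerprints while the store holds only two chunks, because the repeated `ABCD` chunk is kept once.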
When clients recover data, there is a need to validate data integrity as the data is read back from a storage system (e.g., a backup storage system). Some data streams include integrity verification data, such as checksum data, inside the data stream for data integrity verification during restoration. However, such integrity verification data may seriously degrade deduplication of the actual data stream at the storage system. Such a data integrity check should survive migration of backup data among different tiers of storage, and it should remain efficient under random data access while preserving deduplication effectiveness at the storage systems.
In addition, a data stream such as a backup data stream typically consists of a sequence of data objects or files. Storage systems are typically unaware of individual file boundaries, which could otherwise be used as heuristics for segmentation; thus, segmentation boundaries tend to fall at arbitrary positions relative to file boundaries. As a result, performance and/or deduplication effectiveness may suffer, because a file is often represented and accessed as a whole unit during backup and synthetic operations.
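The effect of boundary-unaware segmentation can be illustrated with a minimal sketch. It assumes fixed-size chunking and SHA-256 fingerprints, and the function names `fingerprints_unaware` and `fingerprints_aware` are hypothetical. It is intended only to show the problem described above, not any particular solution: the same file appearing in two backup streams at different offsets yields different chunks when segmentation ignores file boundaries, and identical chunks when segmentation restarts at each file boundary.

```python
import hashlib
from typing import List, Set


def chunk(data: bytes, size: int) -> List[bytes]:
    """Split data into consecutive chunks of at most `size` bytes."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def fingerprints_unaware(files: List[bytes], size: int) -> Set[bytes]:
    """Chunk the concatenated stream, ignoring where each file begins."""
    stream = b"".join(files)
    return {hashlib.sha256(c).digest() for c in chunk(stream, size)}


def fingerprints_aware(files: List[bytes], size: int) -> Set[bytes]:
    """Chunk each file separately, so chunking restarts at file boundaries."""
    return {hashlib.sha256(c).digest() for f in files for c in chunk(f, size)}
```

For example, with a chunk size of 4, the streams `[b"xx", b"0123456789AB"]` and `[b"yyyyy", b"0123456789AB"]` share no chunk fingerprints under `fingerprints_unaware`, whereas under `fingerprints_aware` the three chunks of the common file are shared.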