In a deduplicated file system, such as the Data Domain™ file system from EMC® Corporation, files can be moved from a source tier to a target tier (e.g., from an active tier to a cloud tier) for long-term retention based on file system policies.
Typically, files are moved from a source tier to a target tier using either file-based data movement or physical (bulk) data movement, also referred to as seeding. File-based data movement requires logically enumerating each file's segment tree to filter out segments that already exist on the target tier. Because this enumeration involves random I/O operations, it can be very inefficient when the target tier is empty, in which case there is nothing to filter out, or when generation-zero data is being migrated. The seeding method instead performs sequential I/O operations by physically moving the containers associated with the files to be migrated in sequential order, and is generally more efficient than file-based data movement.
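The difference in access pattern between the two approaches can be illustrated with a minimal sketch. The sketch is a simplified model only and is not the file system's implementation: the names `Tier`, `file_based_move`, and `seeding_move` are hypothetical, a file's segment tree is flattened to a list of fingerprints, and a container is modeled as a list of fingerprints keyed by a container identifier.

```python
from dataclasses import dataclass, field


@dataclass
class Tier:
    # Hypothetical model: container_id -> fingerprints stored in that container.
    containers: dict = field(default_factory=dict)

    def fingerprints(self):
        return {fp for segs in self.containers.values() for fp in segs}


def file_based_move(files, source, target):
    """Walk each file's segments and copy only those missing on the target."""
    index = {fp: cid for cid, segs in source.containers.items() for fp in segs}
    present = target.fingerprints()
    container_reads = []  # order in which source containers are touched
    for segments in files.values():
        for fp in segments:
            if fp in present:
                continue  # filtered out: the target already holds this segment
            cid = index[fp]
            container_reads.append(cid)
            target.containers.setdefault(cid, []).append(fp)
            present.add(fp)
    return container_reads


def seeding_move(files, source, target):
    """Copy every container referenced by the selected files, in container order.

    Assumes an empty target; an existing container with the same id is replaced.
    """
    wanted = {fp for segments in files.values() for fp in segments}
    container_reads = []
    for cid in sorted(source.containers):
        live = [fp for fp in source.containers[cid] if fp in wanted]
        if live:
            container_reads.append(cid)
            target.containers[cid] = live
    return container_reads


source = Tier({1: ["a", "b"], 2: ["c", "d"], 3: ["e"]})
files = {"f1": ["e", "a", "c"], "f2": ["b", "e", "d"]}

print(file_based_move(files, source, Tier()))  # [3, 1, 2, 1, 2]
print(seeding_move(files, source, Tier()))     # [1, 2, 3]
```

In this model, the file-based walk touches containers in whatever order the files reference them and revisits some of them, whereas the seeding pass reads each needed container once, in ascending order. Both leave the same set of segments on the target.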
In the seeding method of data movement, the data segments belonging to all files selected for migration are transferred collectively, and the method can rely on bits in a data structure (e.g., a perfect hash vector) to detect data inconsistency. However, if the data movement is suspended due to preemption by a garbage collector or due to a system crash, the record of which bits have been reset in memory is lost. Therefore, there is a need for an alternative way of validating data consistency in such a scenario.
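The loss of in-memory state can be illustrated with a minimal sketch under stated assumptions. The class `BitVector` and the function `migrate` are hypothetical, and the convention that a bit is set for each segment awaiting transfer and reset once that segment has been copied is an assumption made for illustration only. A plain dictionary stands in for the perfect hash function.

```python
class BitVector:
    """Hypothetical stand-in for a perfect hash vector: one bit per fingerprint."""

    def __init__(self, fingerprints):
        # A dictionary plays the role of the perfect hash (unique slot per key).
        self._slot = {fp: i for i, fp in enumerate(sorted(fingerprints))}
        self._bits = [True] * len(self._slot)  # set = not yet copied (assumed)

    def reset(self, fp):
        self._bits[self._slot[fp]] = False

    def still_set(self):
        return sorted(fp for fp, i in self._slot.items() if self._bits[i])


def migrate(fingerprints, vector, stop_after=None):
    """Copy segments in order; return False if suspended before completion."""
    for n, fp in enumerate(fingerprints):
        if stop_after is not None and n == stop_after:
            return False  # models preemption or a crash mid-transfer
        vector.reset(fp)  # the segment is copied, so its bit is reset
    return True


segments = ["a", "b", "c", "d"]

vector = BitVector(segments)
finished = migrate(segments, vector, stop_after=2)
print(finished, vector.still_set())  # False ['c', 'd']

# The vector lives only in memory, so after a restart it must be rebuilt.
rebuilt = BitVector(segments)
print(rebuilt.still_set())           # ['a', 'b', 'c', 'd']
```

After the simulated restart, the rebuilt vector no longer distinguishes the segments that were already copied from those that were not, which is the gap the scenario above describes.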