In a deduplicated file system, a file may be represented as a file tree with one or more levels of segments arranged in a hierarchy. Internal nodes of the file tree hold fingerprint segments; only the lowest-level nodes (e.g., L0 segments) hold the actual deduplicated data segments. A fingerprint may be a collision-resistant hash of a segment. For example, an L1 segment may include fingerprints that identify the L0 segments. Similarly, an L2 segment may include fingerprints that identify the L1 segments, and so on.
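The segment hierarchy described above can be illustrated with a minimal sketch. The hash function (SHA-1), the `fanout` parameter, and the function names are assumptions made for illustration only; the text does not specify them.

```python
import hashlib


def fingerprint(segment: bytes) -> bytes:
    # Assumption: SHA-1 stands in for whatever hash the file system uses.
    return hashlib.sha1(segment).digest()


def build_file_tree(l0_segments, fanout=2):
    """Return a list of levels for one file.

    levels[0] holds the L0 data segments. Each higher level holds
    segments formed by concatenating the fingerprints of up to
    `fanout` segments from the level below (fanout is hypothetical).
    """
    levels = [list(l0_segments)]
    while len(levels[-1]) > 1:
        child_fps = [fingerprint(seg) for seg in levels[-1]]
        parents = [
            b"".join(child_fps[i:i + fanout])
            for i in range(0, len(child_fps), fanout)
        ]
        levels.append(parents)
    return levels
```

With four L0 segments and a fanout of two, this yields two L1 segments and a single L2 segment at the root, each containing only fingerprints of the level below.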
Multiple files may share the same data segment in a deduplicated file system so long as the fingerprints match. The actual data segments may be grouped and stored in a storage device, e.g., a hard disk drive (HDD), as a fixed-size container. The fingerprints of the data segments may also be stored in the container and indexed as {fp, container_id} pairs. The container and the index may collectively form a collection partition (CP), which is the data structure that manages the deduplicated file system.
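As a rough sketch of how containers and the {fp, container_id} index might fit together, consider the following. The class name, the method names, the use of SHA-1, and the capacity measured in segments rather than bytes are all simplifying assumptions, not details taken from the text.

```python
import hashlib

# Hypothetical capacity: a real container would be a fixed size in bytes.
CONTAINER_CAPACITY = 4


class CollectionPartition:
    def __init__(self):
        # containers[container_id] maps fingerprint -> data segment
        self.containers = [{}]
        # index maps fingerprint -> container_id
        self.index = {}

    def write_segment(self, data: bytes) -> bytes:
        fp = hashlib.sha1(data).digest()  # assumed fingerprint function
        if fp in self.index:
            # Fingerprint already known: the segment is shared, not re-stored.
            return fp
        if len(self.containers[-1]) >= CONTAINER_CAPACITY:
            self.containers.append({})
        container_id = len(self.containers) - 1
        self.containers[container_id][fp] = data
        self.index[fp] = container_id
        return fp

    def read_segment(self, fp: bytes) -> bytes:
        return self.containers[self.index[fp]][fp]
```

Writing the same bytes twice returns the same fingerprint and stores the segment once, which is how two files can reference one shared data segment.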
When many expired files have accumulated within the file system, a cleaning service may be executed to remove “dead” segments. As part of the cleaning service, to ensure data consistency between the container and the index, the service may verify that every fingerprint in the container properly identifies its data segment according to the index. This may be performed, for example, by computing checksums over all the fingerprints in the container and comparing them against checksums of the “live” fingerprints in the index.
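One possible form of the checksum comparison is sketched below. The checksum construction (SHA-256 over the sorted, fixed-length fingerprints) and the function names are assumptions chosen for illustration; the text does not state how the checksums are computed.

```python
import hashlib


def fingerprint_checksum(fingerprints) -> str:
    # Assumption: fingerprints are fixed-length byte strings, so sorting
    # and concatenating them gives an order-independent checksum input.
    h = hashlib.sha256()
    for fp in sorted(set(fingerprints)):
        h.update(fp)
    return h.hexdigest()


def is_consistent(container_fps, index, container_id) -> bool:
    """Compare a container's fingerprints with the index's live entries.

    `index` maps fingerprint -> container_id, as in the {fp, container_id}
    pairs described earlier.
    """
    live_fps = [fp for fp, cid in index.items() if cid == container_id]
    return fingerprint_checksum(container_fps) == fingerprint_checksum(live_fps)
```

A mismatch indicates that the container holds a fingerprint the index does not reference for it, or that the index references a fingerprint the container lacks.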
With the introduction of a metadata-separated CP, however, performing such a cleaning service may pose a challenge, as the fingerprints (e.g., metadata) are stored separately from the actual data segments (e.g., L0 segments), which are stored as objects.