In a deduplicated file system, such as the Data Domain™ file system from EMC® Corporation, two components are responsible for managing the files in the system. The first is the directory manager (DM), which is a hierarchical mapping from a path to the inode representing a file. The second is the content store (CS), which manages the content of the file. Each file has a content handle (CH), stored in the inode, that CS creates every time the file content changes. Each CH represents a file that is abstracted as a Merkle tree of segments. A file tree can have multiple levels, for example seven levels: L0, . . . , L6. The L0 segments represent user data and are the leaves of the tree. The L6 segment is the root of the segment tree. Segments from L1 to L6 are referred to as metadata segments or Lp segments. They represent the metadata of the file associated with the file tree. An L1 segment is an array of L0 references. Similarly, an L2 segment is an array of L1 references, and so on.
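The segment tree described above can be illustrated with a minimal sketch. It assumes SHA-256 as the fingerprint function and a small, fixed fan-out; neither is stated in the text. The names `Segment`, `build_tree`, and `child_fingerprints` are hypothetical and chosen only for this example.

```python
import hashlib
from dataclasses import dataclass
from typing import Dict, List, Tuple

FP_SIZE = 32  # digest size of SHA-256, the fingerprint function assumed here


def fingerprint(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()


@dataclass(frozen=True)
class Segment:
    level: int      # 0 for user data, 1..6 for Lp metadata segments
    payload: bytes  # raw data for L0; concatenated child fingerprints for Lp

    @property
    def fp(self) -> bytes:
        # The level is mixed in so segments at different levels never collide.
        return fingerprint(bytes([self.level]) + self.payload)


def child_fingerprints(seg: Segment) -> List[bytes]:
    """Split an Lp payload back into the references it holds."""
    if seg.level == 0:
        return []
    return [seg.payload[i:i + FP_SIZE] for i in range(0, len(seg.payload), FP_SIZE)]


def build_tree(chunks: List[bytes], fanout: int = 2,
               top_level: int = 6) -> Tuple[Segment, Dict[bytes, Segment]]:
    """Build L0..L6 bottom-up and return (root, fingerprint -> segment store)."""
    if not chunks:
        raise ValueError("at least one chunk is required")
    store: Dict[bytes, Segment] = {}
    current = [Segment(0, c) for c in chunks]
    for seg in current:
        store[seg.fp] = seg
    for level in range(1, top_level + 1):
        parents = []
        for i in range(0, len(current), fanout):
            group = current[i:i + fanout]
            # An Lp segment is an array of references to the level below.
            parent = Segment(level, b"".join(child.fp for child in group))
            store[parent.fp] = parent
            parents.append(parent)
        current = parents
    if len(current) != 1:
        raise ValueError("too many chunks for this fan-out and depth")
    return current[0], store
```

Because every Lp segment is derived from the fingerprints below it, changing any L0 segment changes the fingerprint of the L6 root, which is the property that lets a content handle identify one version of a file.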
A segment is considered live if it can be referenced by any live content in the file system. The file system packs the segments into containers, which are written to disk in a log-structured manner. Each container is structured into sections. The first section is the metadata section and the following sections are referred to as compression regions (CRs). A CR is a set of compressed segments. The metadata section holds all the references, or fingerprints, that identify the segments in the container. It also stores a field called content type, which describes the content of the container, for instance which compression algorithm has been used and which types of segments the container holds (L0, . . . , L6). A container manager is responsible for maintaining the log-structured container set and for providing a mapping from container identifiers (CIDs) to block offsets on disk. This mapping is stored entirely in memory. It also contains additional information, e.g., the content type of each container. Hence, it is easy to traverse the container manager metadata and select which containers to load from disk based on their content type. For instance, processing logic can traverse the entire container set and read only the containers that have L6 segments in them.
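The content-type filtering described above can be sketched as follows. This is an illustration only: the `ContainerManager` class, its method names, and the representation of the content type as a set of segment levels plus a compression label are assumptions made for the example, not the actual interface of the file system.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, Iterable, Iterator


@dataclass(frozen=True)
class ContainerInfo:
    block_offset: int               # where the container starts on disk
    segment_levels: FrozenSet[int]  # which of L0..L6 the container holds
    compression: str                # label of the compression algorithm used


class ContainerManager:
    """In-memory map from container ID (CID) to offset and content type."""

    def __init__(self) -> None:
        self._index: Dict[int, ContainerInfo] = {}

    def register(self, cid: int, block_offset: int,
                 segment_levels: Iterable[int], compression: str = "lz") -> None:
        self._index[cid] = ContainerInfo(block_offset,
                                         frozenset(segment_levels), compression)

    def offset_of(self, cid: int) -> int:
        return self._index[cid].block_offset

    def containers_with_level(self, level: int) -> Iterator[int]:
        # Only in-memory metadata is consulted; no container is read from disk.
        for cid in sorted(self._index):
            if level in self._index[cid].segment_levels:
                yield cid


cm = ContainerManager()
cm.register(1, block_offset=0, segment_levels=[0])
cm.register(2, block_offset=4096, segment_levels=[6])
cm.register(3, block_offset=8192, segment_levels=[1, 2])
cm.register(4, block_offset=12288, segment_levels=[5, 6])

l6_cids = list(cm.containers_with_level(6))  # CIDs worth reading for L6 segments
```

In this sketch only containers 2 and 4 would be read from disk when looking for L6 segments; the others are skipped using the in-memory metadata alone.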
A cleaning process (also referred to as a garbage collection process) of the file system is responsible for enumerating all live segments in the live content handles of the file system. A physical garbage collector does not understand the concept of file trees. It traverses all the files simultaneously using a breadth-first approach. Hence, it cannot compute a rolling per-file-tree checksum that would allow the garbage collector to identify whether any metadata segment has been missed. A conventional garbage collection (GC) process scans all the Lp containers multiple times to perform a few independent tasks, which is inefficient in both memory and processing resources.
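A breadth-first enumeration across all files at once can be sketched as below. The sketch assumes the traversal proceeds one level at a time, from the root level down to L1, and that the segments of each level can be scanned as a group; the text does not spell out these details. The function `enumerate_live`, the `LevelIndex` layout, and the string fingerprints are hypothetical.

```python
from typing import Dict, Iterable, List, Set

# Hypothetical layout: level -> {segment fingerprint: child fingerprints}
LevelIndex = Dict[int, Dict[str, List[str]]]


def enumerate_live(root_fps: Iterable[str], index: LevelIndex,
                   top_level: int = 6) -> Set[str]:
    """Mark every segment reachable from the live roots, one level at a time."""
    frontier = set(root_fps)
    live = set(frontier)
    for level in range(top_level, 0, -1):
        next_frontier: Set[str] = set()
        # One pass over this level's segments, regardless of owning file.
        for fp, children in index.get(level, {}).items():
            if fp in frontier:
                next_frontier.update(children)
        live |= next_frontier
        frontier = next_frontier
    return live


# Two live files (roots A2, B2) share the L0 segment "y"; D2 is a dead tree.
index: LevelIndex = {
    2: {"A2": ["A1"], "B2": ["B1"], "D2": ["D1"]},
    1: {"A1": ["x", "y"], "B1": ["y", "z"], "D1": ["w"]},
}
live = enumerate_live(["A2", "B2"], index, top_level=2)
```

The live set merges the segments of every file at each level, so the traversal never sees one file tree in isolation. That is why a checksum rolled per file tree is not available to this style of collector.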