Reclaiming space in a deduplicated file system is challenging because files share data segments (also called data chunks). Each file is treated as a sequence of data segments, and the segment boundaries are determined by content-based segmentation techniques. The unit of deduplication is the segment, which is identified by a fingerprint computed over its content using a standard hashing technique (e.g., SHA-1 or SHA-2). Many segments are packed into a data block (also referred to as a container), and the data block is the unit of storage. Therefore, once a file is deleted, the space occupied by its data segments can be reclaimed only if those segments are not shared with any other file. The question is how to efficiently keep track of the segments that are no longer in use. This problem is very different from that of a traditional, non-deduplicated file system, in which each file has its own blocks and those blocks can be reclaimed as soon as the file is deleted.
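The sharing constraint described above can be illustrated with a minimal sketch. It is an illustration only, under stated assumptions: fixed-size splitting stands in for content-based segmentation, segments are kept in an in-memory dictionary rather than packed into containers, and the names `DedupStore`, `split_segments`, `fingerprint`, and `reclaimable_on_delete` are hypothetical.

```python
import hashlib

SEGMENT_SIZE = 4  # hypothetical and tiny, chosen only to keep the example readable


def split_segments(data: bytes, size: int = SEGMENT_SIZE) -> list:
    # Assumption: fixed-size splitting stands in for content-based segmentation.
    return [data[i:i + size] for i in range(0, len(data), size)]


def fingerprint(segment: bytes) -> str:
    # A segment is identified by a hash of its content (SHA-1 here).
    return hashlib.sha1(segment).hexdigest()


class DedupStore:
    def __init__(self):
        self.segments = {}  # fingerprint -> segment bytes, each stored once
        self.files = {}     # file name -> ordered list of fingerprints

    def write(self, name: str, data: bytes) -> None:
        recipe = []
        for seg in split_segments(data):
            fp = fingerprint(seg)
            self.segments.setdefault(fp, seg)  # duplicate content is not stored again
            recipe.append(fp)
        self.files[name] = recipe

    def reclaimable_on_delete(self, name: str) -> set:
        # Only segments referenced by no other file could be freed.
        used_elsewhere = set()
        for other, recipe in self.files.items():
            if other != name:
                used_elsewhere.update(recipe)
        return set(self.files[name]) - used_elsewhere


store = DedupStore()
store.write("a.txt", b"AAAABBBBCCCC")
store.write("b.txt", b"AAAABBBBDDDD")
# Six segments were written, but only four distinct ones are stored.
# Deleting a.txt would free only the segment holding b"CCCC".
```

Note that `reclaimable_on_delete` scans every other file, which is exactly the bookkeeping cost that makes reclamation hard at scale.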
Typically, a storage system includes a garbage collector that reclaims space in a deduplicated file system. Garbage collection is a well-known technique for reclaiming internal memory that is no longer in use, and it became fairly popular with the Java programming language. In the context of file systems, it was not widely used until the advent of log-structured file systems. A deduplicated file system, such as the Data Domain™ deduplicated file system from EMC® Corporation, is a log-structured file system that implements a mark-and-sweep technique. A mark-and-sweep garbage collection approach consists of two steps: (i) mark all segments in use in the file system as alive; and (ii) sweep away all unused segments and free the space they occupied.
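The two steps can be sketched as follows. This is a simplified illustration, not the implementation of any particular product: segments are assumed to live in a dictionary keyed by fingerprint, each file is assumed to be a flat list of fingerprints, and the function names `mark` and `sweep` and the sample data are hypothetical.

```python
def mark(files: dict) -> set:
    # Step (i): every fingerprint referenced by any file is alive.
    live = set()
    for recipe in files.values():
        live.update(recipe)
    return live


def sweep(segments: dict, live: set) -> int:
    # Step (ii): remove segments that were not marked; return the bytes freed.
    dead = [fp for fp in segments if fp not in live]
    freed = 0
    for fp in dead:
        freed += len(segments.pop(fp))
    return freed


# Hypothetical state: segment "f3" is left over from a file that was deleted.
segments = {"f1": b"AAAA", "f2": b"BBBB", "f3": b"CCCC"}
files = {"a": ["f1", "f2"], "b": ["f1"]}

live = mark(files)
freed = sweep(segments, live)
# "f1" survives because both files reference it; only "f3" is freed.
```

In this sketch the mark step visits each file's references once, so a segment shared by many files is looked at many times even though it is marked only once.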
A conventional garbage collection process marks and sweeps the segments of files using a depth-first approach, in which the segment tree representing each file is traversed from the top level to the bottom level, one file at a time. Some segments may be referenced, or shared, by multiple files, so such an approach may have to traverse the same segment or segments repeatedly, once for each file that references them. As a result, garbage collection time is proportional to the size of the logical space stored in the deduplicated file system rather than to the amount of metadata stored in the system. Garbage collection time is also very sensitive to the locality of the metadata in the system and to the number of small files (as opposed to larger files) stored in the file system. These problems are becoming recurrent as conventional file systems are used for mixed workloads that the original design did not foresee. Hence, it is important to build a garbage collector that is resilient to these problems, so that the file system can be cleaned in a timely fashion.
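The cost of the file-by-file, depth-first traversal can be made concrete with a small sketch. It assumes a hypothetical in-memory model in which `tree` maps each segment to its child segments (leaves have no children) and `roots` maps each file to its top-level segment; the function name `mark_depth_first` and all segment names are illustrative only.

```python
def mark_depth_first(tree: dict, roots: dict):
    # Walk each file's segment tree from top to bottom, one file at a time,
    # without skipping segments that an earlier file already marked.
    live = set()
    visits = 0

    def walk(fp: str) -> None:
        nonlocal visits
        visits += 1
        live.add(fp)
        for child in tree[fp]:
            walk(child)

    for root in roots.values():
        walk(root)
    return live, visits


# Two files share the metadata segment "m1" and its data segments "d1", "d2".
tree = {
    "rootA": ["m1", "m2"],
    "rootB": ["m1", "m3"],
    "m1": ["d1", "d2"],
    "m2": ["d3"],
    "m3": ["d4"],
    "d1": [], "d2": [], "d3": [], "d4": [],
}
roots = {"fileA": "rootA", "fileB": "rootB"}

live, visits = mark_depth_first(tree, roots)
# 12 segment visits are needed to mark only 9 distinct segments.
```

The gap between the visit count and the number of distinct segments grows with the degree of sharing, which is why the traversal time tracks logical size rather than the amount of stored metadata.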