Most storage systems that do not overwrite data in place need a mechanism for garbage collection (“GC”), that is, reclaiming storage that is no longer in use while preserving live data. Since the advent of “log-structured file systems”, there has been work to optimize the cost of cleaning the file system to consolidate live data and make room for large contiguous areas of new data to be written. Most past efforts in this area have aimed to optimize the input/output (I/O) costs, as any effort to read and rewrite data reduces the throughput available for new data.
With deduplicating storage systems there is an additional complication: that of identifying what data is live in the first place. As new data is written to a system, duplicate chunks are replaced with references to previously stored data, so it is essential to track such new references. Approaches that require the storage system to be read-only during GC eventually give way to more complicated real-time reference management, using techniques such as epochs to control referential integrity.
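By way of illustration only, the replacement of duplicate chunks with references can be sketched as follows. The sketch assumes fixed-size chunking and SHA-256 fingerprints; the class name `DedupStore`, its methods, and the tiny chunk size are hypothetical and are not taken from any particular system.

```python
import hashlib


class DedupStore:
    """Toy deduplicating store (illustrative assumption, not a real system)."""

    def __init__(self, chunk_size=4):
        self.chunk_size = chunk_size
        self.chunks = {}  # fingerprint -> chunk bytes, stored once
        self.files = {}   # file name -> ordered list of fingerprints

    def write(self, name, data):
        recipe = []
        for i in range(0, len(data), self.chunk_size):
            chunk = data[i:i + self.chunk_size]
            fp = hashlib.sha256(chunk).hexdigest()
            # A duplicate chunk is not stored again; only a reference is kept.
            self.chunks.setdefault(fp, chunk)
            recipe.append(fp)
        self.files[name] = recipe

    def read(self, name):
        return b"".join(self.chunks[fp] for fp in self.files[name])

    def logical_bytes(self):
        # Bytes visible to applications, counting every reference.
        return sum(len(self.chunks[fp])
                   for recipe in self.files.values() for fp in recipe)

    def physical_bytes(self):
        # Bytes actually stored after deduplication.
        return sum(len(c) for c in self.chunks.values())
```

Because a chunk may be referenced by any number of files, deleting one file does not by itself indicate whether its chunks can be reclaimed, which is the liveness problem described above.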
Deduplicating systems face other challenges with respect to GC. As workloads evolve, some systems experience very different usage than traditional deduplicating backup storage systems were intended to support. Advanced backup systems are designed to handle a relatively low number (thousands) of relatively large files (gigabytes), namely the full and incremental backups that have been the mainstay of computer backups for decades. In addition, the expectation is that the logical space, i.e., the set of files that could be written and read by applications such as a backup application, would be only a relatively small factor larger than the physical space, i.e., the storage consumed after deduplication. Typical deduplication ratios have been assumed to be in the neighborhood of 10-20 times or less, but this has been changing dramatically in some environments. Thus, new technology trends are increasing both the deduplication ratio and the number of files represented in storage systems.
One current system uses a mark-and-sweep algorithm that determines the set of live chunks reachable from the live files and then frees up unreferenced space. There are also other alternatives such as reference counting.
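A minimal sketch of such a mark-and-sweep cycle is given below. It assumes, hypothetically, that each live file is represented as a list of chunk fingerprints and that stored chunks are held in a dictionary keyed by fingerprint; the function names `mark` and `sweep` are illustrative only.

```python
def mark(files):
    """Mark phase: collect every fingerprint reachable from a live file."""
    live = set()
    for recipe in files.values():
        live.update(recipe)
    return live


def sweep(chunks, live):
    """Sweep phase: free chunks that no live file references.

    Returns the number of chunks reclaimed.
    """
    dead = [fp for fp in chunks if fp not in live]
    for fp in dead:
        del chunks[fp]
    return len(dead)
```

In this sketch the mark phase walks every file, so its cost grows with the number of files and with the number of references, even when many references point to the same chunk.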
In prior systems, GC was performed at the logical level, meaning the system analyzed each file to determine the set of live chunks in the storage system. The shift to using individual file-level backups, rather than tar-like aggregates, meant that the number of files in some systems increased dramatically. This resulted in high GC overhead during the mark phase, especially due to the amount of random I/O required. At the same time, the high deduplication ratios in some systems resulted in the same live chunks being repeatedly identified, again resulting in high GC overhead. The time to complete a single cycle of GC in such systems could be on the order of several days. Since backing up data concurrently with GC results in contention for disk I/O and processing, there is a significant performance implication to such long GC cycles; in addition, a full system might run out of capacity while awaiting space to be reclaimed.
Therefore, there is a need to redesign GC to work at the physical level: instead of GC enumerating all live files and their referenced chunks, entailing random access to all files, GC performs a series of sequential passes through the physical storage containers containing numerous chunks. Because the I/O pattern is sequential and because it scales with the physical capacity rather than the deduplication ratio or the number of individual files, the overhead is relatively constant and proportional to the size of the system.
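One way such physical-level passes might be organized is sketched below, purely as an assumption for illustration. The sketch supposes that containers hold both data chunks (level 0) and metadata chunks (level 1 and above) that list the fingerprints of their children, and that the roots of live files are known. The `Chunk` type, the `level` field, and the function `physical_gc` are hypothetical names and do not describe any specific implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    fp: str               # fingerprint of this chunk
    level: int            # 0 = data chunk, >0 = metadata chunk
    children: tuple = ()  # fingerprints referenced by a metadata chunk


def physical_gc(containers, live_roots, max_level):
    """Mark liveness with one sequential container scan per level,
    then copy live chunks forward. Files are never enumerated."""
    live = set(live_roots)
    # Propagate liveness from the top level down, one pass per level.
    for level in range(max_level, 0, -1):
        for container in containers:  # sequential pass over storage
            for chunk in container:
                if chunk.level == level and chunk.fp in live:
                    live.update(chunk.children)
    # Final pass: keep only live chunks and drop emptied containers.
    compacted = []
    for container in containers:
        survivors = [c for c in container if c.fp in live]
        if survivors:
            compacted.append(survivors)
    return live, compacted
```

In this sketch each stored chunk is examined once per pass regardless of how many files reference it, so the work tracks the number of containers rather than the deduplication ratio or the file count.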
The subject matter discussed in the background section should not be assumed to be prior art merely as a result of its mention in the background section. Similarly, a problem mentioned in the background section or associated with the subject matter of the background section should not be assumed to have been previously recognized in the prior art. The subject matter in the background section merely represents different approaches, which in and of themselves may also be inventions. EMC, Data Domain, Data Domain Restorer, and Data Domain Boost are trademarks of Dell EMC Corporation.