Electronic file systems store various types of objects, such as files and file metadata, in “clumps” of memory. As used herein, a “clump” is any range of contiguous memory. For example, in persistent storage devices, a clump may comprise a physically or logically contiguous set of disk blocks. When an object is initially stored in persistent storage, the file management system writes the data that makes up the object into specific clumps within persistent storage, and generates metadata for the data object. The metadata may, for example, map an object identifier of the object to the one or more clumps that store the data of the object. For example, if object file1.txt is stored in clumps a, b, and c, then the metadata for file1.txt would contain mapping information that maps the object identifier of file1.txt to clumps a, b, and c. The metadata that maps object identifiers to clumps is referred to herein as object-to-clump mapping information.
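By way of illustration only, the object-to-clump mapping information for file1.txt could be sketched as a simple dictionary. The names `object_to_clumps` and `clumps_for` are hypothetical and do not come from any particular file management system.

```python
# Hypothetical object-to-clump mapping information:
# object identifier -> ordered list of clump identifiers.
object_to_clumps: dict[str, list[str]] = {
    "file1.txt": ["a", "b", "c"],
}


def clumps_for(object_id: str) -> list[str]:
    """Return the clumps that store the data of the given object."""
    return object_to_clumps[object_id]
```

Looking up "file1.txt" in this sketch yields clumps a, b, and c, in the order in which they hold the data of the object.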
In some systems, a single clump may be mapped to multiple objects. This may occur, for example, when the file management system implements deduplication techniques. Deduplication is a compression technique for eliminating duplicate copies of data. For example, assume that a portion of file1.txt is identical to a portion of file2.txt. Under these circumstances, the data for that identical portion may be stored in a single clump, and the object-to-clump mappings of both file1.txt and file2.txt may point to that same clump. By allowing many-to-one relationships from objects to clumps, the file management system reduces the need to store duplicate copies of the same data. However, at the same time, many-to-one relationships from objects to clumps complicate the process of storage reclamation. Specifically, when file1.txt is deleted, the clumps that were used by file1.txt cannot necessarily be reclaimed because one or more of those clumps may be pointed to by a different live object.
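The many-to-one relationship described above may be sketched as follows. This is a minimal illustration that assumes clumps are identified by a SHA-256 fingerprint of their content; the text above does not specify how duplicate data is detected, and `clump_store`, `object_to_clumps`, and `store_object` are hypothetical names.

```python
import hashlib

# Hypothetical in-memory stand-ins for persistent storage and mapping metadata.
clump_store: dict[str, bytes] = {}
object_to_clumps: dict[str, list[str]] = {}


def store_object(object_id: str, chunks: list[bytes]) -> None:
    """Store each chunk once, keyed by its content fingerprint (an assumption)."""
    clump_ids = []
    for chunk in chunks:
        clump_id = hashlib.sha256(chunk).hexdigest()
        # Identical content maps to the same clump, so it is stored only once.
        clump_store.setdefault(clump_id, chunk)
        clump_ids.append(clump_id)
    object_to_clumps[object_id] = clump_ids


store_object("file1.txt", [b"shared portion", b"only in file1"])
store_object("file2.txt", [b"shared portion", b"only in file2"])
```

After both calls, four chunks have been written but only three clumps are stored, and the mappings of file1.txt and file2.txt both point to the clump holding the shared portion.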
When accessing stored objects, the file management system refers to the object-to-clump mapping to determine which clumps belong to a specific object. Once the desired clumps are identified, the file management system accesses the clumps from persistent storage. When deleting and/or updating an object, the file management system removes the object-to-clump mapping between the object and the one or more clumps that contain the object's data, but does not reclaim those clumps. Consequently, the clumps that were mapped to an object are no longer reachable by applications through the object identifier of that object. However, in systems that support many-to-one mappings of object identifiers to clumps, those same clumps may still be reachable through the object identifiers of other objects. The technique of deleting only the object-to-clump mapping without reclaiming the clumps is efficient in that it only requires removing object-to-clump mappings. However, the clumps that are no longer in use have to eventually be reclaimed so that they may be used for new data.
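A minimal sketch of deletion that removes only the mapping is given below. The helper names `delete_object` and `is_reachable`, and the sample objects and clumps, are hypothetical and chosen only to illustrate the behavior described above.

```python
# Hypothetical storage contents and mapping metadata; clump "b" is shared.
clump_store: dict[str, bytes] = {"a": b"data-a", "b": b"data-b", "c": b"data-c"}
object_to_clumps: dict[str, list[str]] = {
    "file1.txt": ["a", "b"],
    "file2.txt": ["b", "c"],
}


def delete_object(object_id: str) -> None:
    """Remove only the object-to-clump mapping; the clumps are not reclaimed."""
    del object_to_clumps[object_id]


def is_reachable(clump_id: str) -> bool:
    """A clump is reachable if any remaining object maps to it."""
    return any(clump_id in clumps for clumps in object_to_clumps.values())


delete_object("file1.txt")
```

After the deletion, all three clumps are still present in storage. Clump b remains reachable through file2.txt, whereas clump a is no longer reachable through any object identifier and awaits reclamation.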
As the objects of a file system are updated and deleted, the file system accumulates clumps that contain data that is no longer used by any software application (referred to herein as “dead data”). A clump that contains dead data is referred to herein as a “dead clump”. For example, assume that an object O is initially stored in clumps A, B and C. If object O is updated, the updated version of object O may be written out to clumps X, Y and Z. In response to the update, the metadata that maps object O to clumps A, B and C is deleted, and metadata that maps object O to clumps X, Y and Z is created. Assuming that no other object is mapped to clumps A, B and C, clumps A, B and C will be dead after the update. That is, the data in clumps A, B and C is no longer in use by any software application. Effective storage management requires that those dead clumps be reclaimed so that the clumps may be reused to store new “live” data.
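The example of object O may be sketched as follows, assuming for illustration that the set of all allocated clumps is tracked separately from the mapping metadata. The names `all_clumps`, `update_object`, and `dead_clumps` are hypothetical.

```python
# Hypothetical state before the update: object O is stored in clumps A, B and C.
object_to_clumps: dict[str, list[str]] = {"O": ["A", "B", "C"]}
all_clumps: set[str] = {"A", "B", "C"}


def update_object(object_id: str, new_clumps: list[str]) -> None:
    """Write the updated version to new clumps and replace the old mapping."""
    all_clumps.update(new_clumps)
    object_to_clumps[object_id] = list(new_clumps)


def dead_clumps() -> set[str]:
    """Clumps that are allocated but not mapped to any object."""
    live = {c for clumps in object_to_clumps.values() for c in clumps}
    return all_clumps - live


update_object("O", ["X", "Y", "Z"])
```

Because no other object is mapped to clumps A, B and C in this sketch, `dead_clumps()` reports exactly those three clumps after the update.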
One technique for reclaiming dead clumps is referred to as the mark-and-sweep approach. The mark-and-sweep approach involves (a) identifying the objects that are currently “live” or “reachable” by software entities, (b) marking the clumps that store data for those objects as “live”, and then (c) freeing up all clumps that are not marked as “live”.
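Steps (a) through (c) may be sketched as follows. This is a simplified illustration in which every object present in the mapping metadata is treated as live; the function names `mark` and `sweep` and the sample data are hypothetical.

```python
def mark(object_to_clumps: dict[str, list[str]]) -> set[str]:
    """Steps (a) and (b): collect the clumps of every live object."""
    live: set[str] = set()
    for clumps in object_to_clumps.values():
        live.update(clumps)
    return live


def sweep(clump_store: dict[str, bytes], live: set[str]) -> list[str]:
    """Step (c): free every clump that is not marked as live."""
    dead = [clump_id for clump_id in clump_store if clump_id not in live]
    for clump_id in dead:
        del clump_store[clump_id]
    return dead


# Hypothetical example: clumps A, B and C are no longer mapped to any object.
store = {name: b"" for name in ["A", "B", "C", "X", "Y", "Z"]}
mapping = {"O": ["X", "Y", "Z"]}
freed = sweep(store, mark(mapping))
```

In this sketch the mark phase visits every mapping entry, so its cost grows with the size of the object-to-clump mapping rather than with the number of dead clumps.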
In some implementations, to reduce storage fragmentation, the live clumps within a logically or physically contiguous storage region are moved to a different location so that the entire contiguous storage region is made available for reuse. The process of moving the live clumps of a region to another location is referred to as a copy-forward operation.
As mentioned above, the first step of a mark-and-sweep approach involves identifying live clumps based upon whether a reference to a clump exists within the object-to-clump mapping metadata. Approaches to identifying live clumps include scanning the entire set of object-to-clump mappings, and marking all clumps referenced within those mappings. Marking may include either generating a list of live clumps or setting a marked attribute within the clump itself. During the copy-forward operation, the reclamation process steps through all clumps stored within an area of persistent storage and, if a clump is marked as live, the clump is copied forward to the new area of persistent storage. Once all clumps within the area have been stepped through, the area is reallocated, since all live clumps have already been copied forward.
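The copy-forward operation may be sketched as follows, assuming for illustration that each area of persistent storage is modeled as a dictionary of clumps and that marking produced a set of live clump identifiers. The name `copy_forward` and the sample regions are hypothetical.

```python
def copy_forward(region: dict[str, bytes], live: set[str],
                 target: dict[str, bytes]) -> int:
    """Copy the live clumps of a region to a target area, then free the region."""
    copied = 0
    # Step through every clump stored within the area.
    for clump_id, data in region.items():
        if clump_id in live:
            target[clump_id] = data
            copied += 1
    # All live clumps have been copied forward, so the whole area can be reused.
    region.clear()
    return copied


# Hypothetical example: only clumps B and D are marked as live.
old_region = {"A": b"a", "B": b"b", "C": b"c", "D": b"d"}
new_region: dict[str, bytes] = {}
moved = copy_forward(old_region, {"B", "D"}, new_region)
```

After the call, the new area holds clumps B and D and the old area is empty, which corresponds to the entire contiguous storage region being made available for reuse.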
Unfortunately, the mark-and-sweep approach can consume a significant amount of compute and memory resources, especially when combined with copy-forward operations. For example, if the object-to-clump mapping is relatively large, then scanning the object-to-clump mapping for live clumps may take a considerable amount of time to complete. As storage systems grow larger, the scan takes correspondingly longer to complete, and the approach becomes inefficient. Therefore, an efficient approach to determining live clumps and copying forward the live clumps to new areas of persistent storage is desired.
The approaches described in this section are approaches that could be pursued, but not necessarily approaches that have been previously conceived or pursued. Therefore, unless otherwise indicated, it should not be assumed that any of the approaches described in this section qualify as prior art merely by virtue of their inclusion in this section.