Data backup systems (e.g., those made by Dell EMC) may deduplicate metadata and data to reduce the amount of stored data by orders of magnitude. Typically, stale backups are deleted according to user-supplied or default retention policies (e.g., backups that are 10 days old or older are deleted). When the deletion of one or more backups leaves data or metadata objects no longer referenced by any of the remaining backups, these “orphaned” objects may be deleted through a process known as “garbage collection.” In other words, a metadata or data object is kept only for the retention period following the last backup in which it appears. For example, in a daily backup scenario where the policies dictate that backups be kept for 10 days, a metadata object that was last seen in a backup that is 10 days old (e.g., because the user deleted the metadata object and the associated data object 9 days ago) may be deleted from the backup system. Once a metadata or data object has expired in this way, it is deleted using the garbage collection process.
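The retention example above can be illustrated with a minimal sketch. It assumes a hypothetical in-memory model in which each backup is a date plus the set of object identifiers it references; the names `expired_backups`, `orphaned_objects`, and the sample identifiers are illustrative only and are not taken from any particular backup product.

```python
from datetime import date, timedelta

def expired_backups(backups, today, retention_days=10):
    """Return IDs of backups that are retention_days old or older."""
    cutoff = today - timedelta(days=retention_days)
    return {bid for bid, (taken, _) in backups.items() if taken <= cutoff}

def orphaned_objects(backups, expired):
    """Return objects referenced only by expired backups."""
    live, dead = set(), set()
    for bid, (_, objs) in backups.items():
        # Objects in retained backups are live; the rest are candidates.
        (dead if bid in expired else live).update(objs)
    return dead - live

# Hypothetical daily backups: ID -> (date taken, referenced metadata objects).
today = date(2024, 1, 11)
backups = {
    "b1": (date(2024, 1, 1), {"meta_a", "meta_b"}),   # 10 days old
    "b2": (date(2024, 1, 2), {"meta_b"}),             # 9 days old
    "b3": (date(2024, 1, 10), {"meta_b", "meta_c"}),  # 1 day old
}

expired = expired_backups(backups, today)
orphans = orphaned_objects(backups, expired)
print(expired, orphans)
```

In this sketch only `meta_a` becomes orphaned, because `meta_b` is still referenced by backups that remain within the retention window.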
The traditional mark and sweep algorithm for garbage collection of metadata objects does not scale efficiently with “big data,” where the number of objects approaches billions. The “mark” query, which finds elements for deletion in the metadata catalog, performs an entire table/object scan and counts references to determine which objects should be deleted. Because of this full scan, the “mark” query is expensive in both computation and time, and does not scale.
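As a rough illustration of why the “mark” query is costly, the following sketch models the catalog as a dictionary and each backup as a set of object identifiers. This is an assumed, simplified representation, not the query used by any actual system. The point is that `mark` must visit every reference in every backup and every entry in the catalog on each run.

```python
from collections import Counter

def mark(catalog, backups):
    """Full scan: count references, then return every unreferenced object ID."""
    refs = Counter()
    for objs in backups.values():   # visits every reference in every backup
        refs.update(objs)
    # Visits every catalog entry; a Counter returns 0 for unseen keys.
    return {oid for oid in catalog if refs[oid] == 0}

def sweep(catalog, marked):
    """Delete every marked object from the catalog."""
    for oid in marked:
        del catalog[oid]

# Hypothetical catalog and remaining backups after retention has run.
catalog = {"m1": b"x", "m2": b"y", "m3": b"z"}
backups = {"b2": {"m2"}, "b3": {"m2", "m3"}}

marked = mark(catalog, backups)
sweep(catalog, marked)
print(sorted(catalog))
```

The work done by `mark` grows with the total size of the catalog and all reference lists, regardless of how few objects are actually orphaned.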
In addition, the “marking” phase in mark and sweep requires that backups be held off so that data used by subsequent backups is not inadvertently “swept” away (i.e., deleted) due to a bad “mark.” A bad “mark” can occur if a backup runs during the “marking” phase and references an object that has already been marked for deletion. While there are schemes that avoid the need to hold off backups, such as segregating the backup data into separate containers or indices, these methods add complexity and to some degree negatively affect performance.
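The hazard of a bad “mark” can be shown with a small sketch, again under an assumed set-based model with hypothetical identifiers. A backup that completes between the mark and the sweep deduplicates against an object that has already been marked, and the sweep then removes an object that is still in use.

```python
def mark(catalog, backups):
    """Return objects in the catalog that no current backup references."""
    referenced = set().union(*backups.values())
    return set(catalog) - referenced

catalog = {"m1", "m2"}
backups = {"b1": {"m2"}}

# Marking phase: m1 looks orphaned at this moment.
marked = mark(catalog, backups)

# A backup runs during marking and reuses the existing object m1.
backups["b2"] = {"m1", "m2"}

# Sweep phase acts on the stale mark set.
catalog -= marked

# Any object still referenced but missing from the catalog is now lost.
dangling = set().union(*backups.values()) - catalog
print(dangling)
```

Holding off backups during marking prevents this interleaving, which is the restriction described above.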