One of the most important aspects of computing systems is the data. As a result, many data owners ensure that their data is protected. This is achieved by regularly backing up the data. As backup technology continues to advance, backup data sets and backup applications become more sophisticated. For example, many backup sets allow the data to be restored at different points in time. This allows a data owner to understand what the data looked like at different dates.
Another benefit of backup technology is the ability to de-duplicate the backup data sets. An initial benefit of de-duplication is that the storage requirements are reduced. However, de-duplicated data sets introduce new problems that need to be solved.
Conventional approaches to de-duplication illustrate some of the problems associated with de-duplicated data sets. Data in de-duplicated data sets is typically broken into chunks, and each chunk is associated with a reference count and a fingerprint (e.g., a hash of the chunk) that uniquely identifies the chunk. The reference count of a data chunk generally identifies how many backups are associated with that data chunk. As backups are added to or removed from the backup data sets, the reference count is increased or decreased accordingly.
When the reference count of a data chunk reaches zero, the data chunk and its fingerprint can be removed from the backup data sets. Because unreferenced data chunks and their fingerprints are removed as soon as their reference counts reach zero, the need to trawl the backup data sets to identify chunks that are not part of any backup (e.g., to perform garbage collection) is significantly reduced.
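The reference-counting scheme described above can be sketched as follows. This is a minimal illustration and not a description of any particular product: the class name `RefCountedStore`, its method names, and the use of SHA-256 as the fingerprint are all assumptions made for the example.

```python
import hashlib


class RefCountedStore:
    """Toy de-duplicated store that keeps one reference count per chunk.

    Hypothetical sketch: names and the SHA-256 fingerprint are assumptions.
    """

    def __init__(self):
        self.chunks = {}     # fingerprint -> chunk bytes
        self.refcounts = {}  # fingerprint -> number of backups using the chunk
        self.backups = {}    # backup name -> ordered list of fingerprints

    @staticmethod
    def fingerprint(chunk):
        # A hash of the chunk serves as its fingerprint.
        return hashlib.sha256(chunk).hexdigest()

    def add_backup(self, name, chunks):
        if name in self.backups:
            raise ValueError(f"backup {name!r} already exists")
        fingerprints = []
        for chunk in chunks:
            fp = self.fingerprint(chunk)
            if fp not in self.chunks:
                self.chunks[fp] = chunk  # store only chunks not seen before
            fingerprints.append(fp)
        # Count each chunk once per backup, even if it repeats inside the backup.
        for fp in set(fingerprints):
            self.refcounts[fp] = self.refcounts.get(fp, 0) + 1
        self.backups[name] = fingerprints

    def remove_backup(self, name):
        fingerprints = self.backups.pop(name)
        for fp in set(fingerprints):
            self.refcounts[fp] -= 1
            if self.refcounts[fp] == 0:
                # No backup uses this chunk any more: drop chunk and fingerprint.
                del self.refcounts[fp]
                del self.chunks[fp]


store = RefCountedStore()
store.add_backup("monday", [b"alpha", b"beta"])
store.add_backup("tuesday", [b"alpha", b"gamma"])
store.remove_backup("monday")  # b"beta" is dropped; b"alpha" is still referenced
```

Note that both `add_backup` and `remove_backup` touch the reference count of every chunk in the backup, including chunks that did not change between backups.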
The requirement to maintain reference counts in conventional de-duplication systems, however, introduces processes that can take a significant amount of time. For example, the amount of data that changes from one backup to the next is usually a small percentage. Consequently, many backups often share much of the same data. The benefit of de-duplication is that only the changed data in the data set, which is usually a small percentage of the entire data set, needs to be backed up. The drawback of this system is that all of the reference counts (e.g., of all of the chunks or blocks in the backup data sets) need to be updated for each backup and/or each backup removal. When the backup data set is relatively small, this is not a large problem. When a backup data set includes millions or tens of millions of data chunks, however, the process of increasing or decreasing reference counts for all of those data chunks can take a very long time and can consume significant computing resources.
In another type of de-duplicated storage, reference counts are simply not maintained. In this example, the fingerprints of the chunks may be used to determine whether a data chunk is already present in the backup data sets, but there is no need to update any reference count. The problem with this system is that garbage collection is required to identify and remove data chunks that are not referenced by any backup.
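A minimal sketch of this second approach, under stated assumptions, is shown below. The class name `FingerprintStore`, the per-backup list of fingerprints it keeps, and the use of SHA-256 are hypothetical choices made for illustration only.

```python
import hashlib


class FingerprintStore:
    """Toy de-duplicated store with no reference counts.

    Hypothetical sketch: names and the SHA-256 fingerprint are assumptions.
    """

    def __init__(self):
        self.chunks = {}   # fingerprint -> chunk bytes
        self.backups = {}  # backup name -> list of fingerprints

    def add_backup(self, name, chunks):
        fingerprints = []
        new_chunks = 0
        for chunk in chunks:
            fp = hashlib.sha256(chunk).hexdigest()
            # The fingerprint alone tells us whether the chunk is already stored.
            if fp not in self.chunks:
                self.chunks[fp] = chunk
                new_chunks += 1
            fingerprints.append(fp)
        self.backups[name] = fingerprints
        return new_chunks  # number of chunks actually written

    def remove_backup(self, name):
        # Fast: only the backup's own record is dropped. Chunks that are no
        # longer referenced stay in storage until garbage collection runs.
        del self.backups[name]


store = FingerprintStore()
store.add_backup("monday", [b"alpha", b"beta"])
store.add_backup("tuesday", [b"alpha", b"gamma"])
store.remove_backup("monday")  # b"beta" remains stored but unreferenced
```

Adding and removing a backup does no per-chunk bookkeeping beyond the fingerprint lookup, which is why unreferenced chunks accumulate until a separate cleanup pass is performed.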
The garbage collection process iterates through all manifest files (potentially thousands of manifest files) in the storage. Each manifest file is associated with a backup data set, and each manifest file lists the data chunks associated with that backup data set. By processing all of the manifest files, the garbage collection process identifies and lists all data chunks referred to in each of the manifest files. Subsequently, all data chunks (often numbering in the millions) that exist in the storage are identified and listed. In one example, the garbage collection process then removes all manifest files that refer to one or more data chunks that are not present in storage. In addition, all data chunks that are not referred to by any of the manifest files are removed from the de-duplicated storage. This type of de-duplicated storage system can be very fast when storing a backup in, or removing a backup from, storage. However, the garbage collection process can take several hours or days to perform.
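The garbage collection steps described above can be sketched as a single pass over in-memory structures. This is an illustrative assumption only: the function name `collect_garbage` and the representation of manifest files and chunk storage as Python dictionaries are hypothetical, and a real system would read manifest files and chunk containers from disk.

```python
def collect_garbage(manifests, chunks):
    """Remove broken manifests and unreferenced chunks, in place.

    manifests: dict mapping manifest name -> list of chunk fingerprints
    chunks:    dict mapping chunk fingerprint -> chunk bytes
    Returns (removed manifest names, removed chunk fingerprints).
    Hypothetical sketch: the dict-based layout is an assumption.
    """
    # Step 1: list every chunk referred to by any manifest.
    referenced = set()
    for fingerprints in manifests.values():
        referenced.update(fingerprints)

    # Step 2: list every chunk that exists in storage.
    stored = set(chunks)

    # Step 3: remove manifests that refer to a chunk missing from storage.
    broken = sorted(
        name
        for name, fingerprints in manifests.items()
        if any(fp not in stored for fp in fingerprints)
    )
    for name in broken:
        del manifests[name]

    # Step 4: remove chunks that no manifest refers to. The referenced set was
    # built before step 3, so chunks used only by a broken manifest are kept
    # in this pass and would be reclaimed by a later pass.
    orphans = sorted(stored - referenced)
    for fp in orphans:
        del chunks[fp]

    return broken, orphans


manifests = {"monday": ["f1", "f2"], "tuesday": ["f2", "f9"]}
chunks = {"f1": b"alpha", "f2": b"beta", "f3": b"gamma"}
removed_manifests, removed_chunks = collect_garbage(manifests, chunks)
```

Both listing steps visit every manifest entry and every stored chunk, so the running time grows with the total size of the storage rather than with the size of the most recent change.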
Systems and methods are needed that can reduce the time and computing resources associated with maintaining reference counts and/or performing garbage collection.