Deduplicated data systems are often able to reduce the amount of space required to store files by recognizing redundant data patterns. For example, a deduplicated data system may reduce the amount of space required to store similar files by dividing the files into data segments and storing only unique data segments. In this example, each deduplicated file may simply consist of a list of references to the data segments that make up the file.
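The segment-based storage described above may be illustrated with a minimal sketch. The fixed segment size, the use of SHA-256 fingerprints as segment identifiers, and the names `store_file`, `read_file`, and `segment_store` are assumptions made for illustration only and are not taken from any particular deduplicated data system.

```python
import hashlib

# Hypothetical segment size, kept tiny so the example is easy to follow.
SEGMENT_SIZE = 4


def store_file(data: bytes, segment_store: dict) -> list:
    """Split data into fixed-size segments, store only unseen segments,
    and return the file as a list of segment fingerprints."""
    recipe = []
    for offset in range(0, len(data), SEGMENT_SIZE):
        segment = data[offset:offset + SEGMENT_SIZE]
        fingerprint = hashlib.sha256(segment).hexdigest()
        # A segment that is already present is not stored a second time.
        segment_store.setdefault(fingerprint, segment)
        recipe.append(fingerprint)
    return recipe


def read_file(recipe: list, segment_store: dict) -> bytes:
    """Rebuild a file by concatenating the segments its list refers to."""
    return b"".join(segment_store[fingerprint] for fingerprint in recipe)


segment_store = {}
file_a = store_file(b"AAAABBBBCCCC", segment_store)
file_b = store_file(b"AAAABBBBDDDD", segment_store)
```

In this sketch the two files share their first two segments, so the store holds four unique segments rather than six.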
While conventional deduplicated data systems may reduce the space required to store files, the mechanisms used by such conventional systems to manage deduplicated data may present unwanted limitations. For example, since more than one file may reference any given data segment, the data segments that make up a file cannot simply all be removed when the file is deleted. In order to safely delete data segments, a deduplicated data system must distinguish between referenced and unreferenced data segments.
In some cases, conventional deduplicated data systems may use bilateral referencing systems in order to ensure that data segments are not prematurely removed. For example, each file in a conventional deduplicated data system may include a list of data segments that make up the file. Likewise, each data segment within the deduplicated data system may maintain a list that identifies each file within the system that references the data segment. The deduplicated data system may use the lists maintained by both the files and the data segments to identify unreferenced data segments (i.e., data segments that are no longer referenced by any of the files in the deduplicated data system) that may be removed from the system.
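A bilateral referencing system of the kind described above may be sketched as two mappings that are kept in step with one another. The class name `BilateralStore`, its method names, and the segment identifiers are hypothetical and serve only to illustrate the two referential lists.

```python
class BilateralStore:
    """Illustrative model of bilateral referencing (all names hypothetical)."""

    def __init__(self):
        self.file_segments = {}  # file name -> list of segment identifiers
        self.segment_files = {}  # segment identifier -> set of file names

    def add_file(self, name, segment_ids):
        # The list maintained by the file.
        self.file_segments[name] = list(segment_ids)
        # The list maintained by each data segment the file references.
        for segment_id in segment_ids:
            self.segment_files.setdefault(segment_id, set()).add(name)

    def delete_file(self, name):
        # Removing a file means touching every segment it referenced.
        for segment_id in self.file_segments.pop(name):
            self.segment_files[segment_id].discard(name)

    def unreferenced_segments(self):
        # A segment whose list of referencing files is empty may be removed.
        return {
            segment_id
            for segment_id, files in self.segment_files.items()
            if not files
        }


store = BilateralStore()
store.add_file("a.txt", ["s1", "s2"])
store.add_file("b.txt", ["s2", "s3"])
store.delete_file("a.txt")
```

After "a.txt" is deleted in this sketch, segment "s1" is unreferenced, while "s2" remains referenced by "b.txt" and must be retained.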
Unfortunately, the bilateral referencing systems used by many conventional deduplicated data systems suffer from a number of deficiencies. For example, when a file in a conventional deduplicated data system is updated, the system may need to update both the referential list maintained by the file and the referential list maintained by each data segment referenced by the file. The process of creating and updating two referential lists may be both time consuming and resource intensive.
In other examples, conventional deduplicated data systems may use mark-and-sweep systems in order to ensure that data segments are not prematurely removed. For example, a mark-and-sweep system may check each data segment to see if that data segment is referenced by any file in the deduplicated data system. In this example, if the mark-and-sweep system finds a file that includes the data segment, the mark-and-sweep system may mark the data segment as referenced. The mark-and-sweep system may then sweep the deduplicated data system for unmarked data segments and delete the unmarked data segments. Unfortunately, a brute-force approach of checking each data segment against the files in the system may also be time consuming and resource intensive. Accordingly, the instant disclosure identifies a need for efficiently marking and sweeping unreferenced data segments in deduplicated data systems.
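The brute-force mark-and-sweep approach described above may be sketched as follows. The function name `mark_and_sweep`, the dictionary-based representation of files and data segments, and the sample identifiers are assumptions made for illustration; the sketch shows the conventional approach, not the improved approach for which the need is identified.

```python
def mark_and_sweep(files: dict, segments: dict) -> list:
    """Brute-force mark-and-sweep over a toy deduplicated store.

    files:    file name -> list of segment identifiers (hypothetical layout)
    segments: segment identifier -> segment data
    Returns the identifiers of the segments that were deleted.
    """
    marked = set()

    # Mark phase: for every segment, scan the files until one references it.
    for segment_id in segments:
        for segment_list in files.values():
            if segment_id in segment_list:
                marked.add(segment_id)
                break

    # Sweep phase: delete every segment that was never marked.
    swept = [segment_id for segment_id in segments if segment_id not in marked]
    for segment_id in swept:
        del segments[segment_id]
    return swept


files = {"a.txt": ["s1", "s2"], "b.txt": ["s2"]}
segments = {"s1": b"AAAA", "s2": b"BBBB", "s3": b"CCCC"}
removed = mark_and_sweep(files, segments)
```

The nested loops in the mark phase make the cost of this sketch grow with the number of data segments multiplied by the total length of the files' segment lists, which illustrates why checking each data segment individually may be time consuming and resource intensive.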