Deduplicated data systems are often able to reduce the amount of space required to store files by recognizing redundant data patterns. For example, a deduplicated data system may reduce the amount of space required to store similar files by dividing the files into data segments and storing only one copy of each unique data segment. In this example, each deduplicated file may simply consist of a list of references to the data segments that make up the file.
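By way of illustration only, the segment-based storage described above may be sketched as follows. The sketch assumes fixed-size segments and SHA-256 fingerprints as segment identifiers; the names `SEGMENT_SIZE`, `store_file`, `read_file`, and `segment_store` are hypothetical and are not drawn from any particular deduplicated data system.

```python
import hashlib
from typing import Dict, List

# Assumption: fixed-size segments, kept tiny here so the example is easy to follow.
SEGMENT_SIZE = 4


def store_file(data: bytes, segment_store: Dict[str, bytes]) -> List[str]:
    """Split data into segments, store only unseen segments, and
    return the ordered list of segment identifiers for the file."""
    segment_ids = []
    for offset in range(0, len(data), SEGMENT_SIZE):
        segment = data[offset:offset + SEGMENT_SIZE]
        # Assumption: a SHA-256 digest identifies a segment's contents.
        fingerprint = hashlib.sha256(segment).hexdigest()
        # Store the segment only if an identical one is not already present.
        segment_store.setdefault(fingerprint, segment)
        segment_ids.append(fingerprint)
    return segment_ids


def read_file(segment_ids: List[str], segment_store: Dict[str, bytes]) -> bytes:
    """Rebuild a file's contents from its list of segment identifiers."""
    return b"".join(segment_store[segment_id] for segment_id in segment_ids)


segment_store: Dict[str, bytes] = {}
file_a = store_file(b"AAAABBBBCCCC", segment_store)
file_b = store_file(b"AAAABBBBDDDD", segment_store)
```

In this sketch the two files share their first two segments, so the store holds four segments rather than six, and each file is represented only by its list of segment identifiers.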
While conventional deduplicated data systems may reduce the space required to store files, the mechanisms used by such conventional systems to manage deduplicated data may present unwanted limitations. For example, since more than one file may reference any given data segment, the data segments that make up a file cannot simply all be removed when the file is deleted. In order to safely delete data segments, a deduplicated data system must distinguish between referenced and unreferenced data segments.
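As a minimal sketch of this distinction, the following example deletes a file and then removes only those data segments that no remaining file references. It assumes a simple in-memory mapping of file names to segment identifiers and a mark-and-sweep style pass; the names `delete_file`, `files`, and `segment_store` are hypothetical.

```python
from typing import Dict, List


def delete_file(name: str,
                files: Dict[str, List[str]],
                segment_store: Dict[str, bytes]) -> None:
    """Delete a file, then remove only the segments left unreferenced."""
    del files[name]
    # Mark: collect every segment identifier still referenced by any file.
    referenced = {seg_id for seg_ids in files.values() for seg_id in seg_ids}
    # Sweep: remove segments that no remaining file references.
    for seg_id in list(segment_store):
        if seg_id not in referenced:
            del segment_store[seg_id]


segment_store = {"s1": b"AAAA", "s2": b"BBBB", "s3": b"CCCC"}
files = {"a.txt": ["s1", "s2"], "b.txt": ["s2", "s3"]}

delete_file("a.txt", files, segment_store)
```

After the deletion, segment `s1` is removed because only the deleted file referenced it, while segment `s2` is retained because the remaining file still references it.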
Unfortunately, in some cases a newly added referenced data segment may temporarily appear unreferenced and may therefore be inappropriately deleted. For example, when a deduplicated data system receives a new file, the deduplicated data system may first receive the data segments that make up the file and then receive the file itself (e.g., the deduplicated data system may require that the data segments exist before the file can reference them). If a garbage-collection subsystem of the deduplicated data system examines the new data segments before the new file itself has been added, the subsystem may find no references to those data segments, and the deduplicated data system may delete them even though the new file is about to reference them.
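The ordering problem described above may be reproduced in a simplified, single-threaded sketch. The example assumes an in-memory store and a garbage-collection pass that treats any segment absent from every file's segment list as unreferenced; the names `collect_garbage`, `files`, and `segment_store` are hypothetical.

```python
from typing import Dict, List


def collect_garbage(files: Dict[str, List[str]],
                    segment_store: Dict[str, bytes]) -> List[str]:
    """Remove segments that no known file references; return their identifiers."""
    referenced = {seg_id for seg_ids in files.values() for seg_id in seg_ids}
    removed = [seg_id for seg_id in segment_store if seg_id not in referenced]
    for seg_id in removed:
        del segment_store[seg_id]
    return removed


segment_store = {"s1": b"AAAA"}
files = {"old.txt": ["s1"]}

# Step 1: the data segments of a new file arrive before the file itself.
segment_store["s2"] = b"BBBB"

# Step 2: garbage collection runs before the new file has been added,
# so the new segment appears unreferenced and is removed.
removed = collect_garbage(files, segment_store)

# Step 3: the new file arrives and references a segment that no longer exists.
files["new.txt"] = ["s1", "s2"]
missing = [seg_id for seg_id in files["new.txt"] if seg_id not in segment_store]
```

In this sketch both `removed` and `missing` contain the identifier of the new segment, showing how a segment that is about to be referenced may be deleted when garbage collection runs between the two steps of adding a file.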
In some cases, conventional deduplicated data systems may attach a temporary indicator to all newly added data segments in order to ensure that data segments are not prematurely removed. Unfortunately, this solution suffers from a number of deficiencies. For example, adding temporary indicators to all newly added data segments (and then subsequently removing the indicators when the corresponding files are added) may impose a significant performance overhead. Furthermore, this solution may be incompatible with some implementations of a garbage collection subsystem (such as a mark-and-sweep approach or a reference-count approach). Accordingly, the instant disclosure identifies a need for systems and methods for efficiently performing garbage-collection operations in deduplicated data systems.