Disaster recovery procedures for data management enable an organization to recover data after catastrophic events such as fires, floods, hurricanes, and other natural disasters that present a risk to locally stored data. To improve the reliability of disaster recovery and meet stringent recovery time objectives imposed by businesses, organizations are increasingly replicating backups to create an offsite copy of critical data. Deduplication and compression of replicated data sets reduce the storage and bandwidth requirements for offsite backup replication.
In deduplicated storage systems, multiple stored files or objects may contain, and thus reference, the same stored chunk of data. In such a scenario, the chunk of data is stored only once, and metadata is used to link the files or objects to the stored chunk. A stored chunk of data must not be deleted until all references to it have been deleted. However, once a chunk of data is no longer referenced, it should be deleted to reclaim storage space. This process is typically referred to as garbage collection (GC). Determining when there are no remaining references to a chunk of data is a primary task of the GC process. However, precisely determining the set of data that is no longer referenced can be complex and resource intensive, particularly for large data sets.
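For illustration only, the relationship between shared chunks, file metadata, and garbage collection can be sketched in a few lines of Python. This is a minimal toy model under stated assumptions, not the method of any particular system: the `DedupStore` class, its method names, the fixed 4-byte chunk size, and the use of SHA-256 fingerprints are all hypothetical choices made for the example.

```python
import hashlib


class DedupStore:
    """Toy in-memory deduplicated store; every name here is hypothetical."""

    def __init__(self):
        self.chunks = {}  # fingerprint -> chunk bytes, each stored only once
        self.files = {}   # file name -> ordered list of fingerprints (metadata)

    def write(self, name, data, chunk_size=4):
        """Split data into fixed-size chunks and store each unique chunk once."""
        recipe = []
        for i in range(0, len(data), chunk_size):
            chunk = data[i:i + chunk_size]
            fp = hashlib.sha256(chunk).hexdigest()
            self.chunks.setdefault(fp, chunk)  # no-op if chunk already stored
            recipe.append(fp)
        self.files[name] = recipe

    def read(self, name):
        """Reassemble a file by following its metadata to the stored chunks."""
        return b"".join(self.chunks[fp] for fp in self.files[name])

    def delete(self, name):
        """Remove only the file's metadata; chunks are left for GC."""
        del self.files[name]

    def collect_garbage(self):
        """Mark every chunk still referenced, then sweep the rest."""
        live = {fp for recipe in self.files.values() for fp in recipe}
        dead = [fp for fp in self.chunks if fp not in live]
        for fp in dead:
            del self.chunks[fp]
        return len(dead)


store = DedupStore()
store.write("a", b"AAAABBBBCCCC")
store.write("b", b"BBBBCCCCDDDD")  # shares two chunks with "a"
store.delete("a")
removed = store.collect_garbage()   # only the chunk unique to "a" is removed
```

In this sketch the mark step walks the metadata of every remaining file to build the set of live chunks. That full scan is what makes the approach simple, and it also hints at why identifying unreferenced data becomes resource intensive as the number of files and chunks grows.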