1. Field
This application relates generally to data storage, and more specifically to a system, article of manufacture, and method for distributed garbage collection in a dedupe storage network.
2. Related Art
It is noted that conflicts can arise when a garbage collection (GC) operation is running on a site while other sites in the dedupe storage network concurrently begin uploading data to said site. A conflict can also arise when the onsite starts downloading data from another site while GC is running. For example, the GC can be in a ‘data gathering’ state while a replication site is already uploading data. The replication site may not be able to complete the data upload before GC changes its state to ‘data deletion’. In another example, GC can be in a ‘data gathering’ state while the onsite is already downloading data. Accordingly, the onsite may not be able to complete the download before GC changes its state to ‘data deletion’. In the ‘data gathering’ state, GC can list all the unique chunks from the dedupe file system in Eraser DB, considering all of them as potential garbage chunks. The GC can then iterate over all the valid backup images and filter their data chunks out of Eraser DB. The chunks that remain in Eraser DB form the list of garbage and orphan chunks in the dedupe file system. In the examples above, the ongoing uploads and downloads have created new data chunks but have not yet created the metadata for the corresponding dedupe image. Accordingly, a GC in the ‘data gathering’ state considers these partially uploaded or downloaded chunks to be orphan chunks and deletes them from the system. To overcome this problem, the upload and download processes were changed.
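The ‘data gathering’ behavior described above can be illustrated with a minimal sketch. The sketch models Eraser DB as an in-memory set and the dedupe file system as a set of chunk identifiers; the function name gather_garbage, the chunk identifiers, and the image name are hypothetical and are used only for illustration, not as the actual implementation.

```python
def gather_garbage(all_chunks, valid_images):
    """Return the set of chunks left in a simulated Eraser DB.

    all_chunks   -- set of chunk IDs currently in the dedupe file system
    valid_images -- dict mapping image name -> set of chunk IDs it references
    (Both structures are assumptions made for this sketch.)
    """
    # Step 1: list every unique chunk as a potential garbage chunk.
    eraser_db = set(all_chunks)
    # Step 2: filter out the chunks referenced by each valid backup image.
    for image_chunks in valid_images.values():
        eraser_db -= image_chunks
    # Whatever remains is treated as garbage or orphan.
    return eraser_db


# Chunks on disk before any concurrent transfer; c3 is genuine garbage.
chunk_store = {"c1", "c2", "c3"}
images = {"backup_A": {"c1", "c2"}}

# A concurrent upload has written chunks c4 and c5, but the image
# metadata that references them has not been written yet.
chunk_store |= {"c4", "c5"}

garbage = gather_garbage(chunk_store, images)
print(sorted(garbage))
```

In this sketch the in-flight chunks c4 and c5 land in the garbage list alongside c3, because no valid image metadata references them at the time GC gathers its data.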
It is further noted that various conflicts can arise when a GC operation is running on a site and, at the same time, other sites in the dedupe storage network start uploading data to that site or the onsite starts downloading data from another site. For example, a replication site can upload dedupe file system specific metadata after GC has prepared its garbage chunk list in the Eraser database (DB). In this case, if the replication site wants to upload a chunk that is also included in Eraser DB, the outcome depends on whether the upload or the chunk deletion by GC happens first, and the race can result in backup image corruption. In another example, the onsite downloads dedupe file system specific metadata after GC has prepared its garbage chunk list in Eraser DB. In this case, the onsite never downloads a chunk that is already present locally in the dedupe file system. The data download process relies on the locally available copy of the data chunk for dedupe image creation. If the download process relies on a data chunk that is also part of Eraser DB, then garbage chunk deletion by GC will eventually corrupt the downloaded image. Both of these problems occur because the GC state machine is not visible to the upload and download processes. When GC is in the ‘data deletion’ state, the backup process gives new life to data chunks by adding a hardlink to the chunks. However, because the replication process is not aware of the GC state machine, it cannot give new life to garbage chunks.
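The difference between a GC-aware backup process and a GC-unaware replication process can be sketched as follows. The sketch simulates hardlink counts with a dictionary instead of a real file system, and it assumes that GC skips any chunk whose link count exceeds one; that skip rule, along with the names delete_garbage and image_is_intact and all chunk identifiers, is an assumption made for illustration and is not stated in the description above.

```python
def delete_garbage(chunk_links, eraser_db):
    """Simulate the 'data deletion' state.

    chunk_links -- dict mapping chunk ID -> simulated hardlink count
    eraser_db   -- set of chunk IDs that GC listed as garbage
    Assumption: a chunk with more than one link has been given
    new life and is skipped.
    """
    deleted = set()
    for chunk in sorted(eraser_db):
        if chunk_links.get(chunk, 0) > 1:
            continue  # rescued by an extra hardlink
        if chunk in chunk_links:
            del chunk_links[chunk]
            deleted.add(chunk)
    return deleted


def image_is_intact(image_chunks, chunk_links):
    """An image is intact only if every chunk it references still exists."""
    return all(chunk in chunk_links for chunk in image_chunks)


# Chunks on disk with one link each; GC has already listed c3 and c7.
chunk_links = {"c1": 1, "c2": 1, "c3": 1, "c7": 1}
eraser_db = {"c3", "c7"}

# Backup process (aware of GC state): reuses c7 and adds a hardlink.
backup_image = {"c1", "c7"}
chunk_links["c7"] += 1

# Replication upload (unaware of GC state): dedupe finds c3 locally,
# so only c6 is transferred and no hardlink is added to c3.
replicated_image = {"c3", "c6"}
chunk_links["c6"] = 1

deleted = delete_garbage(chunk_links, eraser_db)
backup_ok = image_is_intact(backup_image, chunk_links)
replica_ok = image_is_intact(replicated_image, chunk_links)
print(sorted(deleted), backup_ok, replica_ok)
```

Under these assumptions the backup image survives because its reused chunk was re-linked before deletion, while the replicated image loses c3 and becomes corrupt, which mirrors the conflict described above.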