The amount of data to be stored continues to grow. In particular, the size of applications and of the data generated therefrom is increasing. Moreover, systems and users back up multiple copies of a given set of data in order to maintain multiple versions. For example, snapshots of a given database stored in a server are copied and stored over time, thereby allowing a given version/snapshot of a set of data to be restored. Typically, much of the data remains the same across different snapshots. For example, if the data for a given user is backed up on a daily basis and the user updates only one of a number of files on a given day, the data in that file is the only data that has been modified. Accordingly, conventional backup operations share data across a number of different snapshots/versions to reduce the amount of data being stored.
Furthermore, in some backup systems, the data is not modifiable. In other words, if given data is updated, the updated data is written to a different location in the backup storage device (instead of modifying the data at the current location). Consequently, over time, all of the storage space in the backup storage device will become allocated, because blocks of data that are no longer active remain allocated until they are deleted therefrom. Therefore, garbage collection operations are performed within these backup systems to delete data that is no longer active/referenced, thereby reclaiming unused space in the backup storage device.
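The effect of non-modifiable storage can be illustrated with a minimal sketch. The class `AppendOnlyStore` and its methods are hypothetical names introduced only for illustration and do not describe any particular backup system; the sketch assumes one block per write.

```python
class AppendOnlyStore:
    """Hypothetical write-once block store: updates never overwrite in place."""

    def __init__(self):
        self.blocks = []   # every block ever written, in allocation order
        self.current = {}  # name -> index of the active block for that name

    def write(self, name, data):
        # An update is appended at a new location; the old block stays allocated.
        self.blocks.append(data)
        self.current[name] = len(self.blocks) - 1

    def inactive_blocks(self):
        # Blocks that are still allocated but no longer referenced.
        active = set(self.current.values())
        return [i for i in range(len(self.blocks)) if i not in active]


store = AppendOnlyStore()
store.write("a.txt", b"v1")
store.write("b.txt", b"v1")
store.write("a.txt", b"v2")  # supersedes block 0 without freeing it
```

In this sketch, block 0 remains allocated after the update even though nothing refers to it, which is the kind of space that a garbage collection operation would reclaim.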
A conventional approach for garbage collection of data includes a traditional mark-and-sweep garbage collection, wherein the data is periodically traversed. When traversing the data during a garbage collection operation, if the data is referenced, a tracking mechanism that includes a bit associated with the referenced data is set to indicate that the data is referenced. Accordingly, during a subsequent sweep of the data, data whose associated tracking bit is set is copied to a different location in the backup system. The range of the address space of the data that has been swept is then marked as unallocated, thereby allowing the backup system to overwrite this range of the address space with new data. There are commonly two ways of implementing the tracking mechanism. The first way is to store each tracking bit as part of the data that the bit is tracking. In such an implementation, during garbage collection operations, the system will typically require multiple accesses to the secondary storage to retrieve the tracking bits stored with the data in order to determine whether the data is to be deleted because such data is no longer referenced.
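The mark and sweep phases described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the implementation of any particular backup system: the names `mark`, `sweep`, `refs`, `location` and `next_free` are hypothetical, each block occupies one address, and the set `marked` stands in for the tracking bits.

```python
def mark(refs, roots):
    """Return the ids of all blocks reachable from the roots (tracking bit set)."""
    marked = set()
    stack = list(roots)
    while stack:
        block_id = stack.pop()
        if block_id not in marked:
            marked.add(block_id)
            stack.extend(refs[block_id])
    return marked


def sweep(location, marked, start, end, next_free):
    """Clean addresses [start, end): copy marked blocks out, drop the rest."""
    # sorted() builds a list, so the dict can be modified inside the loop.
    for block_id, address in sorted(location.items(), key=lambda kv: kv[1]):
        if start <= address < end:
            if block_id in marked:
                location[block_id] = next_free  # copy live block to a new location
                next_free += 1
            else:
                del location[block_id]  # unreferenced block is discarded
    return next_free  # addresses [start, end) are now unallocated


# Hypothetical example: "old" is no longer referenced by any other block.
refs = {"root": ["a", "b"], "a": [], "b": ["c"], "c": [], "old": []}
location = {"root": 0, "a": 1, "old": 2, "b": 3, "c": 4}

marked = mark(refs, ["root"])
next_free = sweep(location, marked, start=0, end=3, next_free=5)
```

After the sweep, no block occupies addresses 0 through 2, so that range can be overwritten with new data. Note that the mark phase visits every reachable block even though only a small range is cleaned.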
In an effort to reduce accesses to the secondary storage device, the second way of implementing the tracking mechanism is to use a separate data structure for the tracking bits. In particular, collecting these tracking bits into a separate data structure makes the tracking bits more manageable. Even so, the data structure can become very large as the amount of data within the backup storage device grows, because there is a one-to-one relationship between given data and a bit within the data structure. As the data structure continues to grow, typical systems will not have local memories that are large enough to store this structure. Accordingly, when there is insufficient space in the local memory, at least part of the data structure must be stored in a secondary storage device, such as a hard drive (external to local memory). Therefore, during garbage collection operations, the system has to perform a number of disk accesses to the hard drive to access this structure, and the operation slows down as the number of secondary storage accesses increases.
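The growth of the separate tracking structure can be estimated with a short calculation. The function `bitmap_bytes`, the 4 KiB block size and the 1 PiB capacity are assumptions chosen only for illustration; the sole property taken from the description above is that one bit is kept per block of data.

```python
def bitmap_bytes(stored_bytes, block_size):
    """Size in bytes of a bitmap holding one tracking bit per block."""
    blocks = -(-stored_bytes // block_size)  # ceiling division
    return -(-blocks // 8)                   # eight tracking bits per byte


# Assumed figures: 1 PiB of stored data divided into 4 KiB blocks.
size = bitmap_bytes(stored_bytes=2**50, block_size=4096)
size_gib = size / 2**30
```

Under these assumed figures the bitmap alone occupies 32 GiB, which illustrates how the structure can exceed the local memory of a typical system and spill to secondary storage.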
Moreover, certain conventional mark-and-sweep garbage collection operations perform a targeted cleaning of a segment of address space in the backup storage device. However, such approaches still require walking all of the blocks of data to determine which blocks are inactive and thus eligible for garbage collection, because determining whether a given block of data is referenced by any other block of data requires examining every block. Disadvantageously, such an approach is not feasible when the number of blocks of data becomes too great. For example, if a snapshot of a file system is a terabyte in size and such a snapshot is taken every 10 minutes, over time the walking of the blocks of data during the marking phase of the garbage collection operations will not have completed before another snapshot is stored within the backup system.
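The timing problem in the example above can be made concrete with a rough calculation. Only the one-terabyte snapshot size and the 10-minute interval come from the example; the 4 KiB block size and the walk rate of 100,000 blocks per second are assumed values, and `walk_seconds` is a hypothetical helper.

```python
def walk_seconds(total_bytes, block_size, blocks_per_second):
    """Time needed to visit every block once at a fixed walk rate."""
    blocks = total_bytes // block_size
    return blocks / blocks_per_second


SNAPSHOT_BYTES = 10**12      # one terabyte, from the example above
INTERVAL_SECONDS = 10 * 60   # one snapshot every 10 minutes

# Assumed values: 4 KiB blocks walked at 100,000 blocks per second.
elapsed = walk_seconds(SNAPSHOT_BYTES, block_size=4096, blocks_per_second=100_000)
overruns = elapsed > INTERVAL_SECONDS
```

With these assumed values, walking the blocks of even a single snapshot takes longer than the interval between snapshots, and the gap widens as more snapshots accumulate.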