Data storage utilization is continually increasing, causing the proliferation of storage systems in data centers. In particular, the size of applications and of the data generated therefrom is increasing. Moreover, systems/users are backing up multiple copies of a given set of data to maintain multiple versions. For example, snapshots of a given database stored in a server are copied and stored over time, thereby allowing a given version/snapshot of a set of data to be restored. Typically, much of the data remains the same across different snapshots. For example, if the data is backed up for a given user on a daily basis and the user updates only one of a number of files on a given day, the data in that file is the only data that has been modified. Accordingly, conventional backup operations share data across a number of different snapshots/versions to reduce the amount of data being stored.
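The sharing of data across snapshots can be illustrated with a minimal sketch. It assumes file-level sharing keyed by a SHA-256 content digest; the class `SnapshotStore` and its method names are hypothetical and are not taken from any particular backup system, which may share data at a different granularity.

```python
import hashlib


class SnapshotStore:
    """Hypothetical store that shares unchanged data across snapshots."""

    def __init__(self):
        self.blocks = {}     # content digest -> stored bytes (one copy each)
        self.snapshots = {}  # snapshot name -> {filename: content digest}

    def backup(self, name, files):
        """Record a snapshot, storing only content not already present."""
        manifest = {}
        for filename, data in files.items():
            digest = hashlib.sha256(data).hexdigest()
            # Unchanged content maps to an existing digest and is not re-stored.
            self.blocks.setdefault(digest, data)
            manifest[filename] = digest
        self.snapshots[name] = manifest

    def restore(self, name):
        """Rebuild the files of a given snapshot from the shared blocks."""
        return {f: self.blocks[d] for f, d in self.snapshots[name].items()}


# Daily backups in which only one of three files changes on the second day.
store = SnapshotStore()
monday = {"a.txt": b"alpha", "b.txt": b"bravo", "c.txt": b"charlie"}
tuesday = {**monday, "b.txt": b"bravo v2"}
store.backup("mon", monday)
store.backup("tue", tuesday)
stored_block_count = len(store.blocks)
```

In this sketch the two snapshots together reference six files, but only four distinct pieces of content are stored, and either snapshot can still be restored in full.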
Furthermore, in some backup systems, the data is not modifiable. In other words, if given data is updated, the updated data is written to a different location in the backup storage device (instead of modifying the data at the current location). Over time, all of the storage space in the backup storage device becomes allocated, because blocks of data that are no longer active remain allocated until they are deleted therefrom. Therefore, garbage collection operations are performed within these backup systems to delete data that is no longer active/referenced, thereby reclaiming unused space in the backup storage device.
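The interaction between non-modifiable storage and garbage collection can be sketched as follows. This is an illustrative model only: the class `AppendOnlyBackup` is hypothetical, and it assumes, for simplicity, that only the most recent version of each key is still referenced, whereas a real backup system would also keep blocks referenced by retained snapshots.

```python
class AppendOnlyBackup:
    """Hypothetical backup store in which written data is never modified."""

    def __init__(self):
        self.blocks = []  # location -> data; None once the space is reclaimed
        self.live = {}    # key -> location of the currently referenced version

    def write(self, key, data):
        """Write an update to a new location instead of overwriting in place."""
        self.blocks.append(data)
        self.live[key] = len(self.blocks) - 1

    def allocated(self):
        """Count locations that still hold data, whether referenced or not."""
        return sum(block is not None for block in self.blocks)

    def collect_garbage(self):
        """Reclaim every allocated location that is no longer referenced."""
        referenced = set(self.live.values())
        reclaimed = 0
        for location, data in enumerate(self.blocks):
            if data is not None and location not in referenced:
                self.blocks[location] = None
                reclaimed += 1
        return reclaimed


backup = AppendOnlyBackup()
backup.write("f", b"v1")
backup.write("f", b"v2")  # the first version stays allocated but is inactive
backup.write("g", b"x")
allocated_before = backup.allocated()
reclaimed = backup.collect_garbage()
allocated_after = backup.allocated()
```

Here the superseded version of `f` occupies space until `collect_garbage` runs; the work done by the collector grows with the number of allocated locations it has to examine, which is one plausible reason a collection pass can take a long time on a large device.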
Garbage collection (GC) is typically performed on a regular basis (e.g., weekly). When the garbage collection runtime (i.e., duration) is long (e.g., exceeding several days), users/customers call the manufacturers of these backup storage devices for support. By predicting these long garbage collection runtimes, backup storage system manufacturers can proactively notify customers so that the long runtimes can be prevented.