Data storage devices are subject to data loss from a variety of causes, such as disk failure, unintentional deletion, malicious software attacks, or natural disaster. A common practice to guard against data loss is to create backup copies of important data and store them at a remote storage location. In the event of data loss or corruption, the backup copies are used to restore the lost or corrupted data to a previous state.
In some cases, backup storage systems maintain multiple backup data sets for one or more client systems. For example, a client system may perform periodic backups of its data, thereby creating a backup data set at regularly scheduled intervals. Accordingly, at each interval, the client system sends, to a backup server, a backup data set that includes a copy of each data file that is being backed up. The backup server stores each backup data set as it arrives, and frees up space for new backup data by deleting older backup data sets.
Various strategies exist for managing and storing backup data. One approach involves taking regular full backups. In this approach, the backup server stores a complete set of data at each backup interval. A full backup set includes a copy of every data file on the client system, whether or not the data file has changed relative to any previously-created backup set. This approach is inefficient because most blocks in most data files do not change from one backup operation to the next. Therefore, storing multiple full backup sets results in a large amount of data duplication and, consequently, greater consumption of storage space.
An alternative approach to backing up data, which is more efficient than always making full backups, involves storing incremental backup data. An incremental backup is a backup of every file that has changed since the last backup. According to this approach, a full backup set is initially stored on the backup server. After that, incremental backups are generated and stored at each backup interval. Incremental backups consume less storage than full backups by reducing or eliminating redundant backup storage of unchanged data.
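The storage difference between the two approaches can be illustrated with a minimal sketch. The function names, file names, and sizes below are hypothetical, and change detection is done by direct comparison against the previous snapshot purely for illustration; deleted files are not handled.

```python
def full_backup(files):
    """Copy every file, whether or not it has changed."""
    return dict(files)


def incremental_backup(files, previous):
    """Copy only files whose content differs from the previous snapshot."""
    return {name: data for name, data in files.items()
            if previous.get(name) != data}


def stored_bytes(backup_sets):
    """Total bytes held across a list of backup sets."""
    return sum(len(data) for s in backup_sets for data in s.values())


# Three backup intervals; only "b.txt" changes after the first interval.
snapshots = [
    {"a.txt": b"A" * 1000, "b.txt": b"B" * 1000},
    {"a.txt": b"A" * 1000, "b.txt": b"C" * 1000},
    {"a.txt": b"A" * 1000, "b.txt": b"D" * 1000},
]

# Strategy 1: a full backup at every interval.
full_sets = [full_backup(s) for s in snapshots]

# Strategy 2: one initial full backup, then incrementals.
incr_sets = [full_backup(snapshots[0])]
for prev, cur in zip(snapshots, snapshots[1:]):
    incr_sets.append(incremental_backup(cur, prev))

full_total = stored_bytes(full_sets)  # 3 intervals x 2000 bytes = 6000
incr_total = stored_bytes(incr_sets)  # 2000 + 1000 + 1000 = 4000
```

In this toy case the unchanged file "a.txt" is stored three times under the full-backup strategy and only once under the incremental strategy.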
To identify the data that has changed from one backup set to the next, the backup server typically uses a hashing algorithm. Specifically, the client system sends a full data set to the backup server, regardless of whether data files within the set have changed since the last backup. The backup server hashes each chunk of the data as it arrives and compares the hash value produced for the chunk to the hash value produced for the same chunk in the most recent prior backup operation. If the hash value for a chunk matches the previous hash value for the same chunk, then the backup server discards the chunk. Otherwise, the backup server stores the chunk as part of the current incremental backup. To facilitate the hash comparisons, the backup server may use an index of previously-generated hash values. The chunks for which hash values are generated may vary from implementation to implementation. For example, some backup servers may generate hash values on a file-by-file basis.
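The hash-comparison flow described above might be sketched as follows. This is an illustrative sketch only: the `BackupServer` class, its method names, the use of SHA-256, and the keying of chunks by an identifier are all assumptions, not details taken from any particular backup system.

```python
import hashlib


class BackupServer:
    """Hypothetical server that keeps only chunks whose hash has changed."""

    def __init__(self):
        self.hash_index = {}   # chunk id -> digest from the most recent backup
        self.backup_sets = []  # one {chunk id: bytes} dict per backup operation

    def receive_backup(self, chunks):
        """Hash every arriving chunk; store it only if its hash changed."""
        stored = {}
        for chunk_id, data in chunks.items():
            digest = hashlib.sha256(data).hexdigest()
            if self.hash_index.get(chunk_id) == digest:
                continue  # hash matches the prior backup: discard the chunk
            stored[chunk_id] = data
            self.hash_index[chunk_id] = digest
        self.backup_sets.append(stored)
        return stored


server = BackupServer()
# The client sends its full data set at every interval.
first = server.receive_backup({"a": b"hello", "b": b"world"})
second = server.receive_backup({"a": b"hello", "b": b"world!"})
```

Here `first` contains both chunks, because the index starts empty, while `second` contains only chunk "b". Note that the server still receives and hashes chunk "a" on the second backup, even though it ends up discarding it.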
There are multiple disadvantages to hash-based approaches for identifying and storing incremental data. First, hashing algorithms may result in collisions. Collisions occur when the hashing algorithm generates the same hash value for two distinct pieces of data. If the hash value generated for a changed chunk is the same as the hash value produced by the same chunk prior to the change, then the backup server may mistakenly discard the changed chunk as redundant, resulting in a corrupted backup data set.
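The collision failure mode can be demonstrated with a deliberately weak hash function. The sketch below is illustrative: `weak_hash` (a byte sum modulo 256) and the chunk contents are hypothetical and chosen so that a collision is easy to produce; hash functions used in practice make collisions far less likely but cannot rule them out.

```python
def weak_hash(data: bytes) -> int:
    """Deliberately poor hash: byte sum modulo 256 (for illustration only)."""
    return sum(data) % 256


def is_redundant(new_chunk: bytes, previous_hash: int) -> bool:
    """Server-side check: treat the chunk as unchanged if the hashes match."""
    return weak_hash(new_chunk) == previous_hash


old_chunk = b"balance=12"
new_chunk = b"balance=21"  # same bytes in a different order, so the same sum

previous_hash = weak_hash(old_chunk)
wrongly_discarded = is_redundant(new_chunk, previous_hash)
```

Although `new_chunk` differs from `old_chunk`, `wrongly_discarded` is `True`, so a server relying on this comparison would drop the changed chunk and the backup would silently retain the stale content.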
Another disadvantage of hashing is that it is relatively resource intensive. The backup server typically needs to hash a very large data set to determine which data has changed. Furthermore, the backup server needs to store and query a large index of hash values to identify redundant data. Performing the hashing process in real time, as the data arrives from a client system, requires a great amount of processing power and I/O resources from the backup server, particularly when the data sets sent from the client system are large.
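A back-of-the-envelope estimate gives a sense of the index size involved. The figures below (a 10 TiB data set, 4 KiB chunks, 32-byte digests) are assumptions chosen for illustration, and the estimate counts only the digests themselves, not chunk identifiers or other index overhead.

```python
def hash_index_bytes(data_set_bytes: int, chunk_bytes: int, digest_bytes: int) -> int:
    """Bytes needed to hold one digest per chunk of the data set."""
    num_chunks = -(-data_set_bytes // chunk_bytes)  # ceiling division
    return num_chunks * digest_bytes


KIB = 2**10
GIB = 2**30
TIB = 2**40

# Assumed figures: 10 TiB of client data, 4 KiB chunks, 32-byte digests.
estimate = hash_index_bytes(10 * TIB, 4 * KIB, 32)
estimate_gib = estimate / GIB  # 80.0
```

Under these assumptions the digests alone occupy 80 GiB, and every byte of the 10 TiB data set must still be read and hashed on each backup operation.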
The approaches described in this section are approaches that could be pursued, but not necessarily approaches that have been previously conceived or pursued. Therefore, unless otherwise indicated, it should not be assumed that any of the approaches described in this section qualify as prior art merely by virtue of their inclusion in this section.