Many systems for backing up a set of user data first perform a full backup, copying the entire dataset to a storage device, and then perform subsequent “incremental” backup operations. These operations are “incremental” in that they copy only data that has been added to or changed in the dataset since the most recent backup, whether full or incremental. In other words, in each subsequent incremental backup, only modified or newly added data (blocks or files) is sent to the storage device for backup in the archive.
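The selection step of a conventional incremental backup can be sketched as follows. This is a minimal illustration only: the dataset is modeled as an in-memory mapping of file names to contents, change detection by SHA-256 digest is an assumption (real systems may compare timestamps or block maps instead), and the function names are hypothetical.

```python
import hashlib


def digest(data: bytes) -> str:
    """Content fingerprint used to detect changes (an assumed mechanism)."""
    return hashlib.sha256(data).hexdigest()


def snapshot_digests(dataset: dict) -> dict:
    """Record a fingerprint for every item present at backup time."""
    return {name: digest(data) for name, data in dataset.items()}


def full_backup(dataset: dict) -> dict:
    """A full backup copies every item in the dataset."""
    return dict(dataset)


def incremental_backup(dataset: dict, last_digests: dict) -> dict:
    """Copy only items that are new or changed since the previous backup."""
    return {
        name: data
        for name, data in dataset.items()
        if last_digests.get(name) != digest(data)
    }


# Hypothetical dataset: two files at the time of the full backup.
dataset = {"a.txt": b"alpha", "b.txt": b"beta"}
full = full_backup(dataset)
seen = snapshot_digests(dataset)

# One file is modified and one is added before the next backup.
dataset["b.txt"] = b"beta v2"
dataset["c.txt"] = b"gamma"
inc = incremental_backup(dataset, seen)
```

In this sketch the incremental backup holds only `b.txt` and `c.txt`, while the unchanged `a.txt` is not transferred again. Deleted items are not tracked here.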
One benefit of this type of conventional incremental backup is that the amount of data typically transferred during an incremental backup is far less than the amount that would be transferred in a full backup. This difference can be significant because a full backup of the dataset can be very time-consuming, especially if it is performed using remote online storage, such as a data cloud.
One technical disadvantage of the conventional incremental backup method described above is the possible threat to data integrity in the archive. In particular, recovering the dataset by creating an updated version of the full dataset requires data from all of the intervening incremental backups performed since the last full backup. In other words, reconstituting the full dataset requires data from the entire chain of backups, starting from the initial full backup and ending at the last incremental backup (closest in time to the point to which the data is to be recovered). If data from even one of the incremental backups is damaged, the dataset cannot be successfully recovered to any point in the chain after the damaged backup.
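The dependence of recovery on the whole chain can be illustrated with a short sketch. It assumes, purely for illustration, that each backup stores a checksum of its own contents and that restoration replays the chain in order, stopping at the first backup whose checksum no longer matches. The `Backup` class and the function names are hypothetical.

```python
import hashlib
from dataclasses import dataclass


@dataclass
class Backup:
    changes: dict   # item name -> contents stored in this backup
    checksum: str   # fingerprint recorded when the backup was written


def checksum(changes: dict) -> str:
    """Fingerprint of a backup's contents (an assumed integrity check)."""
    h = hashlib.sha256()
    for name in sorted(changes):
        h.update(name.encode() + b"\0" + changes[name] + b"\0")
    return h.hexdigest()


def make_backup(changes: dict) -> Backup:
    return Backup(dict(changes), checksum(changes))


def restore(chain: list):
    """Replay the chain in order.

    Returns (dataset, damaged_index). damaged_index is None when the whole
    chain is intact; otherwise the dataset reflects only the backups that
    precede the first damaged one.
    """
    dataset = {}
    for index, backup in enumerate(chain):
        if checksum(backup.changes) != backup.checksum:
            return dataset, index
        dataset.update(backup.changes)
    return dataset, None


# Hypothetical chain: one full backup followed by two incrementals.
chain = [
    make_backup({"a": b"1", "b": b"1"}),  # full backup
    make_backup({"a": b"2"}),             # incremental 1
    make_backup({"b": b"3"}),             # incremental 2
]
intact_state, intact_damage = restore(chain)

# Simulate damage to the first incremental after it was written.
chain[1].changes["a"] = b"corrupt"
partial_state, first_damaged = restore(chain)
```

With the chain intact, the restore reaches the latest state. After the first incremental is damaged, the restore stops at the full backup, even though the second incremental is itself undamaged.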
Moreover, it should be understood that the risk of interim backup data being damaged increases as the number of interim data backups increases. Since many data storage systems can have hundreds or even thousands of incremental backups, the potential risk to data integrity is severe. This risk is compounded by the possibility of damage not only to the incremental backups in the chain, but to the initial full backup as well, in which case it becomes virtually impossible to recover any of the data using standard methods.
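How quickly this risk grows can be estimated with a simple model. The sketch below assumes, as a simplification not stated in the source, that each backup in the chain is damaged independently with the same probability; the function name and the probability value used are hypothetical.

```python
def chain_intact_probability(p_damage: float, n_incrementals: int) -> float:
    """Probability that a full backup and all of its incrementals are intact.

    Assumes each of the (n_incrementals + 1) backups is damaged
    independently with probability p_damage (an illustrative assumption).
    """
    return (1.0 - p_damage) ** (n_incrementals + 1)


# Hypothetical per-backup damage probability of 0.1 percent.
short_chain = chain_intact_probability(0.001, 10)
long_chain = chain_intact_probability(0.001, 1000)
```

Under these assumptions, the chance that a chain of 1000 incrementals is fully intact is markedly lower than for a chain of 10, even though the per-backup risk is unchanged.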
Some conventional systems perform a periodic full backup of the dataset to reduce, to some extent, the risk of data loss due to accidental damage. However, as noted above, the large amount of data to be backed up and the limited speed of data communication channels make frequent full backups problematic, and such backups can impose a substantial burden on backup infrastructure and communication channels.
Accordingly, there is a need for a system and method for backing up data that reduces the risk that recovery will become impossible while at the same time reducing the need for periodic full backups.