Traditional backup systems may periodically create a full backup by capturing all allocated blocks (e.g., sectors or clusters) of a volume. Between full backups, a backup system may capture intermediate backups, referred to as incremental backups, that include blocks that have changed since the previous incremental or full backup. An incremental backup may be orders of magnitude smaller and faster than a full backup. Because of the relative efficiency of incremental backups, many enterprises would prefer to take only incremental backups after an initial base backup (a feature referred to as infinite incrementals).
Unfortunately, some traditional backup technologies are not designed to effectively deal with long (or even short) chains of incremental backups. For example, some traditional backup systems may restore a volume (or any other backed-up entity) by opening and reading each backup in the entire chain of backups, including each incremental backup and a base backup (i.e., a full backup). Other traditional backup systems may only support opening a single file at a time when restoring a volume and therefore may not be able to handle chains of backups.
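As a minimal sketch of the chain-replay restore described above, assume each backup is modeled as a mapping from block index to block contents. The names `Backup` and `restore_from_chain` are hypothetical and chosen only for illustration; they are not taken from any particular backup product. The sketch shows why restore cost grows with chain length: every backup in the chain must be opened and read.

```python
from typing import Dict, List

# Assumption for illustration: a backup maps block index -> block contents.
Backup = Dict[int, bytes]


def restore_from_chain(base: Backup, incrementals: List[Backup]) -> Backup:
    """Rebuild a volume by replaying the base backup and every incremental.

    Incrementals must be ordered oldest first, so that later changes
    overwrite earlier ones.
    """
    volume = dict(base)          # start from the full (base) backup
    for incremental in incrementals:
        volume.update(incremental)   # apply blocks changed in this incremental
    return volume


# Hypothetical three-block volume with a chain of three incrementals.
base = {0: b"A0", 1: b"B0", 2: b"C0"}
chain = [{1: b"B1"}, {2: b"C2"}, {1: b"B3"}]
restored = restore_from_chain(base, chain)
```

In this sketch, block 1 is read three times (once from the base and once from each of two incrementals) even though only its latest version is kept.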
Because of the challenges associated with handling chains of incremental backups, many backup systems may take full backups periodically (e.g., weekly or monthly) and may take incremental backups between the full backups. Some traditional backup systems may perform differencing (i.e., determining a difference between a previous backup of a protected system and a current state of the protected system) to determine which data needs to be transmitted from the protected system for backup. The backup system may then construct and transmit a delta stream describing how the current backup differs from the previous backup. In certain systems, differencing may only be performed on full backups. In other systems, differencing may be performed on all backups.
In systems where differencing is applied only to full backups, incremental backups may be created without differencing after a first base backup is created. Systems that use such an approach may be inefficient because they may transmit a significant amount of backup data twice: once in an incremental backup and a second time in the next full backup. Extra storage and bandwidth may be consumed while handling the duplicate data.
As noted, an alternative to differencing only full backups is to difference each backup, whether full or incremental. However, differencing an incremental backup against a first full backup may result in most of the full backup being discarded as a basis for subsequent differencing. The backup system may apply differencing only to the most recently saved version of the file, and as a result, a second full backup may be compared only to the incremental backup. Undesirably, differencing the second full backup against the incremental backup may result in retransmission of almost the entire second full backup.
An additional disadvantage of systems that perform differencing is the cost involved in performing the differencing. When processing a file (or other data unit) to prepare for differencing, a backup system may break the file up into blocks and may calculate a hash for each block. The backup system may compare a hash calculated for an old version of a block with a hash calculated for the current version of the block to determine whether data stored in the block has been modified. If the hashes do not match, the backup system may determine that the block has been modified.
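The hash comparison described above can be sketched as follows. This is an illustrative sketch only: the 4096-byte block size, the choice of SHA-256, and the function names `split_blocks`, `block_hashes`, and `changed_blocks` are assumptions, not details given in the description.

```python
import hashlib
from typing import List

BLOCK_SIZE = 4096  # assumed block size; real systems may use other sizes


def split_blocks(data: bytes, block_size: int = BLOCK_SIZE) -> List[bytes]:
    """Break the data into fixed-size blocks (the last block may be shorter)."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def block_hashes(data: bytes, block_size: int = BLOCK_SIZE) -> List[str]:
    """Calculate one hash per block (SHA-256 is an assumed choice)."""
    return [hashlib.sha256(block).hexdigest()
            for block in split_blocks(data, block_size)]


def changed_blocks(old_hashes: List[str], new_data: bytes,
                   block_size: int = BLOCK_SIZE) -> List[int]:
    """Return indices of blocks whose hash differs from the old version.

    Blocks beyond the end of the old version are treated as changed.
    """
    new_hashes = block_hashes(new_data, block_size)
    return [i for i, digest in enumerate(new_hashes)
            if i >= len(old_hashes) or old_hashes[i] != digest]


# Tiny example with 8-byte blocks: only the middle block is modified.
old_version = b"a" * 8 + b"b" * 8 + b"c" * 8
new_version = b"a" * 8 + b"X" * 8 + b"c" * 8
modified = changed_blocks(block_hashes(old_version, 8), new_version, 8)
```

Note that `changed_blocks` hashes every block of the new version, including the unmodified ones, which is the processing cost discussed here.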
The amount of processing involved in creating and comparing hashes to identify changes may be substantial. For example, if a backup system is backing up 100 gigabytes (“GB”) of data from a volume, the backup system may transfer all 100 GB over a network for the first full backup. For the next full backup, 100 GB may be read from the volume and broken up into blocks. The backup system may then calculate hashes for all the blocks and may compare them with hashes of blocks of the first full backup. If 99% of the blocks are the same and 1% of the blocks are different, only 1 GB may be sent to the backup system to create a new full backup. Despite the reduction in data transfer, the backup system may have consumed a significant amount of resources reading and processing all 100 GB of data from the volume. What is needed, therefore, is a more efficient system for creating backups.
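The arithmetic of the 100 GB example can be tallied in a few lines. The variable names are hypothetical, and the figures are the ones given in the example.

```python
volume_gb = 100        # size of the volume in the example
changed_percent = 1    # share of blocks that differ from the first full backup

# Every block must be read and hashed before changes can be identified.
read_and_hashed_gb = volume_gb

# Only the changed blocks are transmitted for the second full backup.
sent_gb = volume_gb * changed_percent / 100

# Data processed locally per unit of data actually transmitted.
processed_per_sent = read_and_hashed_gb / sent_gb
```

Under these figures, the system reads and hashes 100 GB in order to transmit 1 GB.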