1. Field of the Invention
Embodiments of the present invention generally relate to data protection and deduplication systems and, more particularly, to a method and apparatus for integrating data deduplication with block level incremental data backup.
2. Description of the Related Art
Small to large enterprises utilize block level incremental backup (BLIB) technologies to preserve valuable computer data in order to optimize storage utilization and concurrently reduce downtime during a backup process (i.e., a backup window). The BLIB is configured to maintain information associated with data blocks of files that may be stored in a volume or a backup (e.g., a file by file backup, a raw partition backup, a snapshot and/or the like). Such information may indicate one or more modified data blocks associated with a particular file since a previous backup (e.g., incremental backup, full backup and/or the like) to storage media (e.g., a hard disk array).
For example, the BLIB technologies may be used to identify one or more changed data blocks (i.e., data blocks having new content), one or more added data blocks as well as one or more deleted data blocks associated with the particular file. The BLIB technologies proceed to back up the file by copying the one or more changed data blocks and/or the one or more added data blocks as well as updating a block mapping to record the deletion of the one or more deleted data blocks. Sometimes, the one or more changed data blocks and/or the one or more added data blocks may include identical data (i.e., duplicate data blocks). Unfortunately, the BLIB technologies cannot identify the duplicate data blocks and, as a result, back up each and every changed block and/or added block.
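As an illustration only, the block comparison described above may be sketched as follows. The function names (`split_blocks`, `blib_diff`), the fixed block size and the in-memory byte strings are assumptions made for this sketch and do not describe any particular BLIB product. The sketch compares blocks by position, and shows that identical changed or added blocks are each selected for copying.

```python
from typing import Dict, List


def split_blocks(data: bytes, block_size: int) -> List[bytes]:
    """Split data into fixed-size blocks (the last block may be shorter)."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def blib_diff(prev: bytes, curr: bytes, block_size: int) -> Dict[str, List[int]]:
    """Classify block indices as changed, added or deleted since the previous backup.

    Hypothetical helper: blocks are compared position by position.
    """
    old = split_blocks(prev, block_size)
    new = split_blocks(curr, block_size)
    common = min(len(old), len(new))
    return {
        "changed": [i for i in range(common) if old[i] != new[i]],
        "added": list(range(common, len(new))),
        "deleted": list(range(common, len(old))),
    }


previous = b"AAAA" + b"BBBB" + b"CCCC"
current = b"XXXX" + b"BBBB" + b"XXXX" + b"XXXX"

diff = blib_diff(previous, current, block_size=4)
to_copy = diff["changed"] + diff["added"]

# Every changed or added block is copied, even though all three hold the same bytes.
copied_payloads = [split_blocks(current, 4)[i] for i in to_copy]
```

In this sketch, three blocks are copied although they carry a single distinct payload, which is the duplication that block level comparison alone does not detect.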
Because the BLIB technologies are unable to identify the duplicate data blocks, small to large enterprises also employ deduplication software (e.g., a Single Instance Storage (SIS) solution) to remove duplicate data from computer data (e.g., a volume, a file, a virtual machine image, backup data and/or the like). For example, the computer data may be a backup chain (e.g., a full backup and one or more incremental backups of a source volume that comprises a plurality of files). Storing duplicate data blocks within the backup chain wastes valuable computer resources (e.g., computer memory, processors and/or the like) and/or requires unnecessary storage operations.
The deduplication software reads a particular file in logical order, converts each and every logical data block of the file into one or more deduplication segments (e.g., SIS segments) and calculates a unique signature for each deduplication segment. The conversion is performed because of a size disparity between data blocks (e.g., logical and/or physical data blocks) and deduplication segments. These signatures are compared with pre-populated signatures for the computer data (e.g., files, volumes and/or the like). If a signature associated with a certain deduplication segment matches a pre-populated signature, then there is duplicate data within the computer data. Comparing each and every deduplication segment, however, consumes valuable computer resources. Especially during incremental backups, many of the deduplication segments are already present in the computer data and are very likely to match signatures for the computer data. Therefore, there is no need to compare each and every deduplication segment if only a few are likely to be backed up.
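The signature comparison described above may be sketched, under stated assumptions, as follows. SHA-256 is used here only as an example signature function, and the names `segment_signatures` and `find_new_segments`, the segment size and the in-memory set of pre-populated signatures are hypothetical. The sketch makes visible that every segment is hashed and looked up, including segments that are already stored.

```python
import hashlib
from typing import List, Set


def segment_signatures(data: bytes, segment_size: int) -> List[str]:
    """Compute one signature per fixed-size deduplication segment, in logical order."""
    return [
        hashlib.sha256(data[i:i + segment_size]).hexdigest()
        for i in range(0, len(data), segment_size)
    ]


def find_new_segments(data: bytes, segment_size: int, known: Set[str]) -> List[int]:
    """Return indices of segments whose signatures are not in the pre-populated set.

    Every segment is hashed and compared, whether or not it was modified.
    """
    return [
        index
        for index, signature in enumerate(segment_signatures(data, segment_size))
        if signature not in known
    ]


# Pre-populated signatures from an earlier backup (assumed to hold only b"aaaa").
known_signatures = set(segment_signatures(b"aaaa", 4))

file_data = b"aaaa" + b"bbbb" + b"aaaa"
new_segments = find_new_segments(file_data, 4, known_signatures)
comparisons = len(segment_signatures(file_data, 4))
```

Here three segments are hashed and compared although only one of them needs to be stored.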
In other words, the deduplication software is unable to identify which data blocks of the particular file are modified. Current attempts to integrate BLIB technologies with the deduplication software are replete with problems. For example, there is a size disparity between a deduplication segment and a data block as used in BLIB. Each size is configured by a respective administrator and, thus, the two sizes are most likely different. Some integration attempts do not operate at a deduplication segment (i.e., SIS segment) granularity but at a physical block level granularity. Signatures are computed for changed/added physical data blocks during each incremental backup (e.g., in a changed data block stream). Unfortunately, these signatures will most likely never match signatures that are computed at the deduplication segment granularity due to the size disparity between physical data blocks and the deduplication segments.
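The mismatch between the two granularities may be illustrated with the following sketch. The block size of ten bytes, the segment size of four bytes, the use of SHA-256 and the helper name `chunk_signatures` are assumptions chosen for the example. The same data is hashed once per physical block and once per deduplication segment, and the two signature sets are then compared.

```python
import hashlib
from typing import Set


def chunk_signatures(data: bytes, chunk_size: int) -> Set[str]:
    """Return the set of signatures computed over fixed-size chunks of data."""
    return {
        hashlib.sha256(data[i:i + chunk_size]).hexdigest()
        for i in range(0, len(data), chunk_size)
    }


BLOCK_SIZE = 10    # hypothetical physical block size in bytes
SEGMENT_SIZE = 4   # hypothetical deduplication segment size in bytes

data = bytes(range(40))  # identical content hashed at both granularities

block_level = chunk_signatures(data, BLOCK_SIZE)
segment_level = chunk_signatures(data, SEGMENT_SIZE)

# Signatures over 10-byte inputs and over 4-byte inputs cover different
# byte ranges, so none of them coincide even though the data is the same.
shared = block_level & segment_level
```

Because each signature covers a different span of bytes, the block level signatures find no counterpart among the segment level signatures in this sketch.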
Moreover, adding, deleting and/or changing physical data blocks causes an alignment adjustment between each and every logical data block of the file and the deduplication segments due to the size disparity. Adding a data block to an end of the file results in a different deduplication segment to logical data block correspondence. For example, a deduplication segment may be two-fifths of a data block size. When the data block is added to the end of the file, at least three deduplication segments are re-aligned to accommodate the added data block. As such, one or more deduplication segments may now be aligned with a logical data block that is adjacent (e.g., previous) to the added data block. In addition, deleting a data block from the middle of the file causes a ripple effect to the alignment between adjoining logical data blocks and each and every corresponding deduplication segment. The current attempts to integrate data deduplication and block level incremental backup do not address such alignment adjustments caused by the size disparity between the deduplication segment and the data block.
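The alignment effects described above may be worked through with small numbers. In the following sketch, the block size of five bytes and the segment size of two bytes (two-fifths of the block size) are assumptions, as are the helper names and the use of SHA-256 as the signature function. The first part lists the segments that overlap a block appended to a one-block file, and the second part deletes a block from the middle of a four-block file and counts how many segment signatures still match.

```python
import hashlib
from typing import List

BLOCK_SIZE = 5    # hypothetical block size in bytes
SEGMENT_SIZE = 2  # two-fifths of the block size, as in the example above


def segments_touching_block(block_index: int, file_size: int) -> List[int]:
    """Return indices of the deduplication segments that overlap the given block."""
    start = block_index * BLOCK_SIZE
    end = min(start + BLOCK_SIZE, file_size)
    return list(range(start // SEGMENT_SIZE, (end - 1) // SEGMENT_SIZE + 1))


def signatures(data: bytes) -> List[str]:
    """Compute one signature per fixed-size segment, in logical order."""
    return [
        hashlib.sha256(data[i:i + SEGMENT_SIZE]).hexdigest()
        for i in range(0, len(data), SEGMENT_SIZE)
    ]


# Part 1: a second block is appended to a one-block file (5 bytes -> 10 bytes).
grown_size = 2 * BLOCK_SIZE
touches_added = segments_touching_block(1, grown_size)     # segments 2, 3, 4
touches_previous = segments_touching_block(0, grown_size)  # segments 0, 1, 2
straddling = sorted(set(touches_added) & set(touches_previous))

# Part 2: the second block of a four-block file is deleted from the middle.
original = bytes(range(4 * BLOCK_SIZE))
shortened = original[:BLOCK_SIZE] + original[2 * BLOCK_SIZE:]

old_signatures = set(signatures(original))
new_signatures = signatures(shortened)
still_matching = [i for i, s in enumerate(new_signatures) if s in old_signatures]
```

In the first part, three segments overlap the added block and one of them also overlaps the previous block. In the second part, only the two segments ahead of the deletion point keep their signatures, although every byte after the deleted block is unchanged.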
Duplicate data blocks are repeatedly backed up due to alignment adjustments caused by adding, changing and/or deleting data blocks. The deduplication software may determine that a particular file is modified but is not able to determine which data blocks are modified. As a result, the deduplication software processes an entire file even if only one or more data blocks of the file have been modified. Computing and comparing signatures for unmodified blocks increases deduplication time and computer resource utilization. Furthermore, employing a physical block level granularity for signature computation results in very few matches and wastes valuable computer resources.
Therefore, there is a need in the art for a method and apparatus for efficiently integrating data deduplication with block level incremental backup.