1. Field of the Invention
Embodiments of the present invention generally relate to data de-duplication systems and, more particularly, to a method and apparatus for optimizing a de-duplication rate for backup streams.
2. Description of the Related Art
Performing regular backups of mission-critical data is a necessity for organizations of all sizes to prevent data loss. Often, several copies of the same data are backed up over a storage lifecycle. As a result, a single file may be redundantly stored numerous times, which wastes storage space and network bandwidth. For example, a hundred users may receive a particular email with an attached one megabyte (MB) file. If such an email is backed up, then each and every copy of the email is stored, requiring a hundred MB of storage space. To optimally balance storage space and network bandwidth requirements, data de-duplication techniques are employed.
Generally, data de-duplication techniques identify redundant data in a backup stream (e.g., image-based or volume-based backups) and pass only unique data to a storage device. Conventional data de-duplication techniques may employ various segmentation algorithms, such as a fixed-size algorithm, a variable-size algorithm and/or the like, in different approaches (e.g., hash-based or content-aware approaches). For example, in fixed-size de-duplication techniques, the backup stream is segmented into data blocks of a fixed size. Further, each data block is assigned a unique value (e.g., a hash value computed with a hashing algorithm such as SHA-1 or MD5). A new hash value of each data block of the backup stream is compared to the hash values of data that is already stored on the storage device. If the new hash value does not match any stored hash value, then the corresponding data block of the backup stream is stored on the storage device and the new hash value is added to a lookup table. If the new hash value matches a hash value that is previously stored in the lookup table, then the data block is not backed up, thereby eliminating the redundant data. Further, the composition of the corresponding data block may be recorded on the storage device to reconstruct a data file during a subsequent restoration.
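The fixed-size, hash-based technique described above may be sketched as follows. This is an illustrative example only; the block size, the choice of SHA-1, and the use of an in-memory dictionary as the lookup table are assumptions for the sketch, not features of any particular system.

```python
import hashlib

BLOCK_SIZE = 4096  # assumed fixed block size for illustration


def deduplicate(stream: bytes, store: dict) -> list:
    """Segment a backup stream into fixed-size blocks, storing only
    blocks whose hash is not already in the lookup table (store).
    Returns the recipe (ordered list of hashes) that records the
    composition of the stream for later restoration."""
    recipe = []
    for offset in range(0, len(stream), BLOCK_SIZE):
        block = stream[offset:offset + BLOCK_SIZE]
        digest = hashlib.sha1(block).hexdigest()
        if digest not in store:      # unique block: store it
            store[digest] = block
        recipe.append(digest)        # redundant block: record hash only
    return recipe


def restore(recipe: list, store: dict) -> bytes:
    """Reconstruct the original stream from its recorded composition."""
    return b"".join(store[digest] for digest in recipe)
```

A stream containing two identical 4 KB blocks followed by one distinct block would thus consume only two blocks of storage, with the recipe preserving the original order.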
Typically, in image-based or volume-based backups, the backup stream is sent as a contiguous data stream. Further, if a fixed-size de-duplication technique is utilized, then the backup stream is segmented into fixed-size data blocks in the exact sequence of the contiguous data stream, and the redundant data is identified from the sequence of the data blocks. In one scenario, where different computing devices are interconnected through a network (e.g., LAN, WAN and/or the like), an identical file may be stored on more than one storage device.
Due to little or no organizational similarity among the storage devices, the locations of identical files on different storage devices vary. As such, backup streams from different storage devices may differ. For example, a particular file may be an operating system file that is identical across a plurality of clients. However, the particular file may be stored at a different location within each partition. The de-duplication techniques cannot recognize identical data files at different locations within the backup streams. Hence, the fixed-size de-duplication technique cannot de-duplicate the backup stream efficiently, which reduces the de-duplication rate. Furthermore, various de-duplication techniques utilize details of file boundaries in the backup stream to remove redundant data files. Because an image-based backup is presented as one large file, such de-duplication techniques cannot determine data file boundaries, which reduces the de-duplication rate.
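The alignment problem described above may be illustrated as follows. In this hypothetical example (block size and data contents are assumptions for the sketch), the same file appears in two backup images at offsets that differ by a single byte; with fixed-size blocking, the two images then share no block hashes at all, so none of the redundant data is eliminated.

```python
import hashlib

BLOCK_SIZE = 4096  # assumed fixed block size for illustration


def block_hashes(stream: bytes) -> set:
    """Return the set of hashes of the fixed-size blocks of a stream."""
    return {hashlib.sha1(stream[i:i + BLOCK_SIZE]).hexdigest()
            for i in range(0, len(stream), BLOCK_SIZE)}


# A hypothetical 16 KB file that is identical on two clients.
file_data = bytes(range(256)) * 64

# In image A the file begins at offset 0; in image B the same file
# begins one byte later, e.g., due to differing partition layouts.
image_a = file_data
image_b = b"\x00" + file_data

# Because every fixed-size block of image B straddles the one-byte
# shift, no block of image B hashes to a block of image A.
shared = block_hashes(image_a) & block_hashes(image_b)
```

Here `shared` is empty: even though the underlying file is identical, the fixed-size technique stores every shifted block again, which is the reduction in de-duplication rate noted above.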
Therefore, there is a need in the art for a method and apparatus for optimizing a de-duplication rate for backup streams.