Digital data storage systems can utilize various techniques to minimize the amount of storage that is required for storing data. Such storage minimization techniques not only speed up data processing but can also reduce redundancy and lower storage costs.
One such storage optimization technology is data deduplication. Data deduplication employs a scheme in which the same block of data (or single segment) is simultaneously referred to by multiple pointers in different sets of metadata. In this manner, the block of data that is common to all data sets is stored only once, and duplicate copies of repeating data are eliminated.
A chunk-level data deduplication system is one that segments an incoming data set or input data stream into multiple data chunks. The incoming data set might be, for example, backup files in a backup environment. As another example, the incoming data set might be database snapshots, virtual machine images, or the like. Data deduplication not only reduces storage space by eliminating duplicate data but also minimizes the transmission of redundant data in network environments.
Each incoming data chunk can be identified by computing a cryptographically secure hash signature or fingerprint, e.g., SHA-1 or SHA-2, for that data chunk. An index of all of the fingerprints is also created, with each fingerprint pointing to the corresponding data chunk. This index then provides the reference list for determining whether a given data chunk has previously been stored.
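The fingerprint index described above can be sketched as follows. This is a minimal illustration only, not any particular system's implementation: the names deduplicate and restore are hypothetical, SHA-256 is chosen here as one member of the SHA-2 family, and an in-memory dictionary stands in for a persistent index whose entries would point to stored chunks.

```python
import hashlib


def deduplicate(chunks):
    """Store each unique chunk once and record how to rebuild the input.

    Returns (index, recipe). `index` maps fingerprint -> chunk bytes and
    stands in for an index of pointers to stored chunks. `recipe` is the
    ordered list of fingerprints that reconstructs the original input.
    """
    index = {}
    recipe = []
    for chunk in chunks:
        fingerprint = hashlib.sha256(chunk).hexdigest()
        if fingerprint not in index:
            # First time this content is seen: store it.
            index[fingerprint] = chunk
        # Seen before or not, the data set only keeps a reference.
        recipe.append(fingerprint)
    return index, recipe


def restore(index, recipe):
    """Rebuild the original byte stream from the index and the recipe."""
    return b"".join(index[fingerprint] for fingerprint in recipe)
```

In this sketch, a repeated chunk adds one more entry to the recipe but no new entry to the index, which corresponds to the single stored copy being referred to by multiple pointers.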
In fixed-length block deduplication, the multiple data chunks are fixed in size, i.e., the data is segmented into fixed-size blocks. The length of each block may be 4K-Byte, for example. As another example, the length may be 16K-Byte. In variable-length deduplication, the data is segmented into variable-sized block units. Here, the length of each variable-sized unit depends upon the content itself.
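The two segmentation styles can be contrasted with a short sketch. Everything here is an assumption made for illustration: the function names, the parameter defaults, and in particular the boundary rule, which uses a simple rolling sum of the most recent bytes as a stand-in for the stronger rolling hashes (for example, Rabin fingerprints) that practical variable-length systems typically use.

```python
def fixed_chunks(data: bytes, size: int = 4096):
    """Split data into blocks of exactly `size` bytes (the last may be shorter)."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def content_defined_chunks(data: bytes, window: int = 16, divisor: int = 64,
                           min_size: int = 32, max_size: int = 1024):
    """Split data where a rolling sum of the last `window` bytes hits a target.

    Boundaries are chosen from the content itself, so chunk lengths vary.
    `min_size` and `max_size` bound the chunk lengths (max_size >= min_size).
    """
    chunks = []
    start = 0
    rolling = 0
    for i, byte in enumerate(data):
        rolling += byte
        if i >= window:
            rolling -= data[i - window]  # drop the byte that left the window
        length = i - start + 1
        at_boundary = length >= min_size and rolling % divisor == divisor - 1
        if at_boundary or length >= max_size:
            chunks.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        chunks.append(data[start:])  # trailing bytes after the last boundary
    return chunks
```

Both functions partition the input without loss, so concatenating the returned chunks yields the original data. They differ only in how the cut points are chosen: by position in the first case and by content in the second.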
In common practice, an incoming data chunk and a preceding data chunk may differ by only a single burst of changes. In backup systems, for example, a single file is often a backup image made up of a large number of component files. Such backup images are rarely entirely identical, even when they are successive backups of the same file system. A single addition, deletion, or change of any component file can easily shift the remaining image content. Even if no other file has changed, the shift would cause each subsequent fixed-size segment to differ from its counterpart in the previous backup, containing some bytes from one neighbor and giving up some bytes to its other neighbor.
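The shift effect described above can be reproduced with a small experiment. The sketch below is illustrative only: the helper names, the 64-byte chunk size, and the seeded pseudo-random test data are all assumptions chosen to keep the example short and repeatable.

```python
import hashlib
import random


def fingerprints(data: bytes, size: int) -> list:
    """SHA-256 fingerprint of each fixed-size chunk of data."""
    return [hashlib.sha256(data[i:i + size]).hexdigest()
            for i in range(0, len(data), size)]


def shared_chunk_count(original: bytes, edited: bytes, size: int) -> int:
    """Count chunks of `edited` whose fingerprint already exists for `original`."""
    known = set(fingerprints(original, size))
    return sum(fp in known for fp in fingerprints(edited, size))


rng = random.Random(0)  # fixed seed so the run is repeatable
original = bytes(rng.getrandbits(8) for _ in range(1024))  # 16 chunks of 64 bytes
# Insert a single byte at offset 100, i.e., inside the second chunk.
edited = original[:100] + b"\x00" + original[100:]

# Only the first chunk lies wholly before the insertion point, so it is the
# only one expected to match; every later chunk is shifted by one byte.
print(shared_chunk_count(original, edited, 64), "of",
      len(fingerprints(edited, 64)), "chunks already stored")
```

Although only one byte out of 1024 changed, nearly every fixed-size chunk after the insertion point receives a new fingerprint and would be stored again.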
Generally, existing data deduplication systems and methods can be computationally costly and inefficient, and can often result in the storage of redundant or duplicate data, particularly within the context described above. It is within this context that a need arises to address one or more disadvantages of conventional systems and methods.