The growing complexity of storage infrastructure requires solutions for the efficient use and management of resources. A virtualized storage system presents the user with a logical space for data storage, while the storage system itself handles the mapping of that logical space to the actual physical location. Today many virtualized storage systems implement data deduplication. Data deduplication is a data compression technique for optimizing the utilization of available storage space in a storage system (including storage area network (SAN) systems). In the deduplication process, a single copy of a data unit is stored, while duplicates of identical data units are eliminated and only a virtual representation of these units is maintained. In response to a request (e.g. a read request) for these data units, the data can be easily reconstructed. By storing a single copy of each data unit, deduplication reduces the storage space required on the physical storage.
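By way of non-limiting illustration only, the general deduplication process described above can be sketched as follows. The class name `DedupStore`, the use of SHA-256 as the identity test, and the 4096-byte data units are assumptions made for this sketch and are not taken from any particular storage system.

```python
import hashlib


class DedupStore:
    """Toy store that keeps one physical copy per distinct data unit."""

    def __init__(self):
        # digest -> bytes: the single stored copy of each distinct data unit
        self.physical = {}
        # logical address -> digest: the virtual representation of each unit
        self.logical = {}

    def write(self, address, data):
        # Assumption: two units are treated as identical when their
        # SHA-256 digests match.
        digest = hashlib.sha256(data).hexdigest()
        # Store the data only if no identical unit is already present.
        self.physical.setdefault(digest, data)
        self.logical[address] = digest

    def read(self, address):
        # Reconstruct the unit by following the logical-to-physical mapping.
        return self.physical[self.logical[address]]


store = DedupStore()
store.write(0, b"A" * 4096)
store.write(1, b"A" * 4096)  # duplicate of address 0: no new physical copy
store.write(2, b"B" * 4096)
```

In this sketch three logical data units occupy the physical space of two, and a read of any logical address returns the original data.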
In some cases, substantial parts of different data units are identical while significantly smaller parts are non-identical. In such scenarios, currently known deduplication techniques, which require complete identity between data units, would determine such data units to be non-identical and store each of them in its entirety on the physical storage.
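The limitation described above can be illustrated with a minimal sketch, assuming exact-match deduplication keyed on a SHA-256 digest of each whole data unit. The helper `stored_bytes` and the 4096-byte unit size are hypothetical and serve only this example.

```python
import hashlib

BLOCK_SIZE = 4096  # assumed data unit size for this sketch


def stored_bytes(blocks):
    """Physical bytes needed when only fully identical blocks are merged."""
    unique = {hashlib.sha256(block).digest(): len(block) for block in blocks}
    return sum(unique.values())


base = bytes(BLOCK_SIZE)          # a block of zero bytes
near_copy = base[:-1] + b"\x01"   # same block with only the last byte changed

# Two fully identical blocks are stored once.
exact_total = stored_bytes([base, base])
# Two nearly identical blocks are both stored in full.
near_total = stored_bytes([base, near_copy])
# Number of byte positions at which the two blocks actually differ.
differing = sum(x != y for x, y in zip(base, near_copy))
```

Although the two data units in this sketch differ at a single byte position, a technique that requires complete identity stores both of them in full.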
Prior art references considered to be relevant as background to the invention are listed below. Acknowledgement of the references herein is not to be inferred as meaning that these are in any way relevant to the patentability of the invention disclosed herein.
US Patent Application No. 2009234892 discloses a system and method for assuring integrity of deduplicated data objects stored within a storage system. A data object is copied to secondary storage media, and a digital signature such as a checksum is generated for the data object. Then, deduplication is performed upon the data object and the data object is split into chunks. The chunks are combined when the data object is subsequently accessed, and a signature is generated for the reassembled data object. The reassembled data object is provided if the newly generated signature is identical to the originally generated signature, and otherwise a backup copy of the data object is provided from secondary storage media.
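The general verify-on-read flow summarized above may be sketched as follows. This is an illustrative simplification and not the implementation of the cited application: the function names, the dictionary record layout, the 4-byte chunk size, and the choice of SHA-256 as the signature are all assumptions, and the deduplication of the chunks themselves is omitted.

```python
import hashlib

CHUNK_SIZE = 4  # deliberately tiny so the example is easy to follow


def split_into_chunks(data, size=CHUNK_SIZE):
    return [data[i:i + size] for i in range(0, len(data), size)]


def store_object(data):
    """Keep a backup copy, a signature of the whole object, and its chunks."""
    return {
        "backup": data,  # stands in for the copy on secondary storage media
        "signature": hashlib.sha256(data).hexdigest(),
        "chunks": split_into_chunks(data),
    }


def read_object(record):
    """Reassemble the chunks; fall back to the backup if signatures differ."""
    reassembled = b"".join(record["chunks"])
    if hashlib.sha256(reassembled).hexdigest() == record["signature"]:
        return reassembled
    return record["backup"]


record = store_object(b"hello world!")
intact = read_object(record)

record["chunks"][0] = b"XXXX"  # simulate corruption of one stored chunk
recovered = read_object(record)
```

In both reads the original object is returned: in the first case from the reassembled chunks, and in the second case from the backup copy, because the signature of the corrupted reassembly no longer matches.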
US Patent Application No. US2008005141 discloses a system and method for calculating and storing block fingerprints for data deduplication. A fingerprint extraction layer generates a fingerprint of a predefined size, e.g., 64 bits, for each data block stored by a storage system. Each fingerprint is stored in a fingerprint record, and the fingerprint records are, in turn, stored in a fingerprint database for access by the data deduplication module. The data deduplication module may periodically compare the fingerprints to identify duplicate fingerprints, which, in turn, indicate duplicate data blocks.
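A fingerprint comparison of the kind summarized above may be sketched as follows. The sketch is illustrative only: the function names are hypothetical, the 64-bit fingerprint is derived here by truncating a SHA-256 digest (an assumption, not the method of the cited application), and the fingerprint database is modeled as a plain list of records.

```python
import hashlib
from collections import defaultdict


def fingerprint64(block):
    """64-bit fingerprint: the first 8 bytes of SHA-256 (assumed scheme)."""
    return int.from_bytes(hashlib.sha256(block).digest()[:8], "big")


def find_duplicate_candidates(fingerprint_db):
    """Group block identifiers that share the same fingerprint."""
    by_fingerprint = defaultdict(list)
    for block_id, fingerprint in fingerprint_db:
        by_fingerprint[fingerprint].append(block_id)
    return [ids for ids in by_fingerprint.values() if len(ids) > 1]


blocks = {0: b"alpha", 1: b"beta", 2: b"alpha"}

# One fingerprint record (block id, fingerprint) per stored data block.
fingerprint_db = [(block_id, fingerprint64(data)) for block_id, data in blocks.items()]

# Periodic comparison pass over the fingerprint records.
duplicates = find_duplicate_candidates(fingerprint_db)
```

Because a 64-bit fingerprint is much shorter than the block it summarizes, distinct blocks can in principle share a fingerprint; a practical system would typically compare the candidate blocks themselves before eliminating one of them.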
U.S. Pat. No. 7,822,939 discloses a system for de-duplicating data, which includes providing a first volume including at least one pointer to a second volume that corresponds to physical storage space, wherein the first volume is a logical volume. A first set of data is detected as a duplicate of a second set of data stored on the second volume at a first data chunk. A pointer of the first volume associated with the first set of data is modified to point to the first data chunk. After modifying the pointer, no additional physical storage space is allocated for the first set of data.
Thus, there is still a need in the art to further improve the efficiency of physical storage utilization and to provide deduplication techniques that further reduce storage space requirements.