Deduplication and compression are two data reduction techniques that differ in the scale and granularity at which they operate. Deduplication finds repeating storage entities (e.g., files) across large data sets and volumes that may hold large amounts of data (e.g., multiple terabytes or petabytes). Compression refers to data reduction at a local scale (e.g., inside each storage entity considered for deduplication). One popular compression method is Lempel-Ziv factorization, which uses repetition elimination. Unlike deduplication, repetition elimination works by finding repeated byte strings inside a single stream of data, typically at a bounded distance from one another (e.g., up to 32 KB apart). Other compression methods include entropy encoding, run-length encoding, and various image and video compression methods (both lossy and lossless).
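By way of illustration only, repetition elimination over a bounded distance can be sketched as follows. This is a minimal, unoptimized sketch in the style of Lempel-Ziv factorization, not a description of any particular compressor. The names `lz_factorize`, `lz_expand`, `WINDOW`, and `MIN_MATCH` are hypothetical, and the minimum match length of 3 bytes is an assumption; only the 32 KB distance bound comes from the example above.

```python
WINDOW = 32 * 1024  # maximum back-reference distance (the 32 KB example above)
MIN_MATCH = 3       # assumed: shorter repetitions are emitted as literals


def lz_factorize(data: bytes, window: int = WINDOW):
    """Return tokens: ('lit', byte_value) or ('copy', distance, length)."""
    tokens = []
    i, n = 0, len(data)
    while i < n:
        best_len, best_dist = 0, 0
        # Only look back at most `window` bytes from the current position.
        for j in range(max(0, i - window), i):
            length = 0
            # A match may run past position i (overlapping copy).
            while i + length < n and data[j + length] == data[i + length]:
                length += 1
            if length > best_len:
                best_len, best_dist = length, i - j
        if best_len >= MIN_MATCH:
            tokens.append(("copy", best_dist, best_len))
            i += best_len
        else:
            tokens.append(("lit", data[i]))
            i += 1
    return tokens


def lz_expand(tokens) -> bytes:
    """Rebuild the original bytes from the token list."""
    out = bytearray()
    for tok in tokens:
        if tok[0] == "lit":
            out.append(tok[1])
        else:
            _, dist, length = tok
            for _ in range(length):
                out.append(out[-dist])  # byte-by-byte so overlaps work
    return bytes(out)
```

In this sketch, a repetition that lies farther back than `window` bytes is never found, which is the sense in which compression operates at a local scale. The exhaustive search is quadratic in the window size and is shown only for clarity.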
Storage systems that store many copies of identical data can greatly benefit from having built-in deduplication. Examples include systems with backups, replication across many users, or many instances of virtual machine images. In operation, deduplication of identical storage entities can be implemented by storing a single physical instance of the storage entity and multiple logical instances of the storage entity, each of the logical instances comprising a pointer to the physical instance of the storage entity.
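By way of illustration only, the arrangement of one physical instance and multiple logical instances can be sketched as follows. The class `DedupStore` and its methods are hypothetical names. Identifying identical entities by a SHA-256 digest of their content is an assumption made for this sketch; the description above does not specify how identical entities are detected. Deletion and reclamation of unreferenced physical instances are not handled.

```python
import hashlib


class DedupStore:
    """Toy store: one physical copy per distinct content, many logical names."""

    def __init__(self):
        self._physical = {}  # digest -> bytes (single physical instance)
        self._logical = {}   # name -> digest (pointer to the physical instance)

    def put(self, name: str, content: bytes) -> None:
        # Assumption: equal SHA-256 digests are treated as identical content.
        digest = hashlib.sha256(content).hexdigest()
        self._physical.setdefault(digest, content)  # store only if new
        self._logical[name] = digest

    def get(self, name: str) -> bytes:
        # Follow the logical instance's pointer to the physical instance.
        return self._physical[self._logical[name]]

    def physical_count(self) -> int:
        return len(self._physical)

    def logical_count(self) -> int:
        return len(self._logical)
```

For example, storing the same virtual machine image under two names in this sketch yields two logical instances that point to one physical instance, so the image content is held only once.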
The description above is presented as a general overview of related art in this field and should not be construed as an admission that any of the information it contains constitutes prior art against the present patent application.