Deduplication reduces the memory space required to store digital information where more than one user or application references the same data. For example, the same attachment may be found in an email sent to a plurality of recipients, and each recipient's copy of the email and attachment may be stored in a storage system. The attachment may be forwarded to another recipient or recipients and the attachment stored again. Where the data is stored in a common memory system, the aim of deduplication is to store the data only once, and to provide access to the data through a mapping using metadata. For certain types of data this is very effective in reducing the storage requirements.
The size of a deduplicated memory area is reduced by eliminating unnecessary copies of the data. There is some additional memory required to store the metadata to keep track of the stored single copy of the deduplicated data, and additional processing time to perform the deduplication process and to retrieve deduplicated data, but the overall effect, in practice, is to substantially reduce the required memory space.
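The store-once-and-map idea described above can be illustrated with a minimal sketch. This is not a description of any particular storage system: the class name `DedupStore`, its methods, and the choice of SHA-256 as the content fingerprint are assumptions made for illustration only.

```python
import hashlib


class DedupStore:
    """Toy content-addressed store (hypothetical): each unique payload is kept once."""

    def __init__(self):
        self.blocks = {}   # digest -> (payload, reference count)
        self.mapping = {}  # user-visible name -> digest (the metadata)

    def put(self, name: str, data: bytes) -> None:
        if name in self.mapping:
            raise KeyError(f"{name!r} is already stored")
        # Assumption: SHA-256 is used as the fingerprint of the content.
        digest = hashlib.sha256(data).hexdigest()
        payload, refs = self.blocks.get(digest, (data, 0))
        self.blocks[digest] = (payload, refs + 1)  # store once, count references
        self.mapping[name] = digest

    def get(self, name: str) -> bytes:
        # Retrieval goes through the metadata mapping to the single stored copy.
        return self.blocks[self.mapping[name]][0]

    def stored_bytes(self) -> int:
        return sum(len(payload) for payload, _ in self.blocks.values())


attachment = b"quarterly report " * 1000  # 17,000 bytes
store = DedupStore()
for recipient in ("alice", "bob", "carol"):
    store.put(f"{recipient}/attachment", attachment)

logical_bytes = 3 * len(attachment)  # what three separate copies would occupy
print(logical_bytes, store.stored_bytes())
```

In this sketch the three recipients reference 51,000 bytes of logical data while only 17,000 bytes of payload are held; the `mapping` dictionary and the digests correspond to the additional metadata overhead noted above.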
Many types of data also exhibit internal redundancy within a page or file. That is, the data may have patterns of repetition which may be represented in a more compact form. The lower the Shannon entropy of an extent of data, the less information is represented by the extent, and the fewer bytes may be needed to represent it. This process of reducing the number of bytes is termed data compression. Depending on the type of data to be stored, the compression may be performed in a lossless or lossy manner. Lossless compression is a reversible process where the data may be exactly recovered by decompression. Lossy compression results in some loss of data fidelity, but may be acceptable for images, sound, and the like, where reproduction limitations may limit the achievable fidelity so that the loss does not impact the user.
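As a rough illustration of the entropy notion above, the following sketch estimates the Shannon entropy of an extent from its byte frequencies alone. The function name is hypothetical, and the estimate is a simplification: it ignores the ordering of bytes, so data with repeating patterns may be more compressible than this figure suggests.

```python
import math
from collections import Counter


def byte_entropy_bits(data: bytes) -> float:
    """Estimate Shannon entropy in bits per byte from byte frequencies only."""
    if not data:
        return 0.0
    total = len(data)
    counts = Counter(data)
    # H = sum over symbols of p * log2(1 / p), with p = count / total
    return sum((c / total) * math.log2(total / c) for c in counts.values())


all_zero_page = bytes(4096)            # one symbol only: no information per byte
uniform_page = bytes(range(256)) * 16  # every byte value equally frequent

print(byte_entropy_bits(all_zero_page))  # 0.0
print(byte_entropy_bits(uniform_page))   # 8.0
```

The second page scores the maximum of 8 bits per byte under this frequency-only estimate even though it is a simple repeating sequence, which shows why such an estimate is only a first approximation of compressibility.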
Where the term compression is used, either type of compression may be meant, and the specific compression algorithm is selected based on the type and use of the data, although data compressed by lossy compression techniques cannot be restored to exactly the original form. Compressing a page of data, for example, results in the data of an original page being represented by fewer bytes than are needed to store the original data page in an uncompressed form.
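A lossless round trip on a single page can be sketched with the zlib module of the Python standard library. The choice of zlib (DEFLATE), the 4096-byte page size, and the page contents are assumptions for illustration; the text above does not prescribe a particular algorithm.

```python
import zlib

PAGE_SIZE = 4096  # assumed page size for this example

# A repetitive page, truncated to exactly one page of data.
page = (b"sensor=ok;" * 410)[:PAGE_SIZE]

compressed = zlib.compress(page, 9)    # 9 = highest compression level
restored = zlib.decompress(compressed)

print(len(page), len(compressed))
print(restored == page)  # True: lossless compression recovers the page exactly
```

For highly repetitive input such as this, the compressed form occupies far fewer bytes than the original page; data that is already compressed or effectively random would not shrink in the same way.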
Storage media such as rotating disks, FLASH memory circuits or the like are ordinarily based on some standardized media format or access protocol. This has led to the formatting of data in accordance with constructs such as sector, page, block or the like, each one having a fixed size, at least in a particular example. In such an example, typical rotating disks are formatted into sectors (512 bytes) and pages comprising multiple sectors. For example, a 4 KB page would have 8 sectors. This terminology has evolved historically, and one may find sectors that are multiples of 512 bytes and pages that are multiples or sub-multiples of 4 KB. These are nominal sizes; typically there is an additional spare area that may be used for metadata such as an error correcting code (ECC) and other descriptive information. This area may, however, be used for data and, in the same manner, the data area could be used for metadata.
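The sector and page arithmetic in the example above can be worked through in a short sketch. The constants use the nominal sizes from the text (512-byte sectors, 4 KB pages) and ignore any spare area; the helper names are hypothetical.

```python
SECTOR_BYTES = 512   # nominal sector size from the example above
PAGE_BYTES = 4096    # nominal 4 KB page


def sectors_per_page(page_bytes: int = PAGE_BYTES,
                     sector_bytes: int = SECTOR_BYTES) -> int:
    """Number of whole sectors in a page; the page must be a multiple of the sector."""
    if page_bytes % sector_bytes:
        raise ValueError("page size is not a multiple of the sector size")
    return page_bytes // sector_bytes


def locate(byte_offset: int) -> tuple[int, int, int]:
    """Map a byte offset to (page index, sector within page, byte within sector)."""
    page, within_page = divmod(byte_offset, PAGE_BYTES)
    sector, within_sector = divmod(within_page, SECTOR_BYTES)
    return page, sector, within_sector


print(sectors_per_page())  # 8
print(locate(10_000))      # (2, 3, 272)
```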