Deduplication is a technique commonly used in the computer storage industry to reduce the storage footprint by removing common portions of files. Deduplication is typically performed by reading a file and converting it into a stream of bytes, which is then divided into smaller chunks of data. A cryptographic hashing algorithm, such as SHA-2, is used to create an effectively unique fingerprint for each chunk. The fingerprint for each chunk is then compared against a database of existing fingerprints.
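The chunking and fingerprinting steps can be sketched as follows. This is a minimal illustration rather than a definitive implementation: it assumes fixed-size chunks of 4096 bytes (the text does not specify a chunking strategy or size) and uses SHA-256 as one member of the SHA-2 family. The names CHUNK_SIZE and fingerprint_chunks are hypothetical.

```python
import hashlib
from typing import Iterator, Tuple

# Assumed chunk size; the text does not prescribe one.
CHUNK_SIZE = 4096


def fingerprint_chunks(
    data: bytes, chunk_size: int = CHUNK_SIZE
) -> Iterator[Tuple[int, str, bytes]]:
    """Yield (offset, fingerprint, chunk) for each fixed-size chunk of data.

    Fixed-size chunking is an assumption made for this sketch; the final
    chunk may be shorter than chunk_size.
    """
    for offset in range(0, len(data), chunk_size):
        chunk = data[offset:offset + chunk_size]
        # SHA-256 (a SHA-2 variant) serves as the chunk fingerprint.
        fingerprint = hashlib.sha256(chunk).hexdigest()
        yield offset, fingerprint, chunk
```

Because the fingerprint depends only on the chunk contents, two chunks with identical bytes yield the same fingerprint regardless of the file or offset at which they appear, which is what makes the database lookup possible.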
If a particular fingerprint is not in the database of existing fingerprints, then the fingerprint is unique, and (a) that fingerprint is added to the database of existing fingerprints and (b) the chunk of data associated with the new fingerprint is saved, because that chunk is not a duplicate of another chunk of data. However, if a particular fingerprint is already in the database of existing fingerprints, then the chunk of data associated with that fingerprint is a duplicate of a previously encountered chunk of data. When a duplicate chunk is encountered, a reference is added to the fingerprint's entry recording that the chunk has also been found in a different file at a particular offset. In addition, the metadata for the original file is updated to include a fingerprint reference so that the file can be reconstructed during read operations. The duplicate chunk is not stored a second time, which results in storage savings.
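The lookup, store, and reconstruction logic described above can be sketched as follows. This is an illustration under stated assumptions, not a definitive implementation: in-memory dictionaries stand in for the fingerprint database and the file metadata, chunks are fixed-size (4096 bytes by default), and SHA-256 is the fingerprint. The class DedupStore and its methods are hypothetical names. As a simplification, the sketch records a fingerprint in the file's metadata for every chunk, new or duplicate, so that a file can be rebuilt from its metadata alone.

```python
import hashlib
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class DedupStore:
    # Fingerprint database: fingerprint -> the single stored copy of a chunk.
    chunks: Dict[str, bytes] = field(default_factory=dict)
    # Per fingerprint: (file name, offset) of each duplicate occurrence.
    references: Dict[str, List[Tuple[str, int]]] = field(default_factory=dict)
    # Per-file metadata: ordered fingerprints needed to rebuild the file.
    recipes: Dict[str, List[str]] = field(default_factory=dict)

    def write(self, name: str, data: bytes, chunk_size: int = 4096) -> int:
        """Deduplicate data under name; return the bytes newly stored."""
        recipe: List[str] = []
        stored = 0
        for offset in range(0, len(data), chunk_size):
            chunk = data[offset:offset + chunk_size]
            fp = hashlib.sha256(chunk).hexdigest()
            if fp not in self.chunks:
                # New fingerprint: save both the fingerprint and the chunk.
                self.chunks[fp] = chunk
                stored += len(chunk)
            else:
                # Duplicate: record where it was seen, store nothing new.
                self.references.setdefault(fp, []).append((name, offset))
            recipe.append(fp)
        self.recipes[name] = recipe
        return stored

    def read(self, name: str) -> bytes:
        """Reconstruct a file by following its fingerprint references."""
        return b"".join(self.chunks[fp] for fp in self.recipes[name])
```

A short usage example, with hypothetical file names: the second file shares one chunk with the first, so only its non-duplicate chunk consumes new storage.

```python
store = DedupStore()
first = b"A" * 4096 + b"B" * 4096
second = b"B" * 4096 + b"C" * 100

store.write("first.bin", first)
store.write("second.bin", second)

# Both files read back intact even though the shared chunk is stored once.
assert store.read("first.bin") == first
assert store.read("second.bin") == second
```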