Data is very important to individuals and businesses. Many businesses regularly back up data stored on computer systems to avoid loss of data should a storage device or system fail or become damaged. One current data backup trend is to back up data to disks and use tapes for long-term retention only. The amount of disk space needed to store a month's backup can be very large, such as around 70 terabytes in some examples. The amount of data is likely to keep increasing.
One strategy for backing up data involves backing up only data that has changed, as opposed to all of the data, and then using prior backups of unchanged data to reconstruct the backed up data if needed. In one approach, data may be divided into fixed-size blocks. An MD5 hash or a SHA256 hash may be calculated on the data belonging to each fixed-size block, resulting in a signature (for example, an MD5 signature) for each block of data. The signature may be searched against an in-memory database or an embedded database of previous signatures.
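The block-and-signature approach described above can be illustrated with a minimal sketch. It assumes SHA-256 as the hash, a Python set standing in for the in-memory database, and a block size of 4096 bytes; the block size and the names split_blocks, signature, and scan are hypothetical and are not taken from any particular backup product.

```python
import hashlib

BLOCK_SIZE = 4096  # hypothetical value; the text does not specify a block size


def split_blocks(data: bytes, block_size: int = BLOCK_SIZE) -> list[bytes]:
    """Divide data into fixed-size blocks (the last block may be shorter)."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def signature(block: bytes) -> str:
    """Return the SHA-256 signature of one block as a hex string."""
    return hashlib.sha256(block).hexdigest()


def scan(data: bytes, database: set[str], block_size: int = BLOCK_SIZE):
    """Search each block's signature against the database of previous signatures.

    Returns one (signature, already_known) pair per block and records
    every signature that was not previously in the database.
    """
    results = []
    for block in split_blocks(data, block_size):
        sig = signature(block)
        known = sig in database
        if not known:
            database.add(sig)
        results.append((sig, known))
    return results
```

In this sketch, a block whose signature is already in the database is treated as a duplicate of previously backed up data, so only blocks flagged as not known would need to be stored.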
The next time a file is backed up, signatures are generated for its blocks and searched against the database of signatures to find duplicates, so that blocks whose data has changed can be identified. Since the data being backed up may be very large, there can be a large number of signatures.
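As an illustration of a repeated backup and of how quickly the signature count grows, the following sketch runs two backups against the same signature database and then estimates the number of signatures for a large data set. The names backup_run and signature_count, the use of SHA-256, the 4096-byte block size, and the reading of 70 terabytes as 70 x 10^12 bytes are all assumptions made for the example.

```python
import hashlib


def block_signatures(data: bytes, block_size: int) -> list[str]:
    """Return one SHA-256 signature per fixed-size block of data."""
    return [
        hashlib.sha256(data[i:i + block_size]).hexdigest()
        for i in range(0, len(data), block_size)
    ]


def backup_run(data: bytes, database: set[str], block_size: int) -> list[int]:
    """Return indices of blocks whose signatures are not yet in the database.

    New signatures are added to the database so that later runs
    recognize those blocks as duplicates.
    """
    new_blocks = []
    for index, sig in enumerate(block_signatures(data, block_size)):
        if sig not in database:
            database.add(sig)
            new_blocks.append(index)
    return new_blocks


def signature_count(total_bytes: int, block_size: int) -> int:
    """Number of signatures needed to cover total_bytes (ceiling division)."""
    return -(-total_bytes // block_size)


if __name__ == "__main__":
    database: set[str] = set()
    print(backup_run(b"AAAABBBBCCCC", database, 4))  # first backup: [0, 1, 2]
    print(backup_run(b"AAAAXXXXCCCC", database, 4))  # second backup: [1]
    # Assumed sizes: 70 * 10**12 bytes of data, 4096-byte blocks.
    print(signature_count(70 * 10**12, 4096))  # 17089843750
```

Under these assumed sizes, roughly 17 billion signatures would be needed, which at 32 bytes per SHA-256 signature is on the order of 547 gigabytes of raw signature data before any database overhead.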