Data is very important to individuals and businesses. Many businesses regularly back up data stored on computer systems to avoid loss of data should a storage device or system fail or become damaged. One current data backup trend is to back up data to disks and to use tapes for long-term retention only. The amount of disk space needed to store a month's worth of backups can be very large, such as around 70 terabytes in some examples. The amount of data to be backed up is likely to keep increasing.
One strategy for backing up data involves backing up only the data that has changed, as opposed to all of the data, and then using prior backups of the unchanged data to reconstruct the new backup. In one approach, data may be divided into fixed size blocks. A hash, such as an MD5 hash or a SHA256 hash, may be calculated on the data belonging to each fixed size block, resulting in a signature for each block of data. The signature may be searched against an in-memory database or an embedded database of previously stored signatures. In this approach, however, any insertion of new data causes all of the data that follows it to shift across block boundaries. The shifted blocks then produce signatures that do not match those stored earlier, and hence fixed size chunking and hash calculations on those fixed size chunks will not help.
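The effect of an insertion on fixed size blocks can be illustrated with a minimal sketch. The block size of 8 bytes, the sample data, and the helper names `block_signatures` and `count_matches` are assumptions chosen for illustration only and are not taken from any particular backup product.

```python
import hashlib

BLOCK_SIZE = 8  # hypothetical block size, kept tiny so the effect is visible


def block_signatures(data: bytes, block_size: int = BLOCK_SIZE) -> list[str]:
    """Return one MD5 signature per fixed size block of data."""
    return [
        hashlib.md5(data[i:i + block_size]).hexdigest()
        for i in range(0, len(data), block_size)
    ]


def count_matches(old_sigs: list[str], new_sigs: list[str]) -> int:
    """Count how many new signatures were already seen in the old set."""
    known = set(old_sigs)
    return sum(1 for sig in new_sigs if sig in known)


original = b"AAAAAAAABBBBBBBBCCCCCCCCDDDDDDDD"     # four 8-byte blocks
overwritten = b"AAAAAAAABBBBXBBBCCCCCCCCDDDDDDDD"  # one byte changed in place
inserted = b"X" + original                         # one byte inserted at the front

old = block_signatures(original)
matches_after_overwrite = count_matches(old, block_signatures(overwritten))
matches_after_insert = count_matches(old, block_signatures(inserted))

print(matches_after_overwrite, matches_after_insert)
```

In this sketch an in-place change leaves three of the four blocks matching, whereas a single inserted byte shifts every block boundary so that no block matches its earlier signature.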
The next time the file is backed up, signatures are generated for its blocks and searched against the database of signatures. Blocks whose signatures are already present in the database are duplicates of previously saved blocks, so only the changed blocks need be saved during the backup.
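A subsequent backup of this kind might be sketched as follows. The in-memory dictionary standing in for the signature database, the 4096-byte block size, and the function names `backup` and `restore` are all hypothetical and serve only to illustrate the idea under those assumptions.

```python
import hashlib


def backup(data: bytes, store: dict[str, bytes], block_size: int = 4096):
    """Save only blocks whose signature is not yet in the store.

    Returns the ordered list of signatures needed to rebuild the data
    and the number of blocks that actually had to be saved.
    """
    recipe = []
    new_blocks = 0
    for i in range(0, len(data), block_size):
        block = data[i:i + block_size]
        sig = hashlib.sha256(block).hexdigest()
        if sig not in store:      # signature not seen before: block is new or changed
            store[sig] = block
            new_blocks += 1
        recipe.append(sig)
    return recipe, new_blocks


def restore(recipe: list[str], store: dict[str, bytes]) -> bytes:
    """Rebuild a backup from its ordered signatures and the saved blocks."""
    return b"".join(store[sig] for sig in recipe)


store: dict[str, bytes] = {}

# First backup: four distinct 4096-byte blocks, all of them new.
v1 = b"".join(bytes([n]) * 4096 for n in range(4))
recipe1, saved1 = backup(v1, store)

# Second backup: one byte overwritten inside the second block.
v2 = bytearray(v1)
v2[5000] = 255
v2 = bytes(v2)
recipe2, saved2 = backup(v2, store)

print(saved1, saved2)
```

Here the first backup saves four blocks and the second saves only the one block that changed, while both versions remain reconstructible from the shared store.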