Cloud computing and the long-term storage of business documents have significantly increased data storage requirements. This has necessitated the use of data storage devices with larger storage capacities. Consequently, ever larger amounts of data are available to users. To reduce the need for ever larger numbers of data storage devices, the conservation of storage space and the use of space-saving techniques have become particularly important.
Data deduplication is one way of detecting duplicate data and removing it from storage. Data deduplication reduces the amount of space required to store files by recognizing redundant data patterns. For example, a deduplicated data system may reduce the amount of space required to store similar files by dividing the files into chunks and storing only unique chunks. In this example, each deduplicated file may simply consist of a list of the chunks that make up the file.
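The chunk-and-list arrangement described above may be illustrated by the following minimal sketch. It is not taken from any particular system: the function names `dedup_store` and `restore`, the 4-byte chunk size, and the use of SHA-256 digests to identify chunks are all assumptions made for illustration.

```python
import hashlib

CHUNK_SIZE = 4  # hypothetical and deliberately tiny, for illustration only


def dedup_store(files, chunk_size=CHUNK_SIZE):
    """Split each file into fixed-width chunks and keep one copy of each unique chunk."""
    store = {}    # digest -> chunk bytes (each unique chunk is stored once)
    recipes = {}  # file name -> ordered list of digests that make up the file
    for name, data in files.items():
        recipe = []
        for i in range(0, len(data), chunk_size):
            chunk = data[i:i + chunk_size]
            digest = hashlib.sha256(chunk).hexdigest()
            store.setdefault(digest, chunk)  # no-op if the chunk is already stored
            recipe.append(digest)
        recipes[name] = recipe
    return store, recipes


def restore(store, recipe):
    """Rebuild a file by concatenating its chunks in recipe order."""
    return b"".join(store[d] for d in recipe)


files = {"a": b"AAAABBBBCCCC", "b": b"AAAABBBBDDDD"}
store, recipes = dedup_store(files)
```

In this sketch the two files contain six chunks in total, but only four distinct chunks are stored, and each file is represented solely by its list of digests.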
Traditional deduplicated data systems often divide files into fixed-width chunks. However, this approach often overlooks large amounts of duplicate information, because a long sequence of data may begin at a fixed-width chunk boundary in one file, while the same sequence may begin in the middle of a fixed-width chunk in another file. As a result, the two files may share no identical fixed-width chunks that can be deduplicated.
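The boundary misalignment described above may be seen in the following sketch, in which the helper `fixed_chunks`, the 8-byte chunk width, and the sample data are hypothetical. The second file contains the same sequence as the first, shifted by a single byte.

```python
def fixed_chunks(data, size):
    """Split data into consecutive chunks of `size` bytes (the last may be shorter)."""
    return [data[i:i + size] for i in range(0, len(data), size)]


shared = b"the quick brown fox jumps over the lazy dog"
file_a = shared         # the sequence starts on a chunk boundary
file_b = b"X" + shared  # the same sequence starts one byte into a chunk

chunks_a = set(fixed_chunks(file_a, 8))
chunks_b = set(fixed_chunks(file_b, 8))
common = chunks_a & chunks_b  # chunks that could be deduplicated across both files
```

Although 43 of the 44 bytes in `file_b` duplicate `file_a`, every chunk boundary falls one byte later relative to the shared content, so `common` is empty and nothing is deduplicated between the two files.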
To facilitate the deduplication of identical sequences of data at arbitrary offsets within files, some data deduplication methods divide files into variable-width chunks. Unfortunately, determining the optimal chunk boundaries typically involves performing a large number (millions or billions) of calculations for each file, which consumes computing resources and introduces time delays.
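One common way of obtaining variable-width chunks is to place a boundary wherever a rolling hash of the most recent bytes matches a chosen pattern, so that boundaries follow the content rather than the offset. The sketch below assumes a simple Gear-style rolling hash; the text above does not identify a specific algorithm, and the names `GEAR` and `cdc_chunks`, the 32-bit hash width, and the `bits` parameter are hypothetical.

```python
import hashlib
import random

# Hypothetical per-byte table of 32-bit values, derived deterministically from SHA-256.
GEAR = [int.from_bytes(hashlib.sha256(bytes([b])).digest()[:4], "big")
        for b in range(256)]


def cdc_chunks(data, bits=6):
    """Cut a chunk wherever the top `bits` bits of the rolling hash are all zero."""
    chunks, start, h = [], 0, 0
    for i, byte in enumerate(data):
        # One hash update per input byte; older bytes shift out after 32 updates,
        # so the hash depends only on the last 32 bytes of content.
        h = ((h << 1) + GEAR[byte]) & 0xFFFFFFFF
        if (h >> (32 - bits)) == 0:
            chunks.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        chunks.append(data[start:])  # trailing bytes after the last boundary
    return chunks


rng = random.Random(42)
file_a = bytes(rng.getrandbits(8) for _ in range(4096))
file_b = b"X" + file_a  # same content, shifted by one byte

shared = set(cdc_chunks(file_a)) & set(cdc_chunks(file_b))
```

Because each boundary decision depends only on the preceding 32 bytes, the boundaries in `file_b` realign with those in `file_a` shortly after the inserted byte, and the chunks that follow are identical in both files. The sketch also makes the cost visible: the hash is updated and tested once for every byte, so a file of several gigabytes requires billions of such calculations. Practical implementations usually also enforce minimum and maximum chunk sizes, which are omitted here.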