Deduplicated data systems are often able to reduce the amount of space required to store files by recognizing redundant data patterns. For example, a deduplicated data system may reduce the amount of space required to store similar files by dividing the files into chunks and storing only unique chunks. In this example, each deduplicated file may simply consist of a list of chunks that make up the file.
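By way of illustration only, the chunk-based storage described above may be sketched as follows. The names `store_file`, `restore_file`, and `chunk_store`, the use of SHA-256 digests as chunk identifiers, and the 4-byte chunk size are assumptions made for this sketch and are not taken from any particular deduplicated data system.

```python
import hashlib

def store_file(data: bytes, chunk_size: int, chunk_store: dict) -> list:
    """Split data into chunks, keep one copy of each unique chunk, and
    return the file's recipe: the ordered list of chunk digests."""
    recipe = []
    for offset in range(0, len(data), chunk_size):
        chunk = data[offset:offset + chunk_size]
        digest = hashlib.sha256(chunk).hexdigest()  # assumed chunk identifier
        chunk_store.setdefault(digest, chunk)       # store only the first copy
        recipe.append(digest)
    return recipe

def restore_file(recipe: list, chunk_store: dict) -> bytes:
    """Rebuild a file by concatenating the chunks named in its recipe."""
    return b"".join(chunk_store[digest] for digest in recipe)

# Two similar files that differ only in their third 4-byte chunk.
store = {}
file_a = b"AAAABBBBCCCCDDDD"
file_b = b"AAAABBBBXXXXDDDD"
recipe_a = store_file(file_a, 4, store)
recipe_b = store_file(file_b, 4, store)
```

In this sketch the two files together contain eight chunks, but the store may hold only the five unique ones, and each file remains recoverable from its list of digests.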
Some traditional deduplicated data systems may divide files into fixed-width chunks. Unfortunately, this approach may leave large amounts of duplicate data undetected and therefore not deduplicated. For example, a long sequence of data may begin at a fixed-width chunk boundary in one file, while the same sequence may begin in the middle of a fixed-width chunk in another file. Because the chunk boundaries then fall at different positions within the sequence, the two files may share no identical fixed-width chunks, and the sequence may be stored twice.
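The alignment problem may be illustrated with a small sketch. The helper `fixed_chunks`, the 8-byte chunk width, and the sample data are hypothetical and chosen only to make the effect easy to see.

```python
def fixed_chunks(data: bytes, size: int) -> set:
    """Return the set of fixed-width chunks of data (last chunk may be short)."""
    return {data[i:i + size] for i in range(0, len(data), size)}

shared = bytes(range(64))            # a 64-byte sequence of distinct byte values
file_one = shared                    # the sequence starts on a chunk boundary
file_two = b"\xff\xff\xff" + shared  # the same sequence, shifted by 3 bytes

common = fixed_chunks(file_one, 8) & fixed_chunks(file_two, 8)
print(len(common))  # 0: no chunk of one file matches any chunk of the other
```

Although all 64 bytes of the first file also appear in the second, the 3-byte shift moves every chunk boundary relative to the shared sequence, so a system comparing fixed-width chunks may find nothing to deduplicate.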
In order to facilitate the deduplication of identical sequences of data at arbitrary offsets within files, some traditional deduplicated data systems may divide files into variable-width chunks. Unfortunately, traditional methods for determining optimal chunk boundaries may involve performing millions or billions of operations per file, thereby consuming significant time and computing resources. Accordingly, the instant disclosure identifies and addresses a need for additional and improved systems and methods for variable-length chunking for deduplication.
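As a rough illustration of why variable-width chunking may be costly, the following sketch places a chunk boundary wherever a checksum of the preceding window of bytes satisfies a fixed condition. This is one generic, content-defined approach and is not asserted to be the method used by any particular traditional system. The names `cut_points` and `variable_chunks`, the use of CRC-32, the 16-byte window, and the divisor of 64 are all assumptions of the sketch.

```python
import random
import zlib

def cut_points(data: bytes, window: int = 16, divisor: int = 64) -> list:
    """Return offsets at which a chunk ends. Offset i is a cut point when the
    checksum of the `window` bytes ending at i is divisible by `divisor`."""
    cuts = []
    for i in range(window, len(data) + 1):   # one checksum per byte offset
        if zlib.crc32(data[i - window:i]) % divisor == 0:
            cuts.append(i)
    return cuts

def variable_chunks(data: bytes, window: int = 16, divisor: int = 64) -> list:
    """Split data at the cut points; any trailing remainder is the last chunk."""
    chunks, start = [], 0
    for cut in cut_points(data, window, divisor):
        chunks.append(data[start:cut])
        start = cut
    if start < len(data):
        chunks.append(data[start:])
    return chunks

rng = random.Random(0)                       # fixed seed for repeatability
file_one = bytes(rng.getrandbits(8) for _ in range(4096))
prefix = b"\xff\xff\xff"
file_two = prefix + file_one                 # same data, shifted by 3 bytes

chunks_one = variable_chunks(file_one)
chunks_two = variable_chunks(file_two)
shared_chunks = set(chunks_one) & set(chunks_two)
evaluations = len(file_one) - 16 + 1         # checksums computed for file_one
```

Because each boundary depends only on the bytes in its window, the boundaries inside the shared data fall at the same relative positions in both files, which allows chunks between them to match despite the shift. The cost is also visible: this sketch evaluates one checksum per byte offset, 4,081 for the 4,096-byte sample, so a file of one gigabyte would require on the order of a billion such evaluations.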