Data deduplication, the identification and reduction of duplicate or near-duplicate data, is a long-standing goal in computer science. For example, data deduplication techniques can be used to reduce duplicate documents in a search engine index, to help teachers identify plagiarized portions of a student paper, and to improve data compression and transmission. By removing duplicate data and/or increasing the compression of existing data, organizations may reduce overall hardware, networking, and energy costs.
One method for data deduplication is the selection of landmarks in data files. The landmarks associated with a file are typically hash values generated from portions of that file. The landmarks may then be used, for example, to bound chunks within the data files. One such landmark-selection technique is known as winnowing. Winnowing, as introduced by Schleimer, Wilkerson, and Aiken, is a powerful technique for selecting landmarks; however, existing winnowing methods may not be optimal in some situations.
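The winnowing selection rule described above can be sketched as follows. This is a minimal illustration, not the method of any particular implementation: it assumes a simple polynomial hash over k-grams (the original paper uses a rolling hash for efficiency), and the function names are hypothetical. Within each sliding window of w consecutive k-gram hashes, the minimum hash is selected as a landmark, breaking ties toward the rightmost position, and each selection is recorded only once.

```python
def hash_kgram(s):
    # Illustrative polynomial hash of a k-gram (not the rolling
    # hash used in practice; chosen here for simplicity).
    h = 0
    for ch in s:
        h = (h * 31 + ord(ch)) % (1 << 32)
    return h

def kgram_hashes(text, k):
    # Hash every contiguous substring of length k.
    return [hash_kgram(text[i:i + k]) for i in range(len(text) - k + 1)]

def winnow(hashes, w):
    """Select landmarks by winnowing: in each window of w hashes,
    keep the minimum (rightmost on ties), recording each pick once."""
    fingerprints = []
    prev = -1  # index of the previously selected hash
    for start in range(len(hashes) - w + 1):
        window = hashes[start:start + w]
        m = min(window)
        # rightmost occurrence of the minimum within this window
        idx = start + max(i for i, h in enumerate(window) if h == m)
        if idx != prev:
            fingerprints.append((idx, hashes[idx]))
            prev = idx
    return fingerprints
```

A consequence of this rule is winnowing's density guarantee: every window of w consecutive k-gram hashes contains at least one selected landmark, so two files sharing a sufficiently long substring are guaranteed to share a landmark.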