1. Field
This disclosure is generally related to data synchronization. More specifically, this disclosure is related to generating a vector of hash function values (a “sketch”) representing a collection of data.
2. Related Art
In many computing applications, two remote computer systems need to keep their document collections synchronized. If the two collections are meant to be identical, their data is likely to agree at about the 99% level. However, estimating such a high overlap with a reasonable level of accuracy (e.g., ±0.5%) requires the information exchanged between the two computer systems to be based on a substantially large sample of the data.
A commonly used measure of the agreement between two collections A and B of data objects (files) is the “overlap.” This overlap can be computed as the number of objects in the intersection of A and B divided by the number of objects in the union (|A∩B|/|A∪B|), which will be a real number between 0 and 1. Individual data objects can be represented by checksums, for example 128-bit hash function values, such that if two checksums agree, it is highly likely that the two data objects agree. To estimate the overlap between a local data collection A and a remote data collection B, a computer system may receive some or all of the checksums for the data objects in the remote collection B, and compare these checksum values to those for local collection A. Unfortunately, the overlap estimate may be highly inaccurate unless all the checksums are transferred.
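For illustration only, the exact overlap computation described above can be sketched as follows. The function names `checksum` and `overlap`, the choice of BLAKE2b truncated to 128 bits, and the convention that two empty collections have an overlap of 1 are assumptions made for this example and are not taken from the disclosure.

```python
import hashlib


def checksum(data: bytes) -> bytes:
    """Return a 128-bit checksum for one data object.

    BLAKE2b with a 16-byte digest is an illustrative choice; any
    128-bit hash function could be substituted.
    """
    return hashlib.blake2b(data, digest_size=16).digest()


def overlap(checksums_a: set, checksums_b: set) -> float:
    """Exact overlap |A ∩ B| / |A ∪ B| computed from full checksum sets."""
    union = checksums_a | checksums_b
    if not union:
        # Assumed convention: two empty collections agree completely.
        return 1.0
    return len(checksums_a & checksums_b) / len(union)


# Hypothetical collections: 100 objects each, 99 of them shared.
local = {checksum(f"file-{i}".encode()) for i in range(100)}
remote = {checksum(f"file-{i}".encode()) for i in range(1, 101)}

exact = overlap(local, remote)  # 99 shared objects out of 101 distinct
```

Note that this computation needs every checksum of the remote collection, which is the communication cost discussed in this section.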
Moreover, communicating the checksum values for the remote data collection can involve a high-bandwidth file transfer operation, which makes it infeasible to frequently compare the contents of the local and remote file collections. If the computer system needs to estimate the overlap frequently, the remote system may need to reduce the amount of transferred information by generating checksum values for only a small subset of files, at the cost of significantly reducing the quality of the overlap estimation.
Some overlap-estimation systems reduce the amount of communication by using min-wise hashing to generate a sketch vector. In this technique, there is a set of n universally known hash functions, h_1, h_2, ..., h_n, and a collection A of data objects is represented by a vector of n numbers, (min_1, min_2, ..., min_n), where min_i is the minimum value of h_i over all data objects in A. The vector of minimum values is called a “sketch,” and the overlap of collections A and B can be estimated by the overlap of their sketches.
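A minimal sketch of min-wise hashing, under stated assumptions, is shown below. The family of n hash functions is simulated here by salting BLAKE2b with the function index, and the overlap of two sketches is taken to be the fraction of positions at which they hold the same minimum, which is a common estimator for this technique. The names `h`, `sketch`, and `estimate_overlap` are hypothetical and chosen for this example.

```python
import hashlib


def h(i: int, data: bytes) -> int:
    """i-th hash function: 64-bit BLAKE2b salted with the index i.

    Salting one hash with an index is an assumed way to obtain n
    functions that both systems can reproduce independently.
    """
    salt = i.to_bytes(8, "big")
    digest = hashlib.blake2b(data, digest_size=8, salt=salt).digest()
    return int.from_bytes(digest, "big")


def sketch(collection, n: int) -> list:
    """Vector of n minima, one per hash function (collection must be non-empty)."""
    return [min(h(i, obj) for obj in collection) for i in range(n)]


def estimate_overlap(sketch_a: list, sketch_b: list) -> float:
    """Fraction of sketch positions where both collections share the same minimum."""
    matches = sum(1 for a, b in zip(sketch_a, sketch_b) if a == b)
    return matches / len(sketch_a)


# Hypothetical collections with a true overlap of 100 / 300.
coll_a = [f"doc-{i}".encode() for i in range(0, 200)]
coll_b = [f"doc-{i}".encode() for i in range(100, 300)]

estimate = estimate_overlap(sketch(coll_a, 256), sketch(coll_b, 256))
```

Only the n minima are exchanged between the two systems, so the communication cost depends on n rather than on the number of data objects.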
In many applications, accurate overlap estimation is especially important for collections that have a high overlap (e.g., 90%-100%). However, the precision of the estimate depends upon the size of the sketch. For example, if n is less than 100, the sketch may not reliably distinguish 97% from 98% overlap, and if n is less than 1000, the sketch may not reliably distinguish 97.7% from 97.8% overlap. Therefore, these overlap-estimation systems may require an undesirably large sketch vector to compute a detailed overlap estimate for two collections that are expected to be nearly identical.
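The dependence of precision on sketch size can be made concrete with a short calculation. The example below assumes that each of the n sketch positions matches independently with probability equal to the true overlap, so that the estimate behaves like a binomial proportion; this model and the function name `sketch_std_error` are assumptions made for illustration.

```python
import math


def sketch_std_error(overlap: float, n: int) -> float:
    """Standard error of the overlap estimate from an n-entry sketch.

    Assumes each position matches independently with probability
    `overlap`, giving sqrt(overlap * (1 - overlap) / n).
    """
    return math.sqrt(overlap * (1.0 - overlap) / n)


# Compare the standard error with the gap to be resolved in each case.
err_100 = sketch_std_error(0.97, 100)     # about 0.017, larger than a 0.01 gap
err_1000 = sketch_std_error(0.977, 1000)  # about 0.0047, larger than a 0.001 gap

for label, err in (("n=100", err_100), ("n=1000", err_1000)):
    print(f"{label}: standard error is roughly {err:.4f}")
```

Under this model the standard error shrinks only as the square root of n, so halving the error requires a sketch four times as large.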