1. Field
This disclosure is generally related to data synchronization. More specifically, this disclosure is related to comparing additive hash values that represent collections of content item names to determine whether a local data collection and a remote data collection are synchronized.
2. Related Art
In many computing applications, it is important for two computer systems that host remote data collections to synchronize their data when the collections are not in agreement. However, to determine the level of agreement between the two data collections, these computer systems may need to exchange information based on a substantially large sample of their data.
A commonly used measure of the agreement between two collections A and B of data objects (files) is the “overlap.” This overlap can be computed as the number of objects in the intersection of A and B divided by the number of objects in the union (|A∩B|/|A∪B|), which will be a real number between 0 and 1. Individual data objects are typically represented by checksums that are computed from the contents of the data objects, for example 128-bit hash function values. If two checksums agree, it is highly likely that the two data objects also agree. Unfortunately, computing the checksums for large data files can consume substantial processing time.
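For illustration only, the overlap measure described above can be sketched in Python as follows. The function names `checksum` and `overlap` are hypothetical, the choice of BLAKE2b truncated to 128 bits is an assumption (the disclosure does not name a specific hash function), and treating two empty collections as fully overlapping is likewise an assumed convention.

```python
import hashlib


def checksum(content: bytes) -> str:
    # Assumed checksum: a 128-bit (16-byte) BLAKE2b digest of the object's contents.
    return hashlib.blake2b(content, digest_size=16).hexdigest()


def overlap(collection_a, collection_b) -> float:
    # Represent each data object by the checksum of its contents.
    checksums_a = {checksum(obj) for obj in collection_a}
    checksums_b = {checksum(obj) for obj in collection_b}
    union = checksums_a | checksums_b
    if not union:
        # Assumed convention: two empty collections are in full agreement.
        return 1.0
    # |A intersect B| / |A union B|, a real number between 0 and 1.
    return len(checksums_a & checksums_b) / len(union)


collection_a = [b"alpha", b"beta", b"gamma"]
collection_b = [b"beta", b"gamma", b"delta"]
ratio = overlap(collection_a, collection_b)  # 2 shared objects out of 4 distinct
```

Note that every object's full contents must be read to produce its checksum, which is the processing cost noted above for large data files.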
To estimate the overlap between a local data collection A and a remote data collection B, a computer system may receive some or all of the checksums for the data objects in the remote collection B, and compare these checksum values to those for local collection A. However, the overlap estimate may be highly inaccurate unless all the checksums are transferred, and communicating these checksum values for the remote data collection can involve a high-bandwidth file transfer operation.
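As a hedged illustration of why a partial transfer can mislead, the following Python sketch compares the local checksums against only the remote checksums that were actually received. The function name `estimated_overlap` and the placeholder checksum strings are hypothetical, and the naive estimator shown is one simple possibility rather than the method of any particular system.

```python
def estimated_overlap(local_checksums, received_remote_checksums) -> float:
    # Naive estimate: treat the received checksums as if they were the
    # entire remote collection.
    local = set(local_checksums)
    received = set(received_remote_checksums)
    union = local | received
    if not union:
        return 1.0  # assumed convention for two empty collections
    return len(local & received) / len(union)


# Hypothetical checksums; the remote collection is identical to the local one.
local = [f"checksum-{i}" for i in range(10)]
remote = list(local)

full_estimate = estimated_overlap(local, remote)         # all 10 checksums sent
partial_estimate = estimated_overlap(local, remote[:2])  # only 2 checksums sent
```

Under these assumptions, the two collections are fully synchronized, yet the estimate based on two of the ten remote checksums is 0.2 rather than 1.0.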
Some overlap-estimation systems reduce the amount of communication by using min-wise hashing to generate a sketch vector. In this technique, there is a set of n universally known hash functions, h1, h2, ..., hn, that are used to generate n hash values for each of the data objects in a collection A. The collection A is then represented by a “sketch” vector of n numbers that are generated from these hash values (in min-wise hashing, typically the minimum value that each hash function produces over the objects in the collection), and the overlap of collections A and B can be estimated by the overlap of their sketches. Unfortunately, generating the sketch vector can consume substantial processing time for large files, given that it requires generating a plurality of different hash values from the data files' contents.
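A minimal Python sketch of this technique, under stated assumptions, is given below. The names `hash_i`, `minwise_sketch`, and `sketch_overlap` are hypothetical; deriving the n hash functions by prefixing SHA-256 input with the function index is an assumption made for illustration; and taking the per-function minimum is the common min-wise construction rather than a detail stated in this disclosure.

```python
import hashlib


def hash_i(i: int, content: bytes) -> int:
    # Assumed family of hash functions: SHA-256 of the content prefixed with
    # the function index i, truncated to 64 bits.
    digest = hashlib.sha256(i.to_bytes(4, "big") + content).digest()
    return int.from_bytes(digest[:8], "big")


def minwise_sketch(collection, n: int = 256):
    # One number per hash function: the minimum hash value over all objects.
    # The collection must be non-empty.
    return [min(hash_i(i, obj) for obj in collection) for i in range(n)]


def sketch_overlap(sketch_a, sketch_b) -> float:
    # Estimate the overlap as the fraction of positions where the sketches agree.
    matches = sum(1 for a, b in zip(sketch_a, sketch_b) if a == b)
    return matches / len(sketch_a)


collection_a = [str(k).encode() for k in range(0, 100)]
collection_b = [str(k).encode() for k in range(50, 150)]
estimate = sketch_overlap(minwise_sketch(collection_a), minwise_sketch(collection_b))
# The true overlap here is 50/150; the estimate approximates it.
```

Only the n sketch values need to be exchanged, but every object's contents are hashed n times, which reflects the processing cost described above.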