1. Field of Art
The invention generally relates to comparing digital data, and more specifically to comparing digital data using signatures.
2. Description of the Related Art
A min-hash signature is a coding technique often used to quickly estimate the similarity between two bit vectors or to quickly find approximate nearest neighbors within a set of bit vectors. By representing bit vectors using min-hash signatures, data can be compared far more efficiently than with direct bit-to-bit comparisons. Min-hash signatures are most often applied when the bit vector contains a large number of “expected values” relative to the number of “not-expected values”. Typically, the “expected values” are represented by 0's in the bit vector and the “not-expected values” are represented by 1's in the bit vector, although other representations are possible. Example applications for min-hash signatures include quick comparisons of digital media files such as video, audio, images, or webpages.
The min-hash process generates a signature for a bit vector by sequentially applying a set of k permutations to the bits in the bit vector, where k is typically much less than the length of the bit vector. Each permutation defines a re-arrangement of the bits in the bit vector. After a permutation is applied, the “min-hash” value is an output value corresponding to the bit location of the first ‘1’ in the re-arranged bit vector. The sequence of min-hash values from the set of applied permutations collectively makes up the min-hash signature. Thus, the min-hash process compresses a long bit vector to a more compact vector (the signature) with a length of k values. The signature is computed in such a way that it retains a sufficient level of information about the original bit vector to allow bit vectors to be compared by comparing only their signatures.
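The process described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a definitive implementation: the function names (make_permutations, minhash_signature, estimate_similarity), the 0-based bit locations, the use of the vector length as a sentinel for an all-zero vector, and the choice to compare signatures by counting matching positions are all assumptions made for the example.

```python
import random

def make_permutations(n, k, seed=0):
    """Return k random permutations of the bit locations 0..n-1 (hypothetical helper)."""
    rng = random.Random(seed)
    perms = []
    for _ in range(k):
        perm = list(range(n))
        rng.shuffle(perm)
        perms.append(perm)
    return perms

def minhash_signature(bits, perms):
    """Compute one min-hash value per permutation.

    perm[i] is the index of the original bit that lands at location i of the
    re-arranged vector. The min-hash value is the 0-based location of the
    first 1 after re-arrangement. Assumption: an all-zero vector yields
    len(bits) as a sentinel value.
    """
    signature = []
    for perm in perms:
        value = next(
            (pos for pos, src in enumerate(perm) if bits[src] == 1),
            len(bits),
        )
        signature.append(value)
    return signature

def estimate_similarity(sig_a, sig_b):
    """Fraction of signature positions at which two signatures agree."""
    matches = sum(1 for a, b in zip(sig_a, sig_b) if a == b)
    return matches / len(sig_a)
```

For two vectors to be comparable, the same set of k permutations has to be applied to both, for example by generating the permutations once with a fixed seed and reusing them.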
The magnitudes of the values obtained in the min-hash signature are related, in part, to the number of “1”s in the input bit vector relative to the length of the input bit vector. For example, a bit vector may, on average, have roughly ⅛, 1/20, or 1/80 of its bits correspond to “1”s. A larger fraction of “1”s generally results in lower average min-hash values because fewer bits are scanned (on average) before the first “1” is located. These low-valued entries are inherently less informative about the underlying sequence of bits in the original bit vector than high-valued entries. To illustrate this concept, the min-hash process can be viewed as a variation of run-length encoding. For example, a min-hash value of 50 indicates a run of 50 “0”s followed by a “1” in the re-arranged bit vector after applying a permutation. Given this single min-hash value, the values in the original sequence (50 “0”s and one “1”) can be recovered. However, if the min-hash value indicates a run of zero “0”s, only the value for a single entry (the single “1”) can be recreated. Thus, different min-hash values encode different amounts of information about the original bit vector, depending on the actual output value (with higher values encoding more information).
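The run-length view can be made concrete with a short sketch. The helper names known_prefix and expected_minhash are hypothetical, and the expected-value formula rests on an added assumption that is not part of the description above: each bit of the re-arranged vector is treated as independently “1” with a fixed probability, and the vector is long enough that truncation can be ignored (a geometric approximation).

```python
from fractions import Fraction

def known_prefix(minhash_value):
    """Bits of the re-arranged vector implied by a single min-hash value:
    a run of `minhash_value` 0s followed by one 1."""
    return [0] * minhash_value + [1]

def expected_minhash(density):
    """Approximate mean number of 0s scanned before the first 1.

    Assumption: bits are independently 1 with probability `density`, so the
    run length is geometric with mean (1 - density) / density.
    """
    density = Fraction(density)
    return (1 - density) / density

for d in (Fraction(1, 8), Fraction(1, 20), Fraction(1, 80)):
    mean_value = expected_minhash(d)
    # A value v pins down v + 1 bits of the re-arranged vector.
    print(d, mean_value, len(known_prefix(int(mean_value))))
```

Under this approximation, a denser vector gives a shorter average run, and each min-hash value therefore determines fewer bits of the re-arranged vector.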
While low min-hash values have less discriminative power due to their relative lack of information about the input bit vector, high min-hash values are increasingly susceptible to distortion-induced errors. For example, consider the case where distortions are modeled as randomly distributed bit flips. The higher the “true” output value of the min-hash process, the more likely it is that a distortion will change that value, since there are more bits on which this value depends.
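The bit-flip example can be quantified with a small sketch. It assumes, beyond what is stated above, that each bit flips independently with the same probability flip_prob, and that a min-hash value v counts the “0”s before the first “1”. Under those assumptions the value depends on exactly v + 1 bits (the v leading “0”s and the “1” itself), so it survives only if none of them flips. The function name prob_value_unchanged is hypothetical.

```python
def prob_value_unchanged(minhash_value, flip_prob):
    """Probability that a min-hash value survives independent bit flips.

    Flipping any of the leading 0s to 1 lowers the value, and flipping the
    first 1 to 0 raises it, so all minhash_value + 1 bits must stay intact.
    """
    return (1.0 - flip_prob) ** (minhash_value + 1)

for value in (0, 10, 50, 200):
    print(value, prob_value_unchanged(value, flip_prob=0.01))
```

The survival probability falls geometrically with the min-hash value, which is the sense in which high values are more exposed to distortion under this model.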
Thus, conventional min-hash processes are limited by an uneven distribution of information about the original bit vector (at low output values) and susceptibility to distortions or errors (at high output values).