This specification relates to data processing.
For large datasets, it is useful to find representations for data elements that provide compact storage and efficient distance computation between elements. Hash functions are typically used to represent the extracted features of input data with descriptors that require less storage space and that can be compared for similarity at a lower computational cost than the original input data. Locality-sensitive hashing is one conventional method that uses a particular family of probabilistic hash functions to map similar input data to similar hashes.
Various types of data can be input to a hash function. A hash function maps each element of input data to a sequence of hash characters called a hash, where each hash character corresponds to a unique bit string. A hash collision occurs when a hash function maps two input data elements to the same hash. The feature representations of the input data elements can be hashed multiple times to generate multiple hashes for each input data element. For any two input data elements, the number of hash collisions between their respective hashes, taken as a fraction of the number of hashes generated for each element, gives an empirical approximation of the overall hash collision probability, which in turn gives an approximation of the distance between the two input data elements.
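The collision-based estimate described above can be illustrated with a minimal sketch. The text does not name a particular hash family, so the sketch assumes min-wise hashing (MinHash), a locality-sensitive family for which the probability that two sets collide under a single hash function equals their Jaccard similarity. The function names, the use of seeded BLAKE2b digests as the hash functions, and the choice of 256 hashes are all assumptions made for illustration only.

```python
import hashlib


def minhash_signature(features, num_hashes=256):
    """Hash a non-empty set of string features num_hashes times.

    Each seed selects a different (hypothetical) hash function; the
    hash kept for that seed is the smallest digest over all features.
    """
    signature = []
    for seed in range(num_hashes):
        salt = seed.to_bytes(8, "little")  # distinct salt per hash function
        signature.append(min(
            hashlib.blake2b(f.encode("utf-8"), digest_size=8, salt=salt).digest()
            for f in features
        ))
    return signature


def collision_fraction(sig_a, sig_b):
    """Fraction of positions at which the two signatures collide."""
    matches = sum(1 for a, b in zip(sig_a, sig_b) if a == b)
    return matches / len(sig_a)


def jaccard_similarity(a, b):
    """Exact Jaccard similarity of two sets, used here only for comparison."""
    return len(a & b) / len(a | b)


if __name__ == "__main__":
    # Two illustrative feature sets that share 50 of 150 distinct features.
    set_a = {f"f{i}" for i in range(100)}
    set_b = {f"f{i}" for i in range(50, 150)}
    estimate = collision_fraction(minhash_signature(set_a),
                                  minhash_signature(set_b))
    print("estimated similarity:", estimate)
    print("exact similarity:", jaccard_similarity(set_a, set_b))
    print("estimated Jaccard distance:", 1.0 - estimate)
```

Under these assumptions, increasing the number of hashes tightens the estimate at the cost of a longer signature, and only the signatures, not the original feature sets, need to be stored and compared.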
For example, a computer process can conventionally compute the similarity between two images by extracting features of each image to create a feature representation for that image and then comparing the respective feature representations. Features of an image can include, for example, histograms of image color or grayscale data, edges, corners, image centers of gravity, or other image points of interest. The features extracted from an image can be concatenated into a feature representation. The feature representations are typically compared using various distance metrics, for example, the Jaccard distance, the L1 distance, or the L2 distance. However, these distance metrics may be computationally expensive when computed on the original feature representations. In addition, the variety of features extracted from an image may require storage space that is orders of magnitude larger than the storage space required to store the image itself. Consequently, hash functions are typically used to reduce the storage requirements of the feature representations and to improve the performance of distance computation between the images.
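As a rough illustration of the conventional comparison described above, the following sketch extracts one simple feature, a grayscale histogram, from each of two tiny images and compares the results with the L1, L2, and Jaccard distances. The function names, the four-bin histogram, and the 2x2 example images are hypothetical and chosen only to keep the arithmetic easy to follow; a practical feature representation would concatenate many more features.

```python
import math


def gray_histogram(pixels, bins=4):
    """Count grayscale values (0-255) of a 2D pixel grid into equal-width bins."""
    hist = [0] * bins
    for row in pixels:
        for value in row:
            hist[min(value * bins // 256, bins - 1)] += 1
    return hist


def l1_distance(u, v):
    """Sum of absolute coordinate differences."""
    return sum(abs(a - b) for a, b in zip(u, v))


def l2_distance(u, v):
    """Euclidean distance."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def jaccard_distance(a, b):
    """One minus the ratio of shared elements to all elements of two sets."""
    union = a | b
    if not union:
        return 0.0  # two empty sets are treated as identical
    return 1.0 - len(a & b) / len(union)


if __name__ == "__main__":
    image_a = [[0, 64], [128, 255]]
    image_b = [[0, 0], [0, 255]]
    hist_a = gray_histogram(image_a)  # [1, 1, 1, 1]
    hist_b = gray_histogram(image_b)  # [3, 0, 0, 1]
    print("L1:", l1_distance(hist_a, hist_b))  # 4
    print("L2:", l2_distance(hist_a, hist_b))  # sqrt(6)
    # Jaccard distance on the sets of occupied histogram bins.
    bins_a = {i for i, count in enumerate(hist_a) if count}
    bins_b = {i for i, count in enumerate(hist_b) if count}
    print("Jaccard:", jaccard_distance(bins_a, bins_b))  # 0.5
```

Each of these computations touches every coordinate of the feature representation, which is why the cost grows with the size of the representation.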