A hash function, when applied to input data generates an output hash that is a very compact (usually binary) representation of the data. Hashes are very useful for searching and indexing data A conventional hashing method generates a small string of bits from the data. The method provides a probabilistic guarantee that the hashes of two different data are very different, even if the data are very similar but not identical. Thus, searching for a data in a large database can be performed very efficiently by using the hash as an index to the database. Because the hash of the data is significantly smaller than the data, the search is significantly more efficient.
Two hashes are identical only if the data used to generate the hashes are also identical. This feature of conventional hashes makes them suitable for indexing of discrete data such as text, or to provide guarantees that the data have been tampered with.
On the other hand, the same feature makes the hashes unsuitable for indexing data such as time series, sound, biometric data, or images, in which the goal of searching is to retrieve similar but not always identical data. For example, when searching for an image, a slight degraded image is very similar, but not exactly identical to the original image.
Using conventional hashing, the hashes generated for the two images are very different despite the image similarity. Thus, using the conventional hash of one image to search for the other images does not work.
Locality sensitive hashing (LSH) is a general approach to constructing hashes in which the similarity of the hash reflects the similarity of the data being hashed. Using LSH, the hashes generated from two similar data are very similar, while the hashes generated from two dissimilar data are also very dissimilar. Thus, the LSH can be used to search in a database for similar images, or the hash of a fingerprint can be used to search a database for similar fingerprints.
However, the prior state of the art in LSH functions provides methods that produce hashes in which all the bits provide the same information.