1. Technical Field
The present teaching relates to methods, systems and programming for data processing. Particularly, the present teaching is directed to methods, systems, and programming for computing measures to be used to identify similar data.
2. Discussion of Technical Background
Locality sensitive hashing (LSH) is a basic primitive in large-scale data processing algorithms that are designed to operate on objects (with features) in high dimensions. The idea behind LSH is to construct a family of functions that hash objects into buckets such that objects that are similar will be hashed to the same bucket with high probability. Here, the type of the objects and the notion of similarity between objects determine the particular hash function family. Typical instances include utilizing a Jaccard coefficient as similarity when the underlying objects are sets and a cosine/angle measure as similarity when the underlying objects are vectors in the Euclidean space.
Using this technique, large-scale data processing problems are made more tractable. For instance, in conjunction with standard indexing techniques, it becomes possible to do nearest-neighbor search efficiently: given a query, hash the query into a bucket, use the objects in the bucket as candidates, and rank the candidates according to the similarity of each candidate to the query. Likewise, popular operations such as approximate nearest neighbor search, near-duplicate detection, all-pairs similarity, similarity join/record-linkage, and temporal correlation are simplified.
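The query procedure described above (hash the query, take the bucket contents as candidates, rank by similarity) can be illustrated with a minimal sketch. The function names (make_planes, bucket_key, build_index, query), the choice of sign-of-inner-product hashing with Gaussian random directions, and cosine similarity as the ranking measure are assumptions made for illustration only; they are not taken from any particular system.

```python
import math
import random
from collections import defaultdict


def make_planes(dim, num_bits, seed=0):
    """Draw num_bits random directions (hypothetical helper).

    The directions are not normalized; only the sign of the inner
    product is used, so their length does not matter.
    """
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_bits)]


def bucket_key(vec, planes):
    """Hash a vector to a bucket: one sign bit per random direction."""
    return tuple(sum(a * b for a, b in zip(vec, p)) >= 0 for p in planes)


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def build_index(vectors, planes):
    """Map each bucket key to the names of the vectors hashed into it."""
    index = defaultdict(list)
    for name, vec in vectors.items():
        index[bucket_key(vec, planes)].append(name)
    return index


def query(q, vectors, index, planes):
    """Return the candidates in the query's bucket, most similar first."""
    candidates = index.get(bucket_key(q, planes), [])
    return sorted(candidates, key=lambda name: cosine(q, vectors[name]), reverse=True)
```

In this sketch only the objects that share the query's bucket are ever compared against the query, which is where the savings over a full scan come from; the price is that a true neighbor hashed to a different bucket is missed.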
A very simple example of this approach is the scheme commonly termed “SimHash”, in which the hash of an input vector is the sign of its inner product with a random unit vector (similarity hashing, meaning similar features hash to similar values; see http://www.fatvat.co.uk/2010/09/lets-get-hashing.html or “Similarity Estimation Techniques from Rounding Algorithms” by Charikar, Moses S., Proceedings of the 34th STOC, pages 380-388, 2002). It can be shown that the probability of the hashes of two vectors agreeing is a function of the angle between the underlying vectors. To improve the accuracy of this basic method, multiple random unit vectors are used. However, using multiple vectors requires more space and increases query time.
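As an illustration of the sign-of-inner-product hash, the following minimal sketch computes one bit per random unit vector and then estimates the angle between two inputs from the fraction of disagreeing bits. It relies on the known property of this scheme that two vectors at angle θ receive the same bit with probability 1 − θ/π. The function names and the use of Gaussian sampling to draw random directions are assumptions made for illustration.

```python
import math
import random


def random_unit_vector(dim, rng):
    """Draw a uniformly random direction by normalizing a Gaussian sample."""
    v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def simhash_bits(vec, planes):
    """One bit per random unit vector: 1 if the inner product is non-negative."""
    return [1 if sum(a * b for a, b in zip(vec, p)) >= 0 else 0 for p in planes]


def estimate_angle(bits_u, bits_v):
    """Estimate the angle (radians) from the fraction of disagreeing bits.

    Since Pr[bits agree] = 1 - theta/pi, the disagreement rate
    estimates theta/pi.
    """
    disagree = sum(1 for a, b in zip(bits_u, bits_v) if a != b)
    return math.pi * disagree / len(bits_u)
```

With a single random vector the estimate is either 0 or π; using more vectors tightens it, at the cost of storing and evaluating more bits per object, which is the space and query-time trade-off noted above.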
There have been efforts to make SimHash more efficient and practical. One method utilizes an entropy-based LSH to (provably) reduce the space requirements. In this scheme, in addition to considering the bucket corresponding to the query, buckets corresponding to perturbed versions of the query are also considered. Unfortunately, while the space requirements are reduced, the query time is considerably increased. Another method utilizes a careful probing heuristic to look up multiple buckets that have a high probability of containing the nearest neighbors of a query. This method obtains both space and query-time improvements, but does not offer any provable guarantees of performance.
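The general idea of consulting more than one bucket per query can be sketched as follows. This is a deliberately simplified stand-in and is not the perturbation scheme or the probing heuristic of either method described above: it assumes bucket keys are tuples of sign bits and simply probes every bucket whose key differs from the query's key in exactly one bit. The names probe_keys and multi_bucket_candidates are hypothetical.

```python
def probe_keys(key):
    """Return the query's own bucket key, then every key at Hamming distance 1."""
    keys = [tuple(key)]
    for i in range(len(key)):
        flipped = list(key)
        flipped[i] = not flipped[i]  # flip one sign bit
        keys.append(tuple(flipped))
    return keys


def multi_bucket_candidates(key, index):
    """Collect candidates from all probed buckets, without duplicates.

    `index` is assumed to map bucket keys to lists of object names.
    """
    seen = []
    for k in probe_keys(key):
        for name in index.get(k, []):
            if name not in seen:
                seen.append(name)
    return seen
```

Even in this simplified form, the trade-off is visible: a key of k bits leads to k + 1 bucket lookups per query instead of one, so fewer hash tables may suffice while each query does more work.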
Current technologies fall into three main categories: (1) data structures and indexing for similarity search, (2) LSH and related methods, and (3) the use of FFT-like methods for dimensionality reduction. Several indexing data structures have been proposed for nearest-neighbor search and approximate nearest neighbor search. Examples include R-tree, K-D tree, SR-tree, VA-file, A-tree, and AV-tree. However, these index structures do not scale well with the dimension of the data.
In addition, attempts have been made to obtain a fast version of a Johnson/Lindenstrauss transform. Furthermore, a version of the Johnson/Lindenstrauss theorem using circulant matrices and Gaussian random variables has also been developed. It is unclear, however, if this method can be adapted to either a Hadamard transform setting or to the angle setting (as opposed to the distance setting in the Johnson/Lindenstrauss theorem). Therefore, a desire exists to develop a faster similarity computation scheme to make data retrieval from large archives more feasible.