Modern search engines attempt to return results based not just on the forms of words in a query, but on their meaning, or semantic content. That is, search engines attempt to return results based on semantic similarity. For example, a search engine may search documents not only for the query terms themselves but also for related (semantically similar) terms that augment the search.
To calculate the similarity between natural language terms, a common technique is to employ a vector comparison. In particular, such techniques convert the terms into vectors based on their distribution among documents or nearby terms. The similarity of two terms is then determined by calculating the similarity of their vectors using a metric such as cosine similarity. Done naively, each vector must be compared to every other vector: to find the terms most similar to a given query term, every other term in the data set must be checked and its similarity calculated, which is onerous and time-consuming.
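The exhaustive comparison described above can be illustrated with a minimal sketch. The term vectors below are hypothetical document counts invented for illustration, and the function names `cosine_similarity` and `most_similar` are assumptions rather than part of any referenced system.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot / (norm_u * norm_v)

def most_similar(query, term_vectors, k=2):
    """Brute force: score the query term against every other term."""
    q = term_vectors[query]
    scored = [(cosine_similarity(q, vec), term)
              for term, vec in term_vectors.items() if term != query]
    scored.sort(reverse=True)
    return [term for _, term in scored[:k]]

# Hypothetical toy data: occurrences of each term in four documents.
term_vectors = {
    "car":    [4, 0, 3, 0],
    "auto":   [3, 0, 4, 0],
    "engine": [2, 1, 2, 0],
    "banana": [0, 5, 0, 2],
}

print(most_similar("car", term_vectors))  # ['auto', 'engine']
```

Each query in this sketch touches every vector in the data set, so the cost of a lookup grows linearly with the number of terms, and comparing all pairs grows quadratically.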
Consequently, techniques have been developed to reduce the complexity of the similarity calculations. One technique for scaling the term semantic similarity calculation is to index terms with “signatures” using locality sensitive hashing (LSH). (Hashing is a technique that maps large sets of data of variable length to smaller sets of fixed length.) A comparison between index signatures is typically much faster than a comparison between vectors, and approximates the similarity between the terms themselves.
In general, these index signatures are used to index and group term vectors. To look up related terms, the signatures are used to extract a subset of candidate vectors, which can then be compared to one another directly.
Different LSH techniques may be used to approximate different similarity measures. For cosine similarity, an LSH technique called “simhash” is often used. Simhash compares a vector to n random vectors and computes an n-length bitset signature from the results. However, these comparisons can take a long time, and the technique requires maintaining resources (a set of random vectors), which complicates deployment in a distributed computing environment. In some systems, a technique known as random indexing (RI) may be used for dimensionality reduction, specifically for the construction of vectors representing terms (or documents) in a large data set with a fixed-dimensionality vector space representation.
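By way of illustration only, the simhash signature described above might be sketched as follows. The function names, the use of Gaussian random vectors, and the bucketing by signature prefix are assumptions made for this sketch and are not drawn from any particular implementation.

```python
import math
import random
from collections import defaultdict

def make_hyperplanes(n_bits, dim, seed=0):
    """The n random vectors; every node must hold the same set."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_bits)]

def simhash(vec, hyperplanes):
    """One bit per random vector: 1 if the dot product is non-negative."""
    return tuple(1 if sum(a * b for a, b in zip(vec, h)) >= 0 else 0
                 for h in hyperplanes)

def estimated_cosine(sig_a, sig_b):
    """Estimate cosine similarity from the fraction of differing bits."""
    differing = sum(x != y for x, y in zip(sig_a, sig_b))
    return math.cos(math.pi * differing / len(sig_a))

def build_index(term_vectors, hyperplanes, prefix_bits):
    """Group terms into buckets keyed by a prefix of their signature."""
    buckets = defaultdict(list)
    for term, vec in term_vectors.items():
        buckets[simhash(vec, hyperplanes)[:prefix_bits]].append(term)
    return buckets

planes = make_hyperplanes(n_bits=64, dim=4)
sig_car = simhash([4, 0, 3, 0], planes)
sig_auto = simhash([3, 0, 4, 0], planes)
print(estimated_cosine(sig_car, sig_auto))  # approximates the true value, 0.96
```

In this sketch, a lookup would compute the query term's signature, fetch the bucket that shares its prefix, and compare only those candidates. The `planes` list is the resource noted above: computing a signature costs n dot products, and the same n random vectors must be available wherever signatures are computed.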
LSH and simhash techniques are described in Moses S. Charikar, “Similarity estimation techniques from rounding algorithms,” Proceedings of the thirty-fourth annual ACM symposium on Theory of computing. ACM, 2002; Moses S. Charikar, “Methods and apparatus for estimating similarity,” U.S. Pat. No. 7,158,961; Piotr Indyk and Rajeev Motwani, “Approximate nearest neighbors: towards removing the curse of dimensionality.” Proceedings of the thirtieth annual ACM symposium on Theory of computing, ACM, 1998; and Deepak Ravichandran, Patrick Pantel, and Eduard Hovy, “Randomized algorithms and NLP: Using locality sensitive hash functions for high speed noun clustering,” Proceedings of the 43rd Annual Meeting of the ACL, pages 622-629, Ann Arbor, June 2005, © 2005 Association for Computational Linguistics, all of which are hereby incorporated by reference in their entireties as if fully set forth herein.
Some methods additionally make use of a technique known as “random indexing,” a method for constructing low dimensionality vectors for semantic similarity calculation. Details on random indexing techniques are described in Magnus Sahlgren, “An introduction to random indexing,” Methods and Applications of Semantic Indexing Workshop at the 7th International Conference on Terminology and Knowledge Engineering, TKE, Vol. 5, 2005; and Erik Veldal, “Random Indexing Re-Hashed,” Bolette Sandford Pedersen, Gunta Nespore and Inguna Skadin, (Eds.), NODALIDA 2011 Conference Proceedings, pp. 224-229, 2011, both of which are hereby incorporated by reference in their entireties as if fully set forth herein.
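As a rough sketch of the general idea behind random indexing, each context (here, a document) may be assigned a sparse random index vector of fixed dimensionality, and the vector for a term may be accumulated as the sum of the index vectors of the documents in which the term occurs. The dimensionality, the number of non-zero entries, the whitespace tokenization, and the function names below are illustrative assumptions only.

```python
import random

def index_vector(dim, nonzeros, rng):
    """Sparse ternary vector: a few +1/-1 entries, zeros elsewhere."""
    vec = [0] * dim
    positions = rng.sample(range(dim), nonzeros)
    for i, pos in enumerate(positions):
        vec[pos] = 1 if i % 2 == 0 else -1
    return vec

def random_indexing(documents, dim=16, nonzeros=4, seed=0):
    """Build fixed-dimension term vectors by summing document index vectors."""
    rng = random.Random(seed)
    term_vectors = {}
    for doc in documents:
        doc_index = index_vector(dim, nonzeros, rng)
        for term in doc.split():
            acc = term_vectors.setdefault(term, [0] * dim)
            for i, value in enumerate(doc_index):
                acc[i] += value
    return term_vectors

# Hypothetical toy corpus of three short documents.
vectors = random_indexing(["car engine", "car auto", "banana"])
print(len(vectors["car"]))  # 16, regardless of how many documents are added
```

The dimensionality of every term vector in this sketch stays fixed at `dim` as documents are added, which is the property that makes the representation suitable for large data sets.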