The present disclosure relates generally to semantic search technologies, and more specifically to automatically linking text to concepts in a knowledge base.
Traditional information retrieval technologies are based on indexing data using keywords. A drawback of keyword indexing is that when a query keyword is not present in a document, the document is not considered as a possible match for the query. One approach that has been employed to improve this situation is the application of query expansion techniques, in which the original query is substituted with variants. When using query expansion techniques, a search term is substituted with a synonym, independent searches are performed on the search term and the synonym, and the results of the searches are subsequently joined. In principle, the larger the number of different variants that are tried at query time, the higher the potential quality of the returned results, assuming that a good joining technique is in place. One disadvantage of query expansion techniques is that the need for very fast response times implies either that these techniques must be relatively inexpensive or, alternatively, that significant computational resources must be devoted to a search query so that the independent queries can be completed in time. This disadvantage is particularly acute when searching with extremely rich ontologies, because query expansion may turn a single search into potentially tens of thousands of searches.
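By way of illustration only, the query expansion behavior described above can be sketched as follows. The corpus `DOCS`, the synonym table `SYNONYMS`, and the function names are hypothetical and are not taken from any particular system; the joining step is assumed to be a simple set union, whereas a practical system would typically use a more elaborate ranking or merging technique.

```python
from collections import defaultdict

# Hypothetical toy corpus and synonym table, for illustration only.
DOCS = {
    1: "the car was parked outside",
    2: "an automobile needs regular service",
    3: "the bicycle has two wheels",
}
SYNONYMS = {"car": ["automobile"]}


def build_index(docs):
    """Build a keyword index mapping each token to the ids of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.lower().split():
            index[token].add(doc_id)
    return index


def keyword_search(index, term):
    """Plain keyword lookup: a document matches only if it contains the term itself."""
    return set(index.get(term.lower(), set()))


def expanded_search(index, term, synonyms):
    """Run one independent search per variant and join the results by set union.

    Returns the joined result set and the number of searches that were issued.
    """
    variants = [term] + synonyms.get(term, [])
    results = set()
    for variant in variants:
        results |= keyword_search(index, variant)
    return results, len(variants)


index = build_index(DOCS)
plain_hits = keyword_search(index, "car")
expanded_hits, searches_issued = expanded_search(index, "car", SYNONYMS)
```

In this sketch the plain keyword search misses document 2 because the word "car" does not appear in it, while the expanded search finds it at the cost of a second lookup. The number of lookups grows with the number of variants per term, which is the cost concern noted above for rich ontologies.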
Within the field of semantic reasoning techniques, one of the most popular techniques is latent semantic analysis (LSA). LSA functions by projecting a document's representation, typically in the form of a term-frequency/inverse-document-frequency (TF-IDF) vector, onto a lower dimensional space called the latent semantic space. This projection is performed using matrix factorization techniques such as singular value decomposition (SVD), or by using statistical inference techniques such as expectation-maximization, as used in probabilistic latent semantic indexing (PLSI). Every dimension in this lower dimensional space is meant to represent an abstract concept that is generated automatically by the LSA technique. The LSA idea has given rise to numerous variants that generally share the attributes described above. A disadvantage of the LSA family of techniques is that, by themselves, they are not able to take advantage of the large volumes of crowdsourced data that have become available through the popularity of websites such as Wikipedia.
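As a minimal sketch of the SVD-based form of LSA described above, the following example builds TF-IDF vectors for a small hypothetical corpus and projects them onto a latent space of `k` dimensions. The example assumes the NumPy library is available; the function names, the sample documents, and the particular IDF weighting (`log(N / df) + 1`) are illustrative assumptions, and the PLSI variant based on expectation-maximization is not shown.

```python
import numpy as np


def tfidf_matrix(docs):
    """Return a (documents x terms) TF-IDF matrix and the sorted vocabulary."""
    vocab = sorted({token for doc in docs for token in doc.lower().split()})
    column = {token: j for j, token in enumerate(vocab)}
    tf = np.zeros((len(docs), len(vocab)))
    for i, doc in enumerate(docs):
        for token in doc.lower().split():
            tf[i, column[token]] += 1.0
    df = (tf > 0).sum(axis=0)            # number of documents containing each term
    idf = np.log(len(docs) / df) + 1.0   # assumed IDF variant, for illustration
    return tf * idf, vocab


def lsa_project(X, k):
    """Project the rows of X onto the top-k latent dimensions using a truncated SVD."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    # Rows of Vt[:k] are the latent "concept" directions in term space;
    # U[:, :k] * s[:k] gives each document's coordinates along those directions.
    return U[:, :k] * s[:k], Vt[:k]


# Hypothetical sample documents, for illustration only.
docs = [
    "car engine repair",
    "automobile engine service",
    "bicycle wheel repair",
    "bicycle tire service",
]
X, vocab = tfidf_matrix(docs)
Z, components = lsa_project(X, k=2)
```

Each row of `Z` is a document expressed in the latent semantic space, and each row of `components` is one automatically generated abstract dimension expressed as weights over the vocabulary. Nothing in this procedure consults an external, human-curated concept inventory, which is the limitation noted above.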