In the field of information retrieval, semantic search techniques build a semantic model from a set of documents (webpages, emails, or files on a file system, for example) and, given a search query, find the set of documents that best relate to that query. The conventional method is to build an inverted index of the words occurring across all documents and then, using various relevance metrics, compare the words of the search query (itself treated as another kind of document) against the index to produce a ranked set of files that are ‘closest’ to the query. In practice, this approximates semantic search because words that represent the same semantic concept tend to co-occur.
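As a rough illustration of the conventional method, the following minimal sketch builds an inverted index over a toy corpus and ranks documents against a query. The corpus, the whitespace tokenizer, and the summed TF-IDF score are assumptions chosen for brevity; they stand in for whichever relevance metric a real system would use.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    # Assumption: lowercase whitespace tokenization is enough for this toy corpus.
    return text.lower().split()


def build_inverted_index(docs):
    """Map each term to a posting dict {doc_id: term frequency}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for term, tf in Counter(tokenize(text)).items():
            index[term][doc_id] = tf
    return index


def rank(query, index, n_docs):
    """Score documents by the summed TF-IDF weight of matching query terms."""
    scores = defaultdict(float)
    for term in tokenize(query):
        postings = index.get(term, {})
        if not postings:
            continue  # term never seen in the corpus
        idf = math.log(n_docs / len(postings))
        for doc_id, tf in postings.items():
            scores[doc_id] += tf * idf
    # Highest score first; documents matching no query term are omitted.
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Hypothetical three-document corpus.
docs = {
    "a": "the cat sat on the mat",
    "b": "the dog chased the cat",
    "c": "stock markets fell sharply today",
}
index = build_inverted_index(docs)
results = rank("cat mat", index, len(docs))
print(results)
```

Document "a" ranks first because it matches both query terms, and the rarer term "mat" carries more weight than "cat", which appears in two of the three documents. Document "c" shares no term with the query and is not returned at all, which is the limitation that concept-based models try to address.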
Early methods such as Latent Semantic Analysis (LSA) compute the singular value decomposition (SVD) of a matrix derived from the inverted document-word index. For reasonable accuracy, one must specify ahead of time the number of dimensions k to retain in the decomposition, and this choice can dramatically affect overall search results. More recent approaches based on principled probabilistic models, including probabilistic latent semantic analysis (PLSA) and ranking support vector machines (SVMs), bypass the resource-intensive SVD computation and accomplish the same task, but they also require that the number of concepts be known at training time.
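The role of k can be made concrete with a small sketch of LSA using a truncated SVD. The term-document counts below are made up, the helper names (`lsa`, `fold_in_query`, `cosine`) are hypothetical, and the query is folded in with the standard projection q_k = S_k^{-1} U_k^T q; NumPy is assumed to be available.

```python
import numpy as np

# Hypothetical term-document counts: rows are terms, columns are documents.
terms = ["cat", "dog", "pet", "stock", "market"]
A = np.array([
    [2, 0, 1, 0],  # cat
    [0, 2, 1, 0],  # dog
    [1, 1, 2, 0],  # pet
    [0, 0, 0, 2],  # stock
    [0, 0, 0, 1],  # market
], dtype=float)


def lsa(A, k):
    """Return the rank-k truncated SVD factors of A."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return U[:, :k], s[:k], Vt[:k, :]


def fold_in_query(q, U_k, s_k):
    """Project a query term vector into the k-dimensional concept space."""
    return (q @ U_k) / s_k


def cosine(a, b):
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom > 0 else 0.0


k = 2  # number of latent concepts, fixed ahead of time
U_k, s_k, Vt_k = lsa(A, k)
doc_vecs = Vt_k.T  # one row per document in concept space

q = np.array([0, 1, 0, 0, 0], dtype=float)  # query containing only "dog"
q_k = fold_in_query(q, U_k, s_k)
scores = [cosine(q_k, d) for d in doc_vecs]
print(np.round(scores, 3))
```

With k = 2 the first document, which never contains the word "dog", still scores highly for the query because "cat", "dog", and "pet" collapse into a single concept, while the finance document scores near zero. Raising k to 3 would start to separate the cat and dog documents again, which illustrates how strongly the results depend on this parameter.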
While these methods work quite well in limited domains (such as spam filtering), they prove infeasible for full-fledged desktop search, typically for one of three reasons: a) the user has few or no files from which to build an index, leading to sparse data and therefore sub-optimal searches; b) the user has hundreds of gigabytes of data, leading to huge indexes and prohibitive computation times when building the model; or c) even with a reasonable index size, it can be difficult to find the optimal k parameter for each individual dataset.