As collections of natural language documents grow large, tools are needed to browse, search, manipulate, analyze, and manage them. In particular, searching for similar documents within a collection plays an important role in text mining and document management. For example, the capability to search for similar documents is a key function in many business enterprise applications.
Many existing search techniques locate similar documents by searching for matching strings in the documents. That is, similar documents are found by matching keywords between the documents. For example, Latent Semantic Indexing (LSI) is an indexing and retrieval method that uses a mathematical technique called singular value decomposition (SVD) to identify patterns in the relationships between the terms in a body of text. Additionally, LSI can extract the conceptual content of a body of text by establishing associations between terms that occur in similar contexts. LSI is based on the principle that words used in the same contexts tend to have similar meanings.
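The LSI idea described above can be sketched in a few lines. This is a minimal illustration rather than a production implementation: the toy corpus, the raw word-count weighting, and the choice of two latent dimensions (`k = 2`) are all assumptions made for the example, and NumPy is assumed to be available.

```python
import numpy as np

# Hypothetical toy corpus, invented for this illustration.
docs = [
    "car engine repair",
    "automobile engine repair",
    "car automobile dealer",
    "fruit apple banana",
    "apple banana smoothie",
]

# Build the vocabulary and a term-document count matrix
# (rows are terms, columns are documents).
vocab = sorted({word for doc in docs for word in doc.split()})
row_of = {word: i for i, word in enumerate(vocab)}
A = np.zeros((len(vocab), len(docs)))
for j, doc in enumerate(docs):
    for word in doc.split():
        A[row_of[word], j] += 1.0

# Singular value decomposition: A = U @ diag(s) @ Vt.
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Keep only the k largest singular values (k is an assumed setting).
k = 2
doc_vecs = (np.diag(s[:k]) @ Vt[:k]).T  # one k-dimensional row per document


def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


# Compare documents in the reduced (latent) space.
print("doc 0 vs doc 2:", cosine(doc_vecs[0], doc_vecs[2]))
print("doc 0 vs doc 3:", cosine(doc_vecs[0], doc_vecs[3]))
```

In this toy corpus, documents 0 and 2 share only the word "car", yet they land close together in the reduced space because "car" and "automobile" appear alongside the same neighboring words. Documents 0 and 3 share no vocabulary and remain far apart.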
However, the methodology used by LSI to extract the conceptual content is often inaccurate because the texts themselves frequently do not provide sufficient context information. For example, many technical documents abbreviate names and phrases, and context determined from the abbreviations is often inaccurate because of an inherent mismatch between the actual names and their abbreviations.
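The mismatch between names and their abbreviations can be made concrete with a simple keyword-overlap measure. The sketch below uses Jaccard similarity over word sets as a stand-in for string-matching search in general; the measure, the helper names, and the sample sentences are assumptions chosen for illustration, not something the text prescribes.

```python
def tokens(text):
    """Lowercase a text and split it into a set of words."""
    return set(text.lower().split())


def jaccard(a, b):
    """Word-overlap similarity: shared words divided by all distinct words."""
    ta, tb = tokens(a), tokens(b)
    return len(ta & tb) / len(ta | tb)


# Hypothetical sentences: the same statement spelled out, abbreviated,
# and reworded with the full names kept.
spelled = "latent semantic indexing uses singular value decomposition"
abbreviated = "LSI uses SVD"
reworded = "singular value decomposition is used by latent semantic indexing"

print("spelled vs abbreviated:", jaccard(spelled, abbreviated))
print("spelled vs reworded:   ", jaccard(spelled, reworded))
```

The abbreviated sentence says the same thing as the spelled-out one, but the two share only the word "uses", so a string-based comparison scores them as far less similar than the reworded pair.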