The following description relates to information management systems and techniques for information retrieval.
When people wish to find information in a large collection of electronically stored documents, some form of search technology may be employed. A system that employs search technology is known as an information management system and may include a data repository where collections of documents are stored. A user of an information management system may wish to search documents scattered within a data repository to obtain various types of information. For example, a user may wish to extract statistical information about terms used in documents or identify sets of documents that are similar to given documents. The technology employed to achieve those ends may be described generically as text-mining functionality.
In many implementations, text-mining functionality is based on a mathematical model called the vector space model. In the vector space model, terms may correspond to dimensions in a vector space, and documents may correspond to vectors, such that each nonzero component of a document vector corresponds to a term that appears in the corresponding document. A matrix formed from the document vectors, indexed by term along one axis and by document along the other, may be known as a term-document matrix.
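As an illustration only, a term-document matrix of the kind described above may be sketched as follows. The function name build_term_document_matrix, the whitespace tokenization, the use of raw term counts as vector components, and the terms-as-rows orientation are all assumptions made for this example and are not taken from the description.

```python
from collections import Counter


def build_term_document_matrix(documents):
    """Return (terms, matrix), where matrix[i][j] is the count of terms[i] in documents[j].

    Hypothetical helper: tokenization is a plain lowercase whitespace split,
    and components are raw term counts (both are assumptions).
    """
    tokenized = [doc.lower().split() for doc in documents]
    # Each distinct term becomes one dimension of the vector space.
    terms = sorted({term for tokens in tokenized for term in tokens})
    counts = [Counter(tokens) for tokens in tokenized]
    # One row per term, one column per document; absent terms count as 0.
    matrix = [[doc_counts[term] for doc_counts in counts] for term in terms]
    return terms, matrix


docs = [
    "search finds documents",
    "documents store terms",
    "search search terms",
]
terms, matrix = build_term_document_matrix(docs)
for term, row in zip(terms, matrix):
    print(term, row)
```

In this sketch, each column of the matrix is one document vector, and a nonzero entry indicates that the term for that row appears in the document for that column.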
In the scenario where a user wishes to identify sets of documents that are similar to one or more given documents, the time and resources required to calculate similarity may be considerable. For example, in some implementations of the vector space model, if similarity values are calculated over a term-document matrix for millions of documents, the running time may range from multiple hours to several days. In addition, obtaining quality results while minimizing the calculation overhead may be challenging.
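To make the cost concrete, the following sketch computes a similarity value for every pair of document vectors. The description does not name a particular similarity measure; cosine similarity is assumed here because it is commonly used with the vector space model, and the function names and sample vectors are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors (assumed measure)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # an empty document is treated as similar to nothing
    return dot / (norm_u * norm_v)


def all_pairs_similarity(doc_vectors):
    """Compare every unordered pair of documents: n * (n - 1) / 2 comparisons."""
    n = len(doc_vectors)
    sims = {}
    for i in range(n):
        for j in range(i + 1, n):
            sims[(i, j)] = cosine_similarity(doc_vectors[i], doc_vectors[j])
    return sims


# Three sample document vectors over five terms (illustrative data only).
doc_vectors = [
    [1, 1, 1, 0, 0],
    [1, 0, 0, 1, 1],
    [0, 0, 2, 0, 1],
]
sims = all_pairs_similarity(doc_vectors)
for pair, value in sorted(sims.items()):
    print(pair, round(value, 3))
```

Because the number of pairs grows quadratically with the number of documents, a naive all-pairs pass of this kind over one million documents would involve roughly 5 x 10^11 comparisons, which is consistent with the long running times noted above.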