1. Field of the Invention
The present invention relates generally to information processing and data retrieval, and in particular to text processing.
2. Background
Data retrieval is of utmost importance in the current Age of Information. Presently, myriad documents are available in electronic form, accessible via the Internet and stored in such places as proprietary databases, microcomputer hard drives, and hand-held devices. In the future, both the number of electronic documents available and the rate at which these documents are produced will only increase. Amid this vast sea of information, a user must be able to locate and retrieve documents of interest.
One well-known approach for locating and retrieving documents of interest is a keyword search. In a keyword search, a document is located and retrieved if the word(s) of a user's query explicitly appear in the document. However, there are at least two problems with this approach. First, a keyword search will not retrieve a document that is conceptually relevant to the user's query if the document does not contain the exact word(s) of the query. Second, a keyword search may retrieve a document that is not conceptually relevant to the intended meaning of a user's query. This may occur because words often have multiple meanings or senses. For example, the word “tank” has a meaning associated with “a military vehicle” and a meaning associated with “a container.”
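The two problems described above can be illustrated with a minimal sketch of exact-match keyword retrieval. The function name `keyword_search` and the three sample documents are hypothetical and chosen only for illustration; actual keyword search systems typically add indexing, stemming, and ranking, none of which is shown here.

```python
def keyword_search(query, documents):
    """Return the indices of documents that contain every query word verbatim."""
    query_words = set(query.lower().split())
    hits = []
    for index, text in enumerate(documents):
        doc_words = set(text.lower().split())
        # A document matches only if all query words explicitly appear in it.
        if query_words <= doc_words:
            hits.append(index)
    return hits


# Hypothetical sample collection (assumption, not taken from any real corpus).
documents = [
    "the army deployed an armored vehicle",        # military sense, but no word "tank"
    "the fish tank holds forty gallons of water",  # contains "tank" in the container sense
    "the tank advanced across the field",          # contains "tank" in the military sense
]

# A user interested in military vehicles searches for "tank".
hits = keyword_search("tank", documents)
print(hits)
```

In this sketch, document 0 is conceptually relevant but is not retrieved because it lacks the exact word, while document 1 is retrieved even though it uses "tank" in the container sense.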
A technique called Latent Semantic Indexing (LSI) offers a superior alternative to simple keyword searching. LSI is described, for example, in commonly-owned U.S. Pat. No. 4,839,853 to Deerwester et al., the entirety of which is incorporated by reference herein. According to LSI, a mathematical vector space, called an LSI space, is used to represent a collection of documents and terms contained in that collection of documents. In the LSI technique, a document is determined to be conceptually relevant to a user's query based on the proximity between the document and the user's query, wherein proximity is measured in the LSI space. The performance of LSI-based document retrieval far exceeds that of keyword searching because documents that are conceptually similar to the query are retrieved even when the user's query and the retrieved documents use different terms to describe similar concepts.
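The idea of measuring proximity in an LSI space can be sketched as follows. This is an illustrative example under stated assumptions, not the method of the referenced patent: it assumes the LSI space is obtained from a truncated singular value decomposition (SVD) of a raw term-by-document count matrix, that a query is projected into the space by the commonly described convention of multiplying by the term vectors and dividing by the singular values, and that proximity is measured by cosine similarity. The sample documents, the dimension `k = 2`, and the helper names `term_vector`, `project_query`, and `cosine` are hypothetical.

```python
import numpy as np

# Hypothetical sample collection: two topics with no shared vocabulary.
documents = [
    "car engine repair",
    "automobile engine repair",   # never uses the word "car"
    "car automobile dealer",
    "fruit apple orchard",
    "apple orchard harvest",
]

vocabulary = sorted({word for text in documents for word in text.split()})
term_index = {term: i for i, term in enumerate(vocabulary)}


def term_vector(text):
    """Count how often each vocabulary term occurs in the text."""
    vec = np.zeros(len(vocabulary))
    for word in text.split():
        if word in term_index:
            vec[term_index[word]] += 1.0
    return vec


# Term-by-document matrix: one row per term, one column per document.
term_doc = np.column_stack([term_vector(text) for text in documents])

# Truncated SVD: keep only the k largest singular values (assumed k = 2).
U, s, Vt = np.linalg.svd(term_doc, full_matrices=False)
k = 2
U_k, s_k = U[:, :k], s[:k]
doc_vectors = Vt[:k, :].T  # one k-dimensional vector per document


def project_query(text):
    """Place a query in the reduced space (assumed projection convention)."""
    return term_vector(text) @ U_k / s_k


def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


query_vec = project_query("car")
similarities = [cosine(query_vec, d) for d in doc_vectors]
for text, sim in zip(documents, similarities):
    print(f"{sim:+.3f}  {text}")
```

In this sketch, the query "car" lies close to the document "automobile engine repair" even though that document does not contain the word "car", because the two share a context of co-occurring terms. The two orchard documents lie far from the query.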
An LSI space is created from a collection of documents. Each document in the collection, and each unique term contained in the documents, has a vector representation in the LSI space. According to current LSI techniques, however, the LSI space is typically recreated each time the influence of additional documents and/or terms is to be included. That is, a new LSI space is created based on the original collection of documents and the additional documents and/or terms. Given the current rate at which documents are created, and the likelihood that this rate will increase in the future, this method for incorporating the influence of additional documents and/or terms is problematic because creating an LSI space is a computationally expensive and time-consuming process.
Therefore, what is needed is a method and computer program product for perturbing an abstract mathematical vector space that represents a collection of documents (e.g., an LSI space) to incorporate the influence of additional documents and/or terms. Such a method and computer program product should not require that the abstract mathematical vector space be recreated each time the influence of additional documents and/or terms is to be included. In addition, such a method and computer program product should allow the influence of documents and/or terms to be removed from the abstract mathematical vector space without requiring that the abstract mathematical vector space be recreated.