Some of the main functions of computer technology are to help people efficiently store large amounts of information, accurately cluster the information, and quickly locate some piece of key information that they need. Searching and retrieval may be carried out online through networks or offline on bulk storage systems.
Prior information search and retrieval methods have used clustering techniques and a Vector Space Model (VSM), where each unique word within a collection of documents represents a dimension in space, and where each document represents a vector within that multidimensional space. Vectors that are close together in this multidimensional space form clusters, or groups of documents that are similar. The quality of information retrieval and data clustering is usually judged by two metrics: precision and recall. Precision refers to the percentage of documents retrieved that are relevant to the query, and recall reflects the percentage of all relevant documents that have been retrieved.
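The vector space representation and the two quality metrics described above can be illustrated with a minimal sketch. The function names (`build_vectors`, `cosine`, `precision_recall`), the use of raw term counts, and the three sample documents are assumptions made for illustration only; they are not taken from the cited systems.

```python
import math
from collections import Counter


def build_vectors(docs):
    """Map each document to a term-count vector over the collection vocabulary.

    Each unique word in the collection is one dimension (illustrative only;
    real systems typically apply weighting such as TF-IDF).
    """
    vocab = sorted({word for doc in docs for word in doc.lower().split()})
    vectors = []
    for doc in docs:
        counts = Counter(doc.lower().split())
        vectors.append([counts[word] for word in vocab])
    return vocab, vectors


def cosine(u, v):
    """Cosine similarity: vectors that point in similar directions score near 1."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def precision_recall(retrieved, relevant):
    """Precision: share of retrieved documents that are relevant.
    Recall: share of relevant documents that were retrieved."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical three-document collection.
docs = ["car engine repair", "automobile engine repair", "stock market report"]
vocab, vecs = build_vectors(docs)
```

In this sketch the first two documents share two of their three words and therefore lie closer together than either does to the third, which shares no words with them. Retrieving all three documents when only the first two are relevant gives full recall but reduced precision.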
Examples of such systems are disclosed in Potok et al., U.S. Pat. No. 7,072,883 and Potok et al., US 2003/0120639.
Attempts to improve the precision and recall of information retrieval and data categorization are often hindered by two characteristics of textual data: the synonymy (multiple words with the same meaning) and polysemy (a single word with multiple meanings) that exist in natural languages, and the high dimensionality of the data (each unique word in the document collection is a dimension). Latent Semantic Indexing (LSI) is known as one of the most effective solutions to these problems. The underlying technology of LSI is the truncated singular value decomposition (SVD). In addition to alleviating the negative impact of synonymy and polysemy, this technique reduces the number of dimensions of a VSM, and therefore reduces the amount of space required to store information.
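The effect of the truncated SVD on synonymy can be sketched as follows, assuming NumPy is available. The function names (`lsi_project`, `cosine`), the small term-document matrix, and the choice of k = 2 are hypothetical and chosen only to make the effect visible; this is not the implementation of any particular LSI system.

```python
import numpy as np


def lsi_project(term_doc, k):
    """Project documents into a k-dimensional latent space via truncated SVD.

    term_doc has one row per term and one column per document.
    Returns an array of shape (n_docs, k).
    """
    U, s, Vt = np.linalg.svd(term_doc, full_matrices=False)
    # Keep only the k largest singular values and their right singular vectors.
    return (np.diag(s[:k]) @ Vt[:k, :]).T


def cosine(u, v):
    """Cosine similarity between two NumPy vectors."""
    denom = np.linalg.norm(u) * np.linalg.norm(v)
    return float(u @ v / denom) if denom else 0.0


# Hypothetical term-document matrix (rows: terms, columns: documents).
# doc 0: "car engine", doc 1: "automobile engine", doc 2: "stock market"
A = np.array([
    [1, 0, 0],  # car
    [0, 1, 0],  # automobile
    [1, 1, 0],  # engine
    [0, 0, 1],  # stock
    [0, 0, 1],  # market
], dtype=float)

raw_similarity = cosine(A[:, 0], A[:, 1])   # similarity in the full term space
latent = lsi_project(A, k=2)                # 5 dimensions reduced to 2
lsi_similarity = cosine(latent[0], latent[1])
```

In the full term space, documents 0 and 1 are only partly similar because "car" and "automobile" occupy separate dimensions. After truncation to two dimensions, the two documents map to essentially the same direction, while each document is stored with two coordinates instead of five.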
A technical problem is that computing the SVD is computationally expensive, so results take a long time to produce. It therefore cannot be used to process high-volume data streams, where new data arrive in the system at high frequency. Most recent work in this area has focused on incremental SVD updating schemes. However, it is mathematically provable that SVD updating schemes can never reach linear computational complexity.