Latent Semantic Indexing (LSI) is an advanced information retrieval (IR) technology, a variant of the vector retrieval method, that exploits dependencies or “semantic similarity” between terms. It is assumed that there exists some underlying or “latent” structure in the pattern of word usage across data objects, such as documents, and that this structure can be discovered statistically. One significant benefit of this approach is that, once a suitable reduced vector space is computed for a collection of documents, a query can retrieve documents similar in meaning or concept even though the query and document share no matching terms.
An LSI approach to information retrieval is detailed in commonly assigned U.S. Pat. No. 4,839,853, which applies a singular-value decomposition (SVD) to a term-document matrix for a collection, where each entry gives the number of times a term appears in a document. A large term-document matrix is typically decomposed into a set of approximately 150 to 300 orthogonal factors from which the original matrix can be approximated by linear combination. In the LSI-generated vector space, terms and documents are represented by continuous values on each of these orthogonal dimensions and hence are given numerical representation in the same space. Mathematically, assume a collection of m documents with n unique terms that together form an n×m sparse matrix E with terms as its rows and documents as its columns, where each entry in E gives the number of times a term appears in a document. In the usual case, log-entropy weighting (log(tf+1)·entropy) is applied to these raw frequency counts before applying the SVD. The structure attributed to document-document and term-term dependencies is expressed mathematically in equation (1) as the SVD of E:

E = U(E)Σ(E)V(E)^T  (1)

where U(E) is an n×n matrix such that U(E)^T U(E) = I_n, Σ(E) is an n×n diagonal matrix of singular values, and V(E) is an m×n matrix such that V(E)^T V(E) = I_n, assuming for simplicity that E has fewer terms than documents.
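The weighting and decomposition steps described above can be sketched with NumPy. The toy term-document matrix is hypothetical, and the exact entropy formula shown is one common formulation of log-entropy weighting among several variants:

```python
import numpy as np

# Hypothetical toy term-document matrix E (n terms x m documents) of raw counts.
E = np.array([
    [2.0, 0.0, 1.0, 0.0, 1.0],
    [1.0, 1.0, 0.0, 0.0, 2.0],
    [0.0, 2.0, 0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0, 2.0, 1.0],
])
n, m = E.shape  # n terms, m documents (n < m, as assumed in the text)

def log_entropy(E):
    """log(tf + 1) local weight times a global entropy weight
    (one common formulation; variants exist)."""
    m = E.shape[1]
    local = np.log(E + 1.0)
    gf = E.sum(axis=1, keepdims=True)              # global term frequency
    p = np.divide(E, gf, out=np.zeros_like(E), where=gf > 0)
    with np.errstate(divide="ignore", invalid="ignore"):
        plogp = np.where(p > 0, p * np.log(p), 0.0)
    global_w = 1.0 + plogp.sum(axis=1) / np.log(m)
    return local * global_w[:, None]

W = log_entropy(E)
# Economy-size SVD: W = U @ diag(s) @ Vt, with U (n x n) and Vt (n x m)
U, s, Vt = np.linalg.svd(W, full_matrices=False)
assert np.allclose(U @ np.diag(s) @ Vt, W)   # exact reconstruction
assert np.allclose(U.T @ U, np.eye(n))       # U has orthonormal columns
```

The economy-size SVD returned by `full_matrices=False` matches the dimensions stated in equation (1) when the matrix has fewer terms than documents.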
Of course, the attraction of the SVD is that it can be used to project E onto a lower-dimensional vector space of rank k, as set forth in the rank-k reconstruction of equation (2):

E_k = U_k(E)Σ_k(E)V_k(E)^T  (2)
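The rank-k reconstruction of equation (2) amounts to keeping only the k largest singular values and their singular vectors. A minimal sketch, using a hypothetical random count matrix in place of a real collection:

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical 50-term, 80-document matrix of raw counts.
E = rng.poisson(1.0, size=(50, 80)).astype(float)

U, s, Vt = np.linalg.svd(E, full_matrices=False)

k = 10  # number of retained factors (150 to 300 is typical for real collections)
Ek = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]  # rank-k reconstruction, eq. (2)

# By the Eckart-Young theorem, Ek is the best rank-k approximation of E in
# the Frobenius norm; the error equals the tail of the singular values.
err = np.linalg.norm(E - Ek, "fro")
assert np.isclose(err, np.sqrt((s[k:] ** 2).sum()))
```

For the massive sparse matrices discussed later in this section, a truncated sparse solver would normally be used instead of a dense SVD, since only the first k factors are needed.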
Because the number of factors can be much smaller than the number of unique terms used to construct this space, words will not be independent. Words similar in meaning, and documents with similar content based on the words they contain, will be located near one another in the LSI space. These dependencies enable one to query not only documents with terms, but also terms with documents, terms with terms, and documents with other documents. In fact, the LSI approach simply treats a query as a “pseudo-document”: a weighted vector sum based on the words it contains. In the LSI space, the cosine or dot product between term or document vectors corresponds to their estimated similarity, and this measure of similarity can be exploited in interesting ways to query and filter documents. The measure of correspondence between query vector q and document vector d is given by equation (3):

sim(U_k(E)^T q, U_k(E)^T d)  (3)

In “Using Linear Algebra for Intelligent Information Retrieval” by M. Berry et al., SIAM Review 37(4), pp. 573–595, the authors provide a formal justification for using the matrix of left singular vectors U_k(E) as a vector lexicon.
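The pseudo-document treatment of a query and the similarity measure of equation (3) can be sketched as follows; the random count matrix and the choice of query terms are assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
# Hypothetical 40-term, 60-document matrix of raw counts.
E = rng.poisson(1.0, size=(40, 60)).astype(float)

U, s, Vt = np.linalg.svd(E, full_matrices=False)
k = 8
Uk = U[:, :k]  # n x k matrix of left singular vectors (the "vector lexicon")

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# A query is a "pseudo-document": a (weighted) term-frequency vector over
# the same n terms; here, an unweighted indicator of two query terms.
q = np.zeros(E.shape[0])
q[[3, 17]] = 1.0

# Equation (3): compare the folded-in query against each folded-in document.
doc_scores = [cosine(Uk.T @ q, Uk.T @ E[:, j]) for j in range(E.shape[1])]
best = int(np.argmax(doc_scores))  # index of the most similar document
```

Because both the query and the documents are projected through U_k before comparison, a document can score highly even when it contains none of the literal query terms, which is the retrieval benefit noted at the start of this section.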
Widespread use of LSI has led to the identification of certain problems exhibited by LSI when attempting to query massive heterogeneous document collections. An SVD is difficult to compute for extremely large term-document matrices, and precision-recall performance tends to degrade as collections become very large. Surprisingly, much of the technical discussion surrounding LSI has focused on linear algebraic methods and the algorithms that implement them, particularly the problem of applying the SVD to massive, sparse term-document matrices. Evaluations of the effect of changing parameters, e.g., different term weightings and the number of factors extracted by the SVD, on the performance of LSI have been performed. Most approaches to making LSI scale better have relied on increasing the complexity of LSI's indexing and search algorithms.
LSI is limited as an information retrieval and text mining strategy for growing document collections because, with large collections, there is an increasing probability of drawing documents from different conceptual domains. This increases the semantic heterogeneity modeled in a single LSI vector space, thereby introducing noise and “confusing” the LSI search algorithm. As polysemy becomes more pronounced in a collection, the vector for a term tends toward the centroid of the vectors for each unique meaning of the term, and since document vectors are computed from the weighted sum of the vectors for the terms they contain, their semantics are confounded as well.
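The statement that a document vector is the weighted sum of the vectors for its terms follows directly from the linearity of the projection, which a short sketch (using a hypothetical random count matrix) makes concrete:

```python
import numpy as np

rng = np.random.default_rng(2)
# Hypothetical 30-term, 50-document matrix of raw counts.
E = rng.poisson(1.0, size=(30, 50)).astype(float)
U, _, _ = np.linalg.svd(E, full_matrices=False)
Uk = U[:, :5]  # keep k = 5 factors; rows of Uk are the LSI term vectors

d = E[:, 0]  # a document's term-frequency vector in term space

# Folding the document into the LSI space is a linear map, so the result
# is exactly the frequency-weighted sum of the term vectors it contains.
folded = Uk.T @ d
weighted_sum = sum(d[i] * Uk[i] for i in range(len(d)))
assert np.allclose(folded, weighted_sum)
```

This linearity is precisely why polysemy is damaging: a term vector that is itself a centroid of several meanings contributes its averaged, diluted direction to every document that contains the term.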
In general, the number of conceptual domains grows with the size of a document collection. This may result from new concepts being introduced into the information space, or an existing concept becoming extremely large (in number of documents) with further differentiation of its sub-concepts. In both cases, the compression factor in any vector space-based method has to be increased to accommodate this inflation.
The deleterious effects of training on a large, conceptually undifferentiated document collection are numerous. For example, assume that documents drawn from two conceptual domains, technology and food, are combined without sourcing into a single training set, and that LSI is applied to this set to create a single vector space. It is easy to imagine how the semantics of these two domains might become confused. Take, for instance, the location of vectors representing the terms “chip” and “wafer.” In the technology domain, the following associations may be found: silicon chip, silicon wafer, silicon valley, and copper wafer. In the food domain, however, the terms chip and wafer take on different meanings, and there may be very different semantic relationships: potato chip, corn chip, corn sugar, sugar wafer. These semantic distinctions become confounded in the LSI vector space. By training on this conceptually undifferentiated corpus, the vectors computed for the shared terms “chip” and “wafer” do not discriminate well between the distinct meanings that these terms have in the two conceptual domains. Instead, two semantically “diluted” vectors are indexed, each representing only the numerical average or “centroid” of the term's separate meanings in the two domains.
Therefore, it would be desirable to have a method and system for performing LSI-based information retrieval and text mining operations that can be efficiently scaled to operate on large heterogeneous sets of data.
Furthermore, it would be desirable to have a method and system for performing LSI-based information retrieval and text mining operations on large data sets quickly and accurately.
Additionally, it would be desirable to have a method and system for performing LSI-based information retrieval and text mining operations on large data sets without the deleterious effects of mixing conceptually differentiated data.
Also, it would be desirable to have a method and system for the processing of large document collections into a structure that enables development of similarity graph networks of sub-collections having related concept domains.
Additionally, it would be desirable to have a method and system that enables a user to query the document collection in a flexible manner so that the user can specify the degree of similarity necessary in search results.