In the field of information retrieval, a long-standing objective has been the development of an automated procedure by which documents in one language could be effectively accessed by requests in another language without needing to translate either the documents or the requests. Among other things, such a capability would allow users to determine what documents were available in languages that the users could not read before incurring the expense and delay of translation.
One technique representative of previously proposed procedures was disclosed by G. Salton in an article entitled "Automatic Processing of Foreign Language Documents," published in 1970 in the Journal of the American Society for Information Science. Salton reported experimenting with a method for automatic retrieval of documents in one language in response to queries in another, using a vector representation and search technique in conjunction with a manually created dual-language thesaurus. The results for test samples of abstracts and queries were promising. However, creating an adequate multi-language thesaurus is difficult and requires considerable intellectual labor. Moreover, a traditional thesaurus necessarily imposes a discrete and rather restricted model of the languages in question and of their relation to one another.
U.S. Pat. No. 4,839,853, issued to one of the present co-inventors and assigned to the same assignee as is the present invention, utilizes the Latent Semantic Indexing (LSI) approach to model the underlying correlational structure of the distribution of terms in documents. Instead of representing documents and queries directly as sets of words, the LSI technique represents them as parameters in such a way that dependencies between words and between documents are taken into account. For example, if two terms are used in exactly the same contexts, that is, have identical distribution across a target collection of documents, LSI is designed to treat them not as two independent indexing entries but as two instances of an abstract indexing variable with the same vector value. Lesser and more indirect relations between terms and between documents are represented in an appropriate analogous fashion.
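The identical-distribution property described above can be illustrated with a minimal sketch. The sketch assumes NumPy and a hypothetical toy term-by-document matrix in which two rows are deliberately identical; the matrix, the choice of k = 2, and the names `X`, `term_vectors`, and `same` are illustrative assumptions and are not taken from the above-identified patent.

```python
import numpy as np

# Hypothetical toy term-by-document counts (rows: terms, columns: documents).
# Rows 0 and 1 are deliberately identical: two terms used in exactly
# the same documents with the same frequencies.
X = np.array([
    [1, 0, 2, 0],   # term A
    [1, 0, 2, 0],   # term B, same distribution as term A
    [0, 1, 0, 1],   # term C
    [1, 1, 0, 0],   # term D
], dtype=float)

k = 2  # number of retained dimensions, chosen arbitrarily for this sketch

# Thin SVD: X = T @ diag(s) @ Dt
T, s, Dt = np.linalg.svd(X, full_matrices=False)

# One k-dimensional vector per term (scaled by the singular values)
term_vectors = T[:, :k] * s[:k]

# Terms A and B receive the same vector, up to floating-point rounding
same = np.allclose(term_vectors[0], term_vectors[1])
```

Because each scaled term vector equals the term's row of counts multiplied by the retained document vectors, two identical rows necessarily map to the same point, whatever value of k is chosen.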
In the implementation of LSI as set forth in the above-identified patent, the modeling is accomplished by approximating the original term-by-document matrix by the product of three lower rank matrices of orthogonal derived indexing variables. The first matrix represents terms as values on a smaller set of independent "basis" vectors; the second matrix contains scaling coefficients; and the third matrix represents documents as values on the smaller set of basis vectors. The method can be interpreted geometrically as a means by which each document and each term is assigned to a point in a hyperspace. The mathematics and implementation of the method construct a derived space in which terms, documents, and queries can all be represented. The mathematical procedure employed is singular value decomposition (SVD), which is closely related to factor analysis and eigenvalue decomposition.
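The three-matrix approximation can be sketched as follows. This is a minimal illustration assuming NumPy; the helper name `truncated_svd`, the toy matrix `X`, and the choice of k = 2 are hypothetical and are not drawn from the above-identified patent.

```python
import numpy as np

def truncated_svd(term_doc, k):
    """Return (T_k, s_k, D_k) such that T_k @ diag(s_k) @ D_k.T
    approximates term_doc using only the k largest singular values."""
    T, s, Dt = np.linalg.svd(term_doc, full_matrices=False)
    return T[:, :k], s[:k], Dt[:k, :].T

# Hypothetical toy term-by-document counts (rows: terms, columns: documents)
X = np.array([
    [1, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 1, 0],
    [0, 0, 1, 1],
    [0, 0, 0, 1],
], dtype=float)

# T_k: terms on the k basis vectors; s_k: scaling coefficients;
# D_k: documents on the same k basis vectors
T_k, s_k, D_k = truncated_svd(X, k=2)

# Rank-k reconstruction of the original matrix
X_k = T_k @ np.diag(s_k) @ D_k.T

# Approximation error in the least squares (Frobenius) sense
error = np.linalg.norm(X - X_k)
```

Each row of `T_k` places a term, and each row of `D_k` places a document, at a point in the same k-dimensional space, which corresponds to the geometric interpretation given above.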
The retrieval process is the same as in standard vector methods, e.g., using document-query cosines as the similarity measure. Various preprocessing steps, such as term weighting, may also be done in standard ways. The principal difference between LSI and previous vector models as represented by the work of Salton is that the vectors are constructed in a space with many fewer dimensions than the number of original terms, and that these dimensions are the subset of linearly independent basis vectors by which the original term-by-document matrix can be best approximated in a least squares sense. The number of dimensions retained has been determined empirically; optimal retrieval performance has usually been obtained with about 100 dimensions for collections of many hundreds to several thousands of documents.
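Retrieval in the reduced space can be sketched as follows, assuming NumPy. The toy matrix, the value k = 2, and the names `fold_in`, `cosine`, and `doc_coords` are hypothetical. The sketch compares queries and documents in coordinates scaled by the singular values, which is one common convention and is an assumption here; other scalings are also used in practice.

```python
import numpy as np

def cosine(a, b):
    """Cosine of the angle between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical toy term-by-document counts (rows: terms, columns: documents)
X = np.array([
    [1, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 1, 0],
    [0, 0, 1, 1],
    [0, 0, 0, 1],
], dtype=float)

k = 2  # far smaller than the roughly 100 dimensions reported for real collections
T, s, Dt = np.linalg.svd(X, full_matrices=False)
T_k, s_k, D_k = T[:, :k], s[:k], Dt[:k, :].T

# Document coordinates in the reduced space, scaled by the singular values
doc_coords = D_k * s_k

def fold_in(query_counts):
    """Project a vector of query term counts into the same k-dimensional space."""
    return query_counts @ T_k

# A query that uses exactly the terms of document 0
query = X[:, 0].copy()
q_coords = fold_in(query)

# Document-query cosines, then documents ordered from most to least similar
scores = [cosine(q_coords, d) for d in doc_coords]
ranking = np.argsort(scores)[::-1]
```

Under this scaling, a query identical to an indexed document is folded in at that document's own coordinates, so its cosine with that document is 1 up to rounding.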
The dimension reduction step of LSI has the advantageous property that small sources of variability in term usage are dropped and only the most important sources are kept. Among other things, this can cause synonyms or near synonyms to be collapsed into similar vector representations, with the result that a query can retrieve relevant documents even though it shares no terms with them. This cannot happen in the usual raw term vector representation, which therefore requires manually constructed thesauri, with their attendant problems.
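The effect can be illustrated on a deliberately small, hypothetical collection in which "car" and "automobile" never occur in the same document but do occur with the same context terms. The vocabulary, the documents, the value k = 2, and all names below are assumptions made for this sketch, which uses NumPy; behavior on a real collection would depend on the data and on the number of dimensions retained.

```python
import numpy as np

def cosine(a, b):
    """Cosine of the angle between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

terms = ["car", "automobile", "engine", "wheel", "flower", "petal"]

# Hypothetical documents (columns):
#   d0: car engine wheel    d1: automobile engine wheel
#   d2: car engine          d3: flower petal      d4: flower petal
X = np.array([
    [1, 0, 1, 0, 0],   # car
    [0, 1, 0, 0, 0],   # automobile
    [1, 1, 1, 0, 0],   # engine
    [1, 1, 0, 0, 0],   # wheel
    [0, 0, 0, 1, 1],   # flower
    [0, 0, 0, 1, 1],   # petal
], dtype=float)

# Query consisting of the single term "automobile"
query = np.zeros(len(terms))
query[terms.index("automobile")] = 1.0

# Raw term vectors: d2 ("car engine") shares no term with the query
raw_d2 = cosine(query, X[:, 2])

# Reduced space with k = 2 retained dimensions
k = 2
T, s, Dt = np.linalg.svd(X, full_matrices=False)
T_k, s_k, D_k = T[:, :k], s[:k], Dt[:k, :].T
doc_coords = D_k * s_k
q_coords = query @ T_k

lsi_d2 = cosine(q_coords, doc_coords[2])   # vehicle document without "automobile"
lsi_d3 = cosine(q_coords, doc_coords[3])   # unrelated flower document
```

In the raw representation `raw_d2` is zero because the query and d2 have no term in common. In the reduced space the query lies close to d2, since both load on the same retained dimension, while the unrelated flower document remains nearly orthogonal to the query.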
The LSI method has previously been applied only within a single language, and there has been no teaching or suggestion in the art regarding the application of LSI to multi-language information retrieval.