The present invention relates generally to systems for searching document sets, and more specifically to an advanced system for cross language document retrieval.
Latent Semantic Analysis
Latent Semantic Analysis (LSA) is a promising departure from traditional models. The LSA method attempts to provide intelligent agents with a process of semantic acquisition. Researchers at Bellcore (Deerwester et al., 1990, No. 11 in Appendix A; Berry et al., 1995, No. 5 in Appendix A; Dumais et al., 1991 and 1998, Nos. 11 and 12 in Appendix A) have described a computationally intensive algorithm known as Latent Semantic Indexing (LSI). LSI is an unsupervised classification technique based on a matrix factorization method. Cognitive scientists have shown that the performance of LSI on multiple-choice vocabulary and domain knowledge tests emulates expert essay evaluations (Foltz et al., 1998, No. 16 in Appendix A; Kintsch, in press, No. 18 in Appendix A; Landauer and Dumais, 1997, No. 20 in Appendix A; Landauer et al., 1997 and 1998, Nos. 22 and 23 in Appendix A; Wolfe et al., 1998, No. 37 in Appendix A). LSI is based on Singular Value Decomposition (SVD). Bartell et al. (1996), No. 3 in Appendix A, have shown that LSI is an optimal special case of multidimensional scaling. The aim of all indexing schemes which are based on multivariate analysis or unsupervised classification methods is to automate the process of clustering and categorizing documents by topic. An expensive precursor was the method of repertory hypergrids, which requires expert rating of knowledge chunks against a number of discriminant traits (Boose, 1985, No. 6 in Appendix A; Waltz and Pollack, 1985, No. 36 in Appendix A; Bernstein et al., 1991, No. 4 in Appendix A; Madigan et al., 1995, No. 24 in Appendix A). While theoretically appealing, the LSI approach has serious limitations. First, it typically takes several hours to index tens of thousands of documents. Additionally, lack of scalability limits the amount of information that is available for semantic learning. This in turn places a serious limitation on the precision of the search.
Lack of scalability has also prevented the extension of the LSI technique to cross language semantic analysis, a field in which it holds much promise.
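As a rough illustration of the SVD-based indexing described above, the following sketch builds a tiny term-document matrix, keeps the top k singular vectors, and ranks documents against a query by cosine similarity in the reduced space. The vocabulary, the four toy documents, the function name lsi_scores, and the choice of k = 2 are all assumptions made for illustration; they do not come from the cited works, and the sketch is not a description of any particular published implementation.

```python
import numpy as np

# Toy term-document matrix (rows = terms, columns = documents).
# Terms and documents are invented for illustration only.
TERMS = ["car", "automobile", "engine", "flower", "petal"]
A = np.array([
    [1, 0, 0, 0],  # car
    [0, 1, 0, 0],  # automobile
    [1, 1, 0, 0],  # engine
    [0, 0, 1, 2],  # flower
    [0, 0, 1, 1],  # petal
], dtype=float)


def lsi_scores(term_doc, query_vec, k):
    """Cosine similarity between a query and every document in a
    k-dimensional latent space obtained from a truncated SVD."""
    # numpy returns singular values in descending order.
    U, _, _ = np.linalg.svd(term_doc, full_matrices=False)
    U_k = U[:, :k]                   # top-k left singular vectors
    docs = U_k.T @ term_doc          # k x n_docs latent coordinates
    q = U_k.T @ query_vec            # query folded into the same space
    norms = np.linalg.norm(q) * np.linalg.norm(docs, axis=0)
    return (q @ docs) / np.maximum(norms, 1e-12)


# Query containing only the term "car".
query = np.array([1, 0, 0, 0, 0], dtype=float)

keyword_overlap = query @ A          # plain term matching
scores = lsi_scores(A, query, k=2)   # latent-space matching
```

In this toy case, document 1 ("automobile engine") shares no term with the query, so plain keyword matching gives it zero overlap, yet it scores highly in the latent space because it co-occurs with "engine" in the same way document 0 does. The full SVD computed here is also the step whose cost grows quickly with the size of the matrix.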
Cross Language Document Retrieval
The Internet is a multilingual universe where travel is limited by the speed of indexing. However, existing search portals do not equalize the accessibility of information across languages. No existing search engine indexes more than 30% of the Web. This results, at least in part, from technological limitations, which have to do with the speed and scalability of existing Web crawling technology, and the availability of network bandwidth. Also, many existing sites cannot maintain up-to-date indices because indexing technology has not been fully integrated with a database management system. Whenever possible, existing Web robots and crawlers limit indexing to pages in the language that is most likely the language of a regional audience. The assumption on which these limitations are based is that user information cannot be matched to requirements for more than one language at a time, and that information in a foreign language is of no interest to a general user. Experiments in monolingual search with foreign language portals point to the segmentation of the Internet space into cultural and linguistic provinces. Accumulating background information in many foreign languages at once is a significant technical challenge. For example, how can a system measure the reaction of the Italian, Greek, Croatian, and Russian people to events in nearby Kosovo? Opinions on such a subject are expressed in home pages, articles, editorials and chat rooms in many languages. It would be desirable to weight articles and opinions across languages and isolate the most relevant clusters of information for translation.
Furthermore, any algorithm applied to cross language document retrieval should be scalable to very large information matrices. An effective system could power the first truly international search portal. Multilingual search provided through such a portal could change the overall dynamics and structure of the Internet, upset its cultural imbalance, and open new markets. Today, seventy-five to eighty percent of Web content, including many authority pages, is in English. The great majority of Internet users are from English speaking countries. Many American users are not multilingual, or find it difficult to formulate a query in other languages. The converse is true of many foreign users, even those with an elementary reading knowledge of English. It would therefore be desirable for Web surfers to be able to express queries or examples in the language in which they are most competent, and obtain relevant text passages in any language. Automatic translation engines, referred to as Machine Translators (MT), could then be applied to selectively convert some of this information into the source language. Examples of existing Machine Translators include Babelfish™ as provided by the AltaVista Company, and NeuroTran™ provided by Translation Experts, Ltd. Multilingual search technology could also improve monolingual search in more than one way. The omission of many foreign language pages from the relevant indices destroys the integrity of the link structure of the Web. As a result, for example, the HTML page of a foreign researcher or a foreign institution may never be found, even if it points to a publication in the English language. In addition, multilingual search capabilities could resolve keyword and concept ambiguities across languages.
Existing Approaches
A direct approach to multilingual interrogation is to use existing Machine Translation (MT) systems to automatically translate an entire textual database from every single language into the language of the user. This approach is clearly unrealistic for the Internet, due to the size of the target search space. Moreover, MT syntax errors, and, more significantly, errors in translating concepts make it technically unsuitable for other multilingual database collections in general. A variation on this approach is multilingual interrogation. In multilingual interrogation, the idea is to translate the query from a source language to multiple target languages, for example, using inter-lingual dictionaries and knowledge bases. In addition, translation into different languages must account for the fact that concepts expressed by a single term in one language sometimes are expressed by multiple distinct terms in another. For example, the term “tempo” in Italian corresponds to two different concepts in English: time and weather.
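The ambiguity problem in dictionary-based query translation can be made concrete with a small sketch. The dictionary below is a hypothetical two-entry Italian-to-English lookup and the function name expand_query is invented for illustration; no real inter-lingual dictionary or knowledge base is being modeled. Each source term maps to every candidate translation, and the query is expanded into all combinations.

```python
from itertools import product

# Hypothetical toy Italian -> English dictionary (illustration only).
IT_TO_EN = {
    "tempo": ["time", "weather"],   # one term, two distinct concepts
    "previsioni": ["forecast"],
}


def expand_query(terms, dictionary):
    """Return every target-language reading of a source-language query.

    Terms missing from the dictionary are passed through unchanged.
    """
    alternatives = [dictionary.get(term, [term]) for term in terms]
    return list(product(*alternatives))


readings = expand_query(["previsioni", "tempo"], IT_TO_EN)
# readings holds both ("forecast", "time") and ("forecast", "weather");
# the dictionary alone cannot tell which reading the user intended.
```

The number of readings is the product of the number of alternatives per term, so longer queries with several ambiguous terms expand quickly, and each additional target language multiplies the work again.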
Existing approaches based on creation of inter-lingual pivot concepts require the introduction of keyword tags that can discriminate between word meanings in different languages. This controlled vocabulary approach cannot account for all semantic variations in all languages, and often prohibits precise queries that are not expressed with the authorized keywords. A more data-driven approach consists of deducing, during indexing, the keywords that would be supplied for a document from the terms contained in the full-text or summary of the document. Unfortunately, the creation of these directories is time-consuming. It can be done either manually by a team of experts, or by an automatic learning process from previously indexed documents. Again, linking different languages requires the introduction of a pivot language.
Still another existing approach consists of combining machine translation methods with information retrieval methods. This approach has been developed by the European ESPRIT consortium in the project EMIR (European Multilingual Information Retrieval) (EMIR, 1994, No. 15 in Appendix A). This system uses three main tools: 1) linguistic processors (morphological and syntactic analysis) which perform grammatical tagging, identify dependency relations and normalize the representation of uniterms and compounds; 2) a statistical model which is used to weight the query-document intersection; 3) a monolingual and multilingual reformulation system whose aim is to infer, from the original natural language query words, all possible expressions of the same concept that can occur in the document, whatever the language. Tests with a trilingual (English, French and German) version of the Cranfield corpus show that multilingual interrogation is 8% better than using MT followed by monolingual interrogation. However, this system has yet to demonstrate scalability and ease of extension to other languages.
The most promising automated approach to cross language retrieval is an extension of LSI given by Dumais et al. (1996 and 1997, Nos. 13 and 1 in Appendix A) and known as CL-LSI (Cross-Language LSI). In a vector space model, documents for which there exists a translation into multiple languages can be observed in language subspaces. CL-LSI approximates these language subspaces by the usual eigenvector decomposition. By identifying and aligning principal axes for the various languages, the LSI algorithm correlates clusters of documents across the various language subspaces. The alignment is made possible by 1) cross-language homonyms, and 2) the general statistics of term distributions in a reasonably large training collection. Testing on a sample of 2,500 paragraphs from the Canadian Parliament bilingual corpus (the Hansard collection) has demonstrated that cross-language retrieval with LSI is equivalent to monolingual interrogation of a fully translated database.
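One way to picture the cross-language alignment described above is the following minimal sketch. It assumes a dual-language training setup in which each training document is concatenated with its translation before the SVD is computed, so that terms from both languages share one latent space; this setup, the three toy English-French pairs, the function names, and k = 2 are assumptions made for illustration and are not taken from the text above.

```python
import numpy as np


def bag_of_words(text, vocab):
    """Term-count vector for a text; out-of-vocabulary tokens are ignored."""
    vec = np.zeros(len(vocab))
    for tok in text.lower().split():
        if tok in vocab:
            vec[vocab[tok]] += 1.0
    return vec


def train_cl_lsi(parallel_pairs, k):
    """Fit a shared latent space from (english, french) training pairs.

    Assumption: each pair is merged into one dual-language document.
    """
    merged = [en + " " + fr for en, fr in parallel_pairs]
    terms = sorted({tok for doc in merged for tok in doc.lower().split()})
    vocab = {term: i for i, term in enumerate(terms)}
    A = np.column_stack([bag_of_words(doc, vocab) for doc in merged])
    U, _, _ = np.linalg.svd(A, full_matrices=False)
    return vocab, U[:, :k]


def project(text, vocab, U_k):
    """Fold a monolingual text into the shared latent space."""
    return U_k.T @ bag_of_words(text, vocab)


def cosine(a, b):
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom > 0 else 0.0


# Toy parallel training collection (invented for illustration).
pairs = [
    ("car engine", "voiture moteur"),
    ("car road", "voiture route"),
    ("flower garden", "fleur jardin"),
]
vocab, U_k = train_cl_lsi(pairs, k=2)

# English query scored against French-only documents.
query = project("engine", vocab, U_k)
french_docs = ["voiture route", "fleur jardin"]
scores = [cosine(query, project(doc, vocab, U_k)) for doc in french_docs]
```

Here the English query "engine" has no term in common with either French document, yet it scores close to 1 against "voiture route" and close to 0 against "fleur jardin", because the merged training documents place the vehicle terms of both languages on the same latent axis. No translation of the query or the documents takes place at search time.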