The present invention relates generally to systems for searching document sets, and more specifically to an advanced system for cross language document retrieval.
Latent Semantic Analysis
Latent Semantic Analysis (LSA) is a promising departure from traditional models. The LSA method attempts to provide intelligent agents with a process of semantic acquisition. Researchers at Bellcore (Deerwester et al., 1990, No. 11 in Appendix A; Berry et al., 1995, No. 5 in Appendix A; Dumais et al., 1991 and 1998, Nos. 11 and 12 in Appendix A) have described a computationally intensive algorithm known as Latent Semantic Indexing (LSI). LSI is an unsupervised classification technique based on a matrix factorization method. Cognitive scientists have shown that the performance of LSI on multiple-choice vocabulary and domain knowledge tests emulates expert essay evaluations (Foltz et al., 1998, No. 16 in Appendix A; Kintsch, in press, No. 18 in Appendix A; Landauer and Dumais, 1997, No. 20 in Appendix A; Landauer et al., 1997 and 1998, Nos. 22 and 23 in Appendix A; Wolfe et al., 1998, No. 37 in Appendix A). LSI is based on Singular Value Decomposition (SVD). Bartell et al. (1996), No. 3 in Appendix A, have shown that LSI is an optimal special case of multidimensional scaling. The aim of all indexing schemes which are based on multivariate analysis or unsupervised classification methods is to automate the process of clustering and categorizing documents by topic. An expensive precursor was the method of repertory hypergrids, which requires expert rating of knowledge chunks against a number of discriminant traits (Boose, 1985, No. 6 in Appendix A; Waltz and Pollack, 1985, No. 36 in Appendix A; Bernstein et al., 1991, No. 4 in Appendix A; Madigan et al., 1995, No. 24 in Appendix A). While theoretically appealing, LSI has serious limitations. First, it typically takes several hours to index tens of thousands of documents. Additionally, lack of scalability limits the amount of information that is available for semantic learning. This in turn places a serious limitation on the precision of the search.
Lack of scalability has also prevented the extension of the LSI technique to cross language semantic analysis, a field in which it holds much promise.
Cross Language Document Retrieval
The Internet is a multilingual universe where travel is limited by the speed of indexing. However, existing search portals do not equalize the accessibility of information across languages. No existing search engine indexes more than 30% of the Web. This results, at least in part, from technological limitations, which have to do with the speed and scalability of existing Web crawling technology, and the availability of network bandwidth. Also, many existing sites cannot maintain up-to-date indices because indexing technology has not been fully integrated with a database management system. Whenever possible, existing Web robots and crawlers limit indexing to pages in the language that is most likely the language of a regional audience. The assumption on which these limitations are based is that user information cannot be matched to requirements for more than one language at a time, and that information in a foreign language is of no interest to a general user. Experiments in monolingual search with foreign language portals point to the segmentation of the Internet space into cultural and linguistic provinces. Accumulating background information in many foreign languages at once is a significant technical challenge. For example, how can a system measure the reaction of the Italian, Greek, Croatian, and Russian people to events in nearby Kosovo? Opinions on such a subject are expressed in home pages, articles, editorials and chat rooms in many languages. It would be desirable to weight articles and opinions across languages and isolate the most relevant clusters of information for translation.
Furthermore, any algorithm applied to cross language document retrieval should be scalable to very large information matrices. An effective system could power the first truly international search portal. Multilingual search provided through such a portal could change the overall dynamics and structure of the Internet, upset its cultural imbalance, and open new markets. Today, seventy-five to eighty percent of Web content, including many authority pages, is in English. The great majority of Internet users are from English speaking countries. Many American users are not multilingual, or find it difficult to formulate a query in other languages. The converse is true of many foreign users, even those with an elementary reading knowledge of English. It would therefore be desirable for Web surfers to be able to express queries or examples in the language in which they are most competent, and obtain relevant text passages in any language. Automatic translation engines, referred to as Machine Translators (MT), could then be applied to selectively convert some of this information into the source language. Examples of existing Machine Translators include Babelfish™ as provided by the AltaVista Company, and NeuroTran™ provided by Translation Experts, Ltd. Multilingual search technology could also improve monolingual search in more than one way. The omission of many foreign language pages from the relevant indices destroys the integrity of the link structure of the Web. As a result, for example, the HTML page of a foreign researcher or a foreign institution may never be found, even if it points to a publication in the English language. In addition, multilingual search capabilities could resolve keyword and concept ambiguities across languages.
Existing Approaches
A direct approach to multilingual interrogation is to use existing Machine Translation (MT) systems to automatically translate an entire textual database from every single language into the language of the user. This approach is clearly unrealistic for the Internet, due to the size of the target search space. Moreover, MT syntax errors, and, more significantly, errors in translating concepts make it technically unsuitable for other multilingual database collections in general. A variation on this approach is multilingual interrogation. In multilingual interrogation, the idea is to translate the query from a source language to multiple target languages, for example, using inter-lingual dictionaries and knowledge bases. In addition, translation into different languages must account for the fact that concepts expressed by a single term in one language sometimes are expressed by multiple distinct terms in another. For example, the term "tempo" in Italian corresponds to two different concepts in English: time and weather.
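The ambiguity problem in dictionary-based query translation can be illustrated with a minimal sketch. The toy dictionary `IT_TO_EN` and the function `translate_query` are hypothetical names introduced here only for illustration; an actual system would rely on full inter-lingual dictionaries and knowledge bases.

```python
from itertools import product

# Hypothetical toy Italian-to-English dictionary (illustration only).
# A single source term may map to several distinct target concepts.
IT_TO_EN = {
    "tempo": ["time", "weather"],
    "previsioni": ["forecast"],
}

def translate_query(terms, dictionary):
    """Return every candidate translation of a query.

    Each source term is replaced by each of its possible target terms;
    terms missing from the dictionary are kept unchanged.
    """
    options = [dictionary.get(t, [t]) for t in terms]
    return [list(combo) for combo in product(*options)]

candidates = translate_query(["previsioni", "tempo"], IT_TO_EN)
for candidate in candidates:
    print(" ".join(candidate))
```

Each ambiguous term multiplies the number of candidate queries, so without some means of disambiguation the translated query set grows quickly and dilutes precision.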
Existing approaches based on creation of inter-lingual pivot concepts require the introduction of keyword tags that can discriminate between word meanings in different languages. This controlled vocabulary approach cannot account for all semantic variations in all languages, and often prohibits precise queries that are not expressed with the authorized keywords. A more data driven approach consists of deducing, during indexing, the keywords that would be supplied for a document from the terms contained in the full-text or summary of the document. Unfortunately, the creation of these directories is time consuming. It can be done either manually by a team of experts, or by an automatic learning process from previously indexed documents. Again, linking different languages requires the introduction of a pivot language.
Still another existing approach consists of combining machine translation methods with information retrieval methods. This approach has been developed by the European ESPRIT consortium in the project EMIR (European Multilingual Information Retrieval) (EMIR, 1994, No. 15 in Appendix A). This system uses three main tools: 1) linguistic processors (morphological and syntactic analysis) which perform grammatical tagging, identify dependency relations and normalize the representation of uniterms and compounds; 2) a statistical model which is used to weight the query-document intersection; 3) a monolingual and multilingual reformulation system whose aim is to infer, from the original natural language query words, all possible expressions of the same concept that can occur in the document, whatever the language. Tests with a trilingual (English, French and German) version of the Cranfield corpus show that multilingual interrogation is 8% better than using MT followed by monolingual interrogation. However, this system has yet to demonstrate scalability and ease of extension to other languages.
The most promising automated approach to cross language retrieval is an extension of LSI given by Dumais et al. (1996 and 1997, Nos. 13 and 1 in Appendix A) and known as CL-LSI (Cross-Language LSI). In a vector space model, documents for which there exists a translation into multiple languages can be observed in language subspaces. CL-LSI approximates these language subspaces by the usual eigenvector decomposition. By identifying and aligning principal axes for the various languages, the LSI algorithm correlates clusters of documents across the various language subspaces. The alignment is made possible by 1) cross-language homonyms, and 2) the general statistics of term distributions in a reasonably large training collection. Testing on a sample of 2,500 paragraphs from the Canadian Parliament bilingual corpus (the Hansard collection) has demonstrated that cross-language retrieval with LSI is equivalent to monolingual interrogation of a fully translated database.
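The idea behind CL-LSI can be sketched on a toy example, assuming NumPy is available. The vocabulary, training documents, and helper names (`vectorize`, `project`, `cosine`) below are invented for illustration and are not taken from the cited work. Each training column stacks the English and French term counts of one translated document, a truncated SVD yields a shared reduced space, and an English query is then compared against French-only documents in that space.

```python
import numpy as np

# Toy bilingual vocabulary (assumption): English rows, then French rows.
en_terms = ["cat", "dog", "house", "roof"]
fr_terms = ["chat", "chien", "maison", "toit"]
vocab = en_terms + fr_terms
index = {term: i for i, term in enumerate(vocab)}

def vectorize(words):
    """Term-count vector over the combined bilingual vocabulary."""
    v = np.zeros(len(vocab))
    for w in words:
        v[index[w]] += 1.0
    return v

# Each training document contains its own translation (dual-language).
training = [
    ["cat", "dog", "chat", "chien"],
    ["cat", "chat"],
    ["house", "roof", "maison", "toit"],
    ["house", "maison"],
    ["roof", "toit"],
]
A = np.column_stack([vectorize(d) for d in training])

# Truncated SVD: keep the k leading left singular vectors.
U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2
U_k = U[:, :k]

def project(words):
    """Fold a monolingual text into the shared k-dimensional space."""
    return U_k.T @ vectorize(words)

def cosine(a, b):
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom > 0 else 0.0

# English query scored against French-only documents.
query = project(["dog"])
french_docs = [["chat", "chien"], ["maison", "toit"]]
scores = [cosine(query, project(d)) for d in french_docs]
print(scores)
```

In this toy setting the French document about pets scores far higher than the one about housing, even though it shares no term with the English query. The full SVD used here is also the step whose cost limits scalability on large collections.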
An inverse inference engine for high performance Web searching is disclosed, which includes a superior method for performing Latent Semantic Analysis, in which the underlying search problem is cast as a Backus-Gilbert (B-G) inverse problem (Press et al., 1997, No. 32 in Appendix A). Improved efficiency is provided by the inverse inference engine as a result of solving an optimization problem for the distance between a transformed query vector and document clusters directly in a transform space. Semantic bases approximate the query in this transform space. Bases with negative coefficients contain the latent semantic information. The inverse inference engine may be applied to a search tool that returns a list of direct document hits and a list of latent document hits in response to a query. The Inverse Inference approach of the disclosed system is a new approach to Latent Semantic Analysis (LSA) that, unlike LSI, is fast and scalable, and is therefore applicable to the task of cross language semantic analysis.
An extension of the inverse inference engine provides cross language document retrieval in a way that is scalable to very large information matrices. In contrast to previous approaches using cross-language LSI (CL-LSI), the disclosed system for cross language document retrieval uses the much faster inverse inference engine, instead of SVD, to perform matrix reduction. In the disclosed cross-language search extension to the inverse inference engine, the list of direct document hits may contain local language document hits, while the list of latent document hits may contain foreign language document hits. In addition to performing cross language document retrieval, the disclosed search technology also provides automatic tools for accelerating the construction of a multilingual lexicon, and for extracting terminology from multilingual corpora of texts.
In the disclosed cross language document retrieval system, the information matrix used as input to the inverse inference engine is organized into blocks of rows corresponding to languages within a predetermined set of natural languages. For example, using a predetermined language set consisting of English, French and Italian, an illustrative information matrix would consist of 3 sections of rows, a first of which is associated with English keywords, a second of which is associated with French keywords, and a third of which is associated with Italian keywords. Columns of entries within the first section of rows in the information matrix represent documents in English, columns of entries within the second section of rows represent documents in French, and columns of entries within the third section of rows represent documents in Italian.
The information matrix is further organized column-wise into two main partitions. The first partition is a left-hand side column vector of blocks of entries representing fully translated documents, which may be referred to as the "reference documents" or "training set." The second partition is a matrix of blocks of entries representing documents for which translations are not available in all of the predetermined languages, including a number of sets of columns corresponding to the languages in the predetermined language set. Further in the second partition, entries in blocks outside the main diagonal of blocks contain zero values. In other words, those entries in blocks along the main diagonal within the second partition represent the contents of those documents for which full translations are not available, and which make up the target search space.
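The block layout described above can be sketched as follows. This is a minimal illustration under stated assumptions: the keyword lists, the whitespace tokenization, and the names `count_vector` and `build_information_matrix` are hypothetical and are not specified by the disclosure.

```python
LANGUAGES = ["en", "fr", "it"]

# Hypothetical toy keyword lists, one block of rows per language.
KEYWORDS = {
    "en": ["time", "rain"],
    "fr": ["temps", "pluie"],
    "it": ["tempo", "pioggia"],
}

def count_vector(text, keywords):
    """Count occurrences of each keyword in a whitespace-tokenized text."""
    words = text.lower().split()
    return [words.count(k) for k in keywords]

def build_information_matrix(reference_docs, target_docs):
    """Build a keyword-by-document matrix as a list of rows.

    reference_docs: list of {lang: text} dicts, one text per language
                    (fully translated documents, the training set).
    target_docs:    {lang: [text, ...]} for documents available in a
                    single language only (the target search space).
    """
    columns = []
    # First partition: reference documents fill every language block.
    for doc in reference_docs:
        col = []
        for lang in LANGUAGES:
            col.extend(count_vector(doc[lang], KEYWORDS[lang]))
        columns.append(col)
    # Second partition: each target document fills only its own language
    # block; the off-diagonal blocks are left as zeros.
    for doc_lang in LANGUAGES:
        for text in target_docs.get(doc_lang, []):
            col = []
            for lang in LANGUAGES:
                if lang == doc_lang:
                    col.extend(count_vector(text, KEYWORDS[lang]))
                else:
                    col.extend([0] * len(KEYWORDS[lang]))
            columns.append(col)
    # Transpose the list of columns into a list of rows.
    return [list(row) for row in zip(*columns)]

reference = [{"en": "rain time", "fr": "pluie temps", "it": "pioggia tempo"}]
targets = {"en": ["time time"], "fr": ["pluie"], "it": ["tempo"]}
matrix = build_information_matrix(reference, targets)
for row in matrix:
    print(row)
```

Here the first column is the single reference document, with non-zero entries in all three row blocks, while each of the remaining columns has non-zero entries only in the row block of its own language.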
Another extension to the inverse inference document retrieval system is disclosed that supports automatic, knowledge based training. This approach generalizes the idea of using a training set, as described in connection with cross language document retrieval, to the problem of searching databases including information that is diluted or not reliable enough to allow the creation of robust semantic links.