1. Field of the Invention
The present invention relates to information retrieval and, more particularly, to a system and method for information retrieval systems employing a transfer corpus to retrieve information based on a query and information in different languages.
2. Description of the Related Art
Systems for retrieving documents given a query in the same language as the documents are widely available; web search engines are one example. A commonly used scheme is based on the Okapi formula described in S. E. Robertson et al., "Some Simple Effective Approximations to the 2-Poisson Model for Probabilistic Weighted Retrieval" in Proceedings of the 17th International Conference on Research and Development in Information Retrieval, ed. by W. B. Croft and C. J. van Rijsbergen (1994), incorporated herein by reference, which counts the number of words the query and the document have in common and weights the counts by a measure of the rarity of each word. This method is language independent (the query and document can be in any language, as long as it is the same language), although simple language-specific linguistic preprocessing steps (e.g., morphological analysis to find root words) improve the performance. This type of linguistic preprocessing is available for many languages.
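The idea of counting shared words and weighting them by rarity can be illustrated with a short sketch. This is a simplified stand-in, not the Okapi formula itself: the weight log((N + 1) / (df + 0.5)), the function names, and the toy documents are all assumptions made for illustration.

```python
import math
from collections import Counter

def idf_weights(docs):
    """Rarity weight per word; words found in fewer documents weigh more.
    The exact form log((N + 1) / (df + 0.5)) is an illustrative assumption."""
    n = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(doc))  # count each word once per document
    return {w: math.log((n + 1) / (c + 0.5)) for w, c in df.items()}

def overlap_score(query, doc, weights):
    """Sum, over distinct query words, of (occurrences in doc) * (rarity weight)."""
    counts = Counter(doc)
    return sum(counts[w] * weights.get(w, 0.0) for w in set(query))

# Toy monolingual collection (hypothetical data).
docs = [
    "the cat sat on the mat".split(),
    "the dog chased the cat".split(),
    "stock markets fell sharply".split(),
]
weights = idf_weights(docs)
query = "cat mat".split()
scores = [overlap_score(query, d, weights) for d in docs]
# Document indices in decreasing order of score.
ranking = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)
```

Because "mat" occurs in fewer documents than "cat", it contributes more to the score, so the first document outranks the second even though both contain "cat".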
Several methods have been described for retrieving documents in a language A, given a query in a language B (different from A), a task referred to as "cross-language information retrieval (CLIR)". The two most common techniques are document-translation CLIR and query-translation CLIR. These methods and others have been extensively reviewed, as described for document-translation CLIR in D. W. Oard, "Alternative Approaches for Cross-Language Text Retrieval" in AAAI Spring Symposium on Cross Language Text and Speech Retrieval (1997) and J. G. Carbonell et al., "Translingual Information Retrieval: A Comparative Evaluation" in Proceedings of the Fifteenth International Joint Conference on Artificial Intelligence (1997), both incorporated herein by reference. The system described in these references was based on document-translation CLIR: with a machine translation system, the documents were translated from language A to language B. The translated documents are then indexed by an information retrieval (IR) system operating in language B, the query language. A query entered into the IR system retrieves a translated document. Locating the original untranslated document is then trivial because the original documents and their translations are in one-to-one correspondence.
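The document-translation flow described above can be sketched as follows. The word-for-word dictionary is a hypothetical stand-in for a machine translation system, and the document identifiers, the overlap-based search, and the toy French/English data are likewise assumptions for illustration only.

```python
# Hypothetical toy lexicon standing in for a machine translation system (A = French, B = English).
FR_TO_EN = {"le": "the", "chat": "cat", "dort": "sleeps", "chien": "dog", "court": "runs"}

def translate(tokens, lexicon):
    """Word-for-word substitution; unknown words pass through unchanged."""
    return [lexicon.get(t, t) for t in tokens]

# Original documents in language A, keyed by document id.
originals = {"d1": "le chat dort".split(), "d2": "le chien court".split()}

# Translate every document into language B and index the translations under the same ids.
translated = {doc_id: translate(toks, FR_TO_EN) for doc_id, toks in originals.items()}

def search(query_tokens, indexed_docs):
    """Rank document ids by the number of distinct query words they contain."""
    q = set(query_tokens)
    hits = [(len(q & set(toks)), doc_id) for doc_id, toks in indexed_docs.items()]
    hits.sort(key=lambda h: (-h[0], h[1]))
    return [doc_id for score, doc_id in hits if score > 0]

# A language-B query retrieves translated documents; the shared id recovers each original.
hits = search("dog runs".split(), translated)
retrieved_originals = [originals[doc_id] for doc_id in hits]
```

Keeping the same identifier for a document and its translation is what makes the lookup of the original trivial, reflecting the one-to-one correspondence noted above.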
Another widely used method is query-translation CLIR, in which the queries are translated into language A, and then an IR system operating in language A uses the translated queries to retrieve the documents. Other methods have also been described which use a parallel corpus of pairs of documents that are known to be translations of each other, for example, as described in S. T. Dumais et al., "Automatic Cross-Language Retrieval Using Latent Semantic Indexing" in AAAI Symposium on Cross-Language Text and Speech Retrieval, American Association for Artificial Intelligence (1997), incorporated herein by reference, but which do not involve any translation of the documents in the corpus from which documents are retrieved.
Therefore, a need exists for a multilingual information retrieval system in which both queries and documents may be in many different languages. A further need exists for an information retrieval (IR) system which combines pairs of languages to retrieve information between a third pair of languages.
A method for retrieving information, in accordance with the present invention, includes the steps of providing an initial query in a first language, retrieving data in a second language in accordance with the initial query, formulating the query in the second language, retrieving data in a third language in accordance with the query formulated in the second language and outputting data retrieved in the third language in accordance with the initial query.
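The sequence of steps recited above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the claimed implementation: each corpus entry pairs a document in its original language with an assumed translation into the language of the query used against it, the scoring is plain word overlap, and all names and toy data (English query, French transfer corpus, German target corpus) are hypothetical.

```python
from collections import Counter

def retrieve(query, corpus, top_n=1):
    """Rank documents by word overlap between the query and the indexed (translated) side."""
    q = set(query)
    scored = []
    for doc_id, (_original, indexed) in corpus.items():
        overlap = len(q & set(indexed.split()))
        if overlap:
            scored.append((overlap, doc_id))
    scored.sort(key=lambda s: (-s[0], s[1]))
    return [doc_id for _, doc_id in scored[:top_n]]

def formulate_query(doc_ids, corpus, stop_words, max_terms=3):
    """Build a query in the corpus language from the most frequent non-stop words
    of the retrieved originals."""
    counts = Counter()
    for doc_id in doc_ids:
        original, _indexed = corpus[doc_id]
        counts.update(t for t in original.split() if t not in stop_words)
    return [term for term, _ in counts.most_common(max_terms)]

# Second-language (French) corpus, indexed by assumed English translations.
transfer = {
    "f1": ("le chat dort sur le tapis", "the cat sleeps on the rug"),
    "f2": ("le marche chute", "the market falls"),
}
# Third-language (German) corpus, indexed by assumed French translations.
target = {
    "g1": ("die katze ruht auf dem teppich", "le chat dort sur le tapis"),
    "g2": ("der markt sinkt", "le marche chute"),
}
FR_STOP = {"le", "la", "sur"}  # hypothetical stop-word list

initial_query = "cat rug".split()                          # first language
second_hits = retrieve(initial_query, transfer)            # data in the second language
second_query = formulate_query(second_hits, transfer, FR_STOP)
third_hits = retrieve(second_query, target)                # data in the third language
output = [target[doc_id][0] for doc_id in third_hits]
```

No English-to-German resource is used at any point; the French corpus carries the query from the first language to the third.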
In alternate methods, the data may be included in documents and the steps of retrieving data in the second and third languages may include the step of retrieving documents and ordering the documents in a decreasing order of relevance of the documents. The initial query may be preprocessed by performing at least one of tokenization, part-of-speech tagging, morphological analysis and stop-word removal. The data may be retrieved from at least one corpus and the method may further include the step of preprocessing data retrieved from the corpus by performing at least one of tokenization, name detection and morphological analysis. The method may further include the step of translating the data from the corpus in accordance with a language of the query. The method may further include the step of indexing the translated data by constructing an inverted index which lists documents in the corpus including elements of the query.
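The preprocessing and indexing steps mentioned above can be illustrated with a small sketch. The stop-word list and the crude suffix stripping (a stand-in for real morphological analysis) are assumptions chosen only to keep the example short.

```python
from collections import defaultdict

STOP_WORDS = {"the", "a", "of", "on"}  # hypothetical stop-word list

def preprocess(text):
    """Tokenize on whitespace, strip a trailing 's' as a crude stand-in for
    morphological analysis, then drop stop words."""
    tokens = text.lower().split()
    stems = [t[:-1] if t.endswith("s") and len(t) > 3 else t for t in tokens]
    return [t for t in stems if t not in STOP_WORDS]

def build_inverted_index(docs):
    """Map each term to the set of ids of the documents that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in preprocess(text):
            index[term].add(doc_id)
    return index

# Toy corpus (hypothetical data).
docs = {1: "The cats sleep", 2: "Dogs chase the cats", 3: "Markets fall"}
index = build_inverted_index(docs)

# Documents that include the query element "cat".
lookup = sorted(index["cat"])
```

With the index in place, finding the documents that include a query element is a single dictionary lookup rather than a scan of the whole corpus.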
The step of formulating the query may include the step of formulating the query based on contents of the retrieved data. The method may further include a plurality of corpora, each corpus having a different language associated therewith and each corpus including documents, and the method may further include the steps of retrieving data from each corpus in accordance with a query formulated in a language of a previous corpus, formulating queries in the language of the corpus and retrieving data from a next corpus in accordance with the query formulated in the previous corpus. The method may further include the step of providing an initial query in at least one of a plurality of languages to retrieve documents in the third language. The method may further include the steps of providing corpora in a multiplicity of languages different from the first language and retrieving documents in the multiplicity of languages from the corpora in accordance with the initial query.
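Chaining through a plurality of corpora might look like the following loop. It is a sketch under assumptions: every document carries an "original" text in its corpus language and an "indexed" text assumed to be its translation into the language of the previous stage, only the single best document is kept per stage, and the whole retrieved document serves as the next query. The corpora and field names are hypothetical.

```python
def retrieve(query, corpus):
    """Return the id of the document whose indexed side shares the most words
    with the query, or None when nothing matches."""
    q = set(query)
    best_id, best = None, 0
    for doc_id in sorted(corpus):
        overlap = len(q & set(corpus[doc_id]["indexed"].split()))
        if overlap > best:
            best_id, best = doc_id, overlap
    return best_id

def chain(initial_query, corpora):
    """Retrieve from each corpus in turn, using the retrieved document's own
    words as the query for the next corpus."""
    query, trail = list(initial_query), []
    for corpus in corpora:
        doc_id = retrieve(query, corpus)
        if doc_id is None:
            break  # nothing to carry forward
        trail.append(doc_id)
        query = corpus[doc_id]["original"].split()  # query in this corpus's language
    return trail

# English query -> French corpus -> German corpus -> Spanish corpus (toy data).
corpora = [
    {"f1": {"original": "chat dort", "indexed": "cat sleeps"},
     "f2": {"original": "marche chute", "indexed": "market falls"}},
    {"g1": {"original": "katze ruht", "indexed": "chat dort"},
     "g2": {"original": "markt sinkt", "indexed": "marche chute"}},
    {"s1": {"original": "gato duerme", "indexed": "katze ruht"},
     "s2": {"original": "mercado cae", "indexed": "markt sinkt"}},
]
trail = chain(["market"], corpora)
```

Each stage needs a resource only for its own pair of adjacent languages, which is the point of combining language pairs to reach a pair for which no direct resource exists.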
Another method for retrieving information based on a query includes the steps of providing an initial query in a first language, providing at least two corpora of information including textual representation of documents, each of the at least two corpora having a different language associated therewith other than the first language, retrieving documents from a transfer corpus of the at least two corpora to provide documents in the language of the transfer corpus in accordance with the initial query, formulating a revised query in the language of the transfer corpus based on the documents retrieved from the transfer corpus and retrieving documents from a target corpus of the at least two corpora to provide documents in the language of the target corpus in accordance with the revised query such that the documents retrieved from the target corpus are responsive to the initial query.
In other methods, the steps of retrieving documents may include the step of ordering the documents in a decreasing order of relevance of the documents. The initial query may be preprocessed by performing at least one of tokenization, part-of-speech tagging, morphological analysis and stop-word removal. The step of preprocessing documents retrieved from the transfer corpus and the target corpus by performing at least one of tokenization, name detection and morphological analysis may be included. The method may further include the step of translating the documents from the transfer corpus and the target corpus in accordance with a language of the initial query and the revised query, respectively. The method may also include the step of indexing the translated documents by constructing an inverted index which lists the documents which include elements of the initial query and the revised query. The method may further include a plurality of corpora as described herein. The method may further include the step of providing an initial query in at least one of a plurality of languages to retrieve documents in the target language. The method may further include the steps of providing corpora in a multiplicity of languages different from the first language and retrieving documents in the multiplicity of languages from the corpora in accordance with the initial query.
A program storage device readable by machine, tangibly embodying a program of instructions executable by the machine to perform method steps for retrieving information, the method steps including providing an initial query in a first language, retrieving data in a second language in accordance with the initial query, formulating the query in the second language, retrieving data in a third language in accordance with the query formulated in the second language and outputting data retrieved in the third language in accordance with the initial query.
In alternate embodiments, the program storage device may include the step of retrieving documents and ordering the documents in decreasing order of relevance of the documents. The initial query may be preprocessed by performing at least one of tokenization, part-of-speech tagging, morphological analysis and stop-word removal. The data may be retrieved from at least one corpus and the method may further include the step of preprocessing data retrieved from the corpus by performing at least one of tokenization, name detection and morphological analysis. The program storage device may further include the step of translating the data from the corpus in accordance with a language of the query. The program storage device may further include the step of indexing the translated data by constructing an inverted index which lists documents in the corpus including elements of the query. The step of formulating the query may include the step of formulating the query based on contents of the retrieved data. The program storage device may further include a plurality of corpora, each corpus having a different language associated therewith and each corpus including documents, and the method may include the steps of retrieving data from each corpus in accordance with a query formulated in a language of a previous corpus, formulating queries in the language of the corpus and retrieving data from a next corpus in accordance with the query formulated in the previous corpus. The method may further include the step of providing an initial query in at least one of a plurality of languages to retrieve documents in the third language. The method may further include the steps of providing corpora in a multiplicity of languages different from the first language and retrieving documents in the multiplicity of languages from the corpora in accordance with the initial query.
These and other objects, features and advantages of the present invention will become apparent from the following detailed description of illustrative embodiments thereof, which is to be read in connection with the accompanying drawings.