The following relates to the information processing arts, information retrieval arts, cross-lingual natural language processing arts, and related arts.
Information retrieval systems provide a user-friendly interface by which a user can retrieve documents from a database that are relevant to or match a query. Typically, an information retrieval system returns a ranked list of the “top N” documents that best match the query. An example of such a system is an Internet search engine.
In a simple approach, information retrieval can operate by identifying documents of the database that contain the same words as those specified in the query. In this approach, the query “President Clinton” retrieves documents that include the terms “President” and “Clinton”. However, this approach does not facilitate ranking; for example, if each of five different documents contains all words of the query, then there is no mechanism by which they can be ranked relative to one another. It may also fail if a relevant document contains most, but not all, terms of a query.
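The word-matching approach above can be sketched as follows. This is a minimal illustration only; the function name boolean_match and the toy documents are assumptions introduced here, not part of any particular retrieval system.

```python
def boolean_match(query, documents):
    """Return ids of documents containing every query term (no ranking)."""
    terms = set(query.lower().split())
    return [doc_id for doc_id, text in documents.items()
            if terms <= set(text.lower().split())]

# Hypothetical toy database, keyed by document id.
docs = {
    "d1": "president clinton visited berlin",
    "d2": "the president spoke about clinton and the economy",
    "d3": "clinton gave a speech",
}

matches = boolean_match("President Clinton", docs)
print(matches)
```

Both d1 and d2 are returned with no indication of which is the better match, and d3 is dropped entirely even though it contains one of the two query terms.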
In more sophisticated approaches, a language model is generated to represent the distribution of vocabulary terms in a document. One language model is PML(w|d), where w represents a vocabulary term, d represents the document being modeled, and PML(w|d) is the likelihood of term w in the document d computed using maximum likelihood estimation. The language model is typically smoothed, for example P(w|d)=λ·PML(w|d)+(1−λ)·PML(w|C), where PML(w|C) is the maximum-likelihood language model for the corpus C representing the likelihood of term w in the corpus C, P(w|d) is the smoothed language model, and λ controls the amount of smoothing. Such smoothing ensures a non-zero language model probability for vocabulary terms that occur in the corpus C but do not occur in the document d.
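As an illustrative sketch of the smoothed model P(w|d)=λ·PML(w|d)+(1−λ)·PML(w|C), the following computes maximum-likelihood unigram models for a document and a corpus and interpolates them. The helper names ml_model and smoothed_prob, the toy token lists, and the choice λ=0.5 are assumptions made for the example.

```python
from collections import Counter

def ml_model(tokens):
    """Maximum-likelihood unigram model: relative frequency of each term."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

def smoothed_prob(w, doc_model, corpus_model, lam=0.5):
    """P(w|d) = lam * PML(w|d) + (1 - lam) * PML(w|C)."""
    return lam * doc_model.get(w, 0.0) + (1 - lam) * corpus_model.get(w, 0.0)

# Hypothetical two-document corpus.
doc = ["president", "clinton", "president", "speech"]
other = ["bank", "river", "bank", "loan"]

doc_model = ml_model(doc)
corpus_model = ml_model(doc + other)

p_president = smoothed_prob("president", doc_model, corpus_model)
p_bank = smoothed_prob("bank", doc_model, corpus_model)
print(p_president, p_bank)
```

The term "bank" does not occur in the document, so its maximum-likelihood probability there is zero, yet the smoothed probability is non-zero because the term occurs elsewhere in the corpus.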
A language model provides a better metric for ranking the documents respective to the query, and facilitates relatively ranking different documents. If the query is represented as a bag-of-words q={q1, . . . , ql} where the terms q1, . . . , ql are the contents of the bag of words query q, then the probability that the query q would be generated by a document d can be estimated as
$$P(q \mid d) = \prod_{i=1}^{l} P(q_i \mid d)$$
where P(qi|d) are the outputs of the language model of document d for query terms qi.
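A minimal sketch of ranking by this query likelihood follows. It sums logarithms rather than multiplying probabilities, which yields the same ordering and avoids numerical underflow for long queries; that choice, the function name query_log_likelihood, the toy documents, and λ=0.5 are assumptions of the example.

```python
import math
from collections import Counter

def query_log_likelihood(query_terms, doc_tokens, corpus_tokens, lam=0.5):
    """Return log P(q|d) under a linearly smoothed unigram model."""
    doc_counts = Counter(doc_tokens)
    corpus_counts = Counter(corpus_tokens)
    log_p = 0.0
    for term in query_terms:
        p_doc = doc_counts[term] / len(doc_tokens)
        p_corpus = corpus_counts[term] / len(corpus_tokens)
        p = lam * p_doc + (1 - lam) * p_corpus
        if p == 0.0:
            # The term is absent from the whole corpus, so the product is zero.
            return float("-inf")
        log_p += math.log(p)
    return log_p

# Hypothetical toy corpus of tokenized documents.
docs = {
    "d1": "president clinton visited berlin".split(),
    "d2": "the president spoke about the economy".split(),
    "d3": "clinton gave a speech".split(),
}
corpus_tokens = [tok for tokens in docs.values() for tok in tokens]
query = ["president", "clinton"]

ranking = sorted(
    docs,
    key=lambda d: query_log_likelihood(query, docs[d], corpus_tokens),
    reverse=True,
)
print(ranking)
```

Unlike simple word matching, every document receives a score, so documents containing only some of the query terms (d2 and d3 here) are still ranked rather than discarded.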
A known extension is pseudo-relevance feedback. In this approach, a first retrieval operation is performed to obtain N most relevant documents, where N is an integer and N>0. A language model is derived for the N documents, and is used to update or enrich the original query. The idea is that the N most relevant documents are likely to be highly related to the subject matter of the query, and so similarities amongst the N most relevant documents provide information for enriching the query. In one approach, vocabulary terms that were not included in the original query but are highly probable in the language model of the top-N documents may be added to the original query to generate an improved query that is used in a second retrieval operation.
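The query-enrichment step of pseudo-relevance feedback might be sketched as below. For brevity the sketch uses raw term frequency over the top-N documents as a stand-in for their language model; a practical system would more likely weight terms against the corpus so that common words are not favored. The function name expand_query, the parameter k, and the toy feedback documents are assumptions.

```python
from collections import Counter

def expand_query(query_terms, top_docs, k=2):
    """Append the k most frequent feedback terms not already in the query."""
    counts = Counter(tok for doc in top_docs for tok in doc)
    candidates = [(w, c) for w, c in counts.items() if w not in query_terms]
    # Sort by descending count, breaking ties alphabetically for determinism.
    candidates.sort(key=lambda wc: (-wc[1], wc[0]))
    return list(query_terms) + [w for w, _ in candidates[:k]]

# Hypothetical top-N documents returned by a first retrieval pass.
top_docs = [
    ["president", "clinton", "impeachment", "senate"],
    ["clinton", "impeachment", "trial"],
    ["president", "clinton", "senate", "impeachment"],
]

expanded = expand_query(["president", "clinton"], top_docs, k=2)
print(expanded)
```

The expanded query would then be submitted in a second retrieval operation. Note that the step only adds terms; it never removes any term of the original query.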
The foregoing relates to mono-lingual information retrieval, in which the query and the corpus of documents are both in the same language. Cross-lingual information retrieval systems extend this concept by retrieving documents from a corpus of documents in a target natural language based on a query formulated in a different source natural language. For example, the query may be formulated in English, but the corpus of documents may be in German.
Cross-lingual information retrieval facilitates retrieval of documents from a multi-lingual corpus (for example, containing documents in English, French, and German) using a single source-language query (for example, a query in English). As another application, cross-lingual information retrieval enables a person not fluent in the language of a mono-lingual corpus to formulate a query in his or her native language, and thus to retrieve the most relevant documents. The user can then obtain human and/or machine translations only of the retrieved most relevant documents.
Cross-lingual information retrieval can be performed by translating the query from the source language into the target language and performing the query on the documents in the target language using the translated query. That is, the query is moved into the target language domain and henceforth the retrieval system operates in the target language domain. This leverages the monolingual indexing and retrieval machinery of the corpus, and entails only constructing a “front end” to perform the query translation. However, this approach depends upon accurate translation of the query from the source language to the target language. Accurate translation can be difficult because the query may be short, providing little context, and because of generally known difficulties in automated translation between different natural languages.
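A query-translation “front end” of this kind could be sketched with a dictionary lookup, as below. The English-to-German dictionary is a toy example invented for illustration, and passing untranslated terms (such as proper nouns) through unchanged is an assumption of the sketch rather than a described behavior.

```python
def translate_query(query_terms, dictionary):
    """Map each source-language term to its target-language translation(s)."""
    translated = []
    for term in query_terms:
        # Assumption: terms missing from the dictionary are kept as-is.
        translated.extend(dictionary.get(term, [term]))
    return translated

# Hypothetical English-to-German dictionary entries.
en_de = {
    "president": ["präsident"],
    "election": ["wahl"],
}

target_query = translate_query(["president", "election", "clinton"], en_de)
print(target_query)
```

The translated query can then be handed to an existing mono-lingual retrieval system for the target language without modifying that system.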
Query translation may rely on a cross-lingual dictionary. Cross-lingual dictionaries are sometimes automatically generated as lexicons extracted from parallel corpora (that is, corpora in which the same semantic textual content is represented in both the source and target languages). Such automatically generated dictionaries are typically noisy. They can contain errors due to misalignment of nominally parallel text, uncertainties in the extraction process, and mistakes in the extraction algorithm (for example, the extraction of entries is computed from an alignment between words and relies upon statistics). Noise can also be introduced when a dictionary is used in another domain. For example, if a dictionary is extracted from a news corpus and is then used for processing a social sciences or humanities article, the vocabulary will likely be substantially different for these two domains, and these differences may result in inappropriate word correlations. Another translation problem is polysemy, wherein a given source language term of the query may have multiple meanings calling for different translations in the target language. For example, the English term “bank” can mean either “financial institution” or “edge of a river”. To properly translate “bank” into a language other than English, one must know which meaning applies.
In the context of information retrieval, pseudo-relevance feedback is an attractive possibility for overcoming such translational difficulties, since the feedback documents provide potentially relevant contextual information. However, adapting pseudo-relevance feedback to cross-lingual information retrieval has heretofore been problematic. For example, one approach that has been attempted is to build language models of the target-language corpus documents in the source language, and to perform mono-lingual information retrieval including pseudo-relevance feedback using those source-language models. However, this approach is computationally intensive, and has been found to generate unacceptably noisy feedback.
It is generally more computationally efficient to translate the query into the target language and to perform the information retrieval in the target language domain. In that case, one might consider performing pseudo-relevance feedback entirely in the target language domain, after initial translation of the query. However, when the query includes polysemic terms, this approach will likely fail. The initial translation of the query into the target language generally includes all possible alternative target-language translations for the polysemic query term, since there is no basis for selecting between the alternative translations. All but one of those alternative target-language translations will generally be incorrect. Existing pseudo-relevance feedback techniques are not designed to remove incorrect terms, but rather to add additional terms so as to enrich the query. Hence, pseudo-relevance feedback performed entirely in the target language domain is not well-suited for addressing polysemic queries.
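The limitation described above can be illustrated with a small sketch: a polysemic source term is translated into all of its alternatives, and an additive feedback step then enriches the query without removing the incorrect alternative. The toy dictionary, the feedback terms, and the function names translate_all and additive_feedback are assumptions made for the example.

```python
def translate_all(query_terms, dictionary):
    """Translate each term into every alternative listed in the dictionary."""
    return [t for q in query_terms for t in dictionary.get(q, [q])]

def additive_feedback(query_terms, feedback_terms):
    """Enrich the query by appending feedback terms; nothing is removed."""
    return query_terms + [t for t in feedback_terms if t not in query_terms]

# Hypothetical English-to-German entries; "bank" is polysemic.
en_de = {
    "river": ["fluss"],
    "bank": ["bank", "ufer"],  # financial institution vs. edge of a river
}

translated = translate_all(["river", "bank"], en_de)
# Hypothetical terms suggested by target-language feedback documents.
enriched = additive_feedback(translated, ["wasser", "ufer"])
print(enriched)
```

For the query "river bank" only "ufer" is the intended translation, yet the financial sense "bank" remains in the enriched query because the feedback step can only add terms.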