Aspects of the exemplary embodiment disclosed herein relate to cross-language information retrieval and find particular application in connection with a system and method for translation of a query which considers potential ambiguity in the target domain.
Cross-Lingual Information Retrieval (CLIR) systems for retrieving documents in one language based on a query input in another language can provide useful tools, particularly when the domain of interest is largely in a different language from that of an information searcher. The input query is first translated, using an automatic machine translation system, into the language used in the target documents and then input to a search engine for querying a selected document collection.
One problem which arises is that Statistical Machine Translation (SMT) systems designed for general text translation tend to perform poorly when used for query translation. SMT systems are often trained on a corpus of parallel sentences, which have been automatically extracted from a parallel corpus of documents. The documents in the corpus are assumed to be translations of each other, at least in the source to target direction. The trained SMT systems thus implicitly take into account the phrase structure. However, the structure of queries can be very different from the standard phrase structure used in general text: Queries are often very short and the word order can be different from the typical full phrase which would be used in general text. Having a large number of parallel queries would enable training an SMT system adapted to translation of queries. However, no such corpora are available.
Moreover, even if such training data were to be made available, current SMT systems are usually trained to optimize the quality of the translation (e.g., using the BLEU score for assessing the quality of the translations output by the Moses phrase-based SMT system). This means that for a typical task related to query translation, such as CLIR, the function being optimized is not correlated with retrieval quality. For example, word order, which is crucial for good translation quality (and is taken into account by most MT evaluation metrics), is often ignored by IR engines.
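The mismatch between word-order-sensitive translation metrics and order-insensitive retrieval can be illustrated with a minimal sketch. The code below is not the BLEU metric itself, nor the scoring of any particular IR engine: it computes a clipped n-gram precision (one ingredient of BLEU-style metrics) and a simple bag-of-words overlap, and the function names and example strings are hypothetical, chosen only for illustration.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return the list of n-grams (as tuples) in a token sequence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def ngram_precision(candidate, reference, n):
    """Clipped n-gram precision of candidate against a single reference."""
    cand = Counter(ngrams(candidate, n))
    ref = Counter(ngrams(reference, n))
    total = sum(cand.values())
    if total == 0:
        return 0.0
    matched = sum(min(count, ref[gram]) for gram, count in cand.items())
    return matched / total


def bag_of_words_overlap(candidate, reference):
    """Fraction of reference tokens also present in candidate, ignoring order."""
    matched = sum((Counter(candidate) & Counter(reference)).values())
    return matched / max(len(reference), 1)


# Hypothetical example: same words, different order.
reference = "left bank of the river".split()
reordered = "river bank left of the".split()

bigram_score = ngram_precision(reordered, reference, 2)
bow_score = bag_of_words_overlap(reordered, reference)
print(bigram_score, bow_score)
```

Under these assumptions, the reordered string keeps every word of the reference, so an order-insensitive match treats the two as equivalent, while the bigram precision drops sharply because only one adjacent word pair survives the reordering.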
Conventional CLIR systems often employ components for query translation, document indexing, and document retrieval. While the translation is often considered independently from the retrieval component, several attempts have been made to bring them together. For example, one probabilistic model embeds the query translation step into the retrieval model. See, for example, Hiemstra, D. and de Jong, F., “Disambiguation Strategies for Cross-Language Information Retrieval,” ECDL 1999, pp. 274-293. However, this approach requires access to a document index, which is not feasible in the context of a translation service, where the collection to be searched is often unknown.
Another challenge for CLIR is the ambiguity of the translation. This is especially true when a generic dictionary or translation service is used for queries seeking domain-specific information (e.g., in the medical, art, or social science fields). For example, in English the word “bank” can refer to a river bank, a savings bank, a blood bank, the verb “to bank”, or the like. A query such as “where is the left bank?” may not provide the translation system with the correct context. Corresponding words for “bank” in French include “banque,” “banc,” and “rive,” the best selection depending on the context. The shortness of a query may not provide a conventional translation system with the context needed to resolve the ambiguity.
The exemplary embodiment addresses these problems, and others, by integrating ambiguity-reducing features into a machine translation system through a reranking framework.
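By way of a hedged illustration only, the general idea of reranking can be sketched as rescoring a list of candidate translations with a weighted combination of features and selecting the highest-scoring candidate. The feature names (`smt_score`, `domain_fit`), the feature values, and the weights below are hypothetical placeholders and are not taken from the exemplary embodiment.

```python
def rerank(candidates, weights):
    """Sort candidates by a weighted linear combination of their features.

    candidates: list of dicts with keys "text" and "features"
    weights: dict mapping feature name to weight (missing names count as 0)
    """
    def score(candidate):
        return sum(weights.get(name, 0.0) * value
                   for name, value in candidate["features"].items())

    return sorted(candidates, key=score, reverse=True)


# Hypothetical candidate translations of "where is the left bank?".
# "smt_score" stands in for a baseline translation model score and
# "domain_fit" for an assumed ambiguity-reducing feature; both values
# are invented for illustration.
nbest = [
    {"text": "où est la banque gauche",
     "features": {"smt_score": -1.0, "domain_fit": 0.1}},
    {"text": "où est la rive gauche",
     "features": {"smt_score": -1.2, "domain_fit": 0.9}},
]
weights = {"smt_score": 1.0, "domain_fit": 1.0}

best = rerank(nbest, weights)[0]
print(best["text"])
```

In this sketch the baseline score alone would prefer the first candidate, while the additional feature shifts the ranking toward the second; how such features are defined and how the weights are learned is left open here.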