Aspects of the exemplary embodiment disclosed herein relate to a system and method for translation of a query, which finds particular application in information retrieval.
Cross-Lingual Information Retrieval (CLIR) systems for retrieving documents in one language based on a query input in another language could provide useful tools, particularly when the domain of interest is largely in a different language from that of an information searcher. The input query is first translated, using an automatic machine translation system, into the language used in the documents and then input to a search engine for querying a document collection.
One problem which arises is that Statistical Machine Translation (SMT) systems designed for general text translation tend to perform poorly when used for query translation. SMT systems are often trained on a corpus of parallel sentences, which have been automatically extracted from a parallel corpus of documents. The documents in the corpus are assumed to be translations of each other, at least in the source-to-target direction. The trained SMT systems thus implicitly take into account the phrase structure. However, the structure of queries can be very different from the standard phrase structure used in general text: queries are often very short, and the word order can be different from the typical full phrase which would be used in general text. Having a large number of parallel queries would enable training an SMT system adapted to translation of queries. However, no such corpora are available.
Moreover, even if such training data were to be made available, current SMT systems are usually trained to optimize the quality of the translation (e.g., using the BLEU score for assessing the quality of the translations output by the Moses phrase-based SMT system). This means that for a typical task related to query translation, such as CLIR, the optimization function used is not correlated with the retrieval quality. For example, word order, which is crucial for good translation quality (and is taken into account by most MT evaluation metrics), is often ignored by IR models.
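The mismatch between the two objectives can be illustrated with a minimal sketch. The example below is illustrative only and is not taken from any particular SMT or IR system: the function names, the example query, and the use of clipped bigram precision as a stand-in for the order-sensitive part of a BLEU-style metric are all assumptions made for the illustration.

```python
from collections import Counter


def bag_of_words(text):
    """Order-free term counts, the kind of representation many IR models score on."""
    return Counter(text.lower().split())


def bigram_precision(candidate, reference):
    """Clipped bigram precision, one order-sensitive ingredient of BLEU-style metrics."""
    def bigrams(text):
        tokens = text.lower().split()
        return Counter(zip(tokens, tokens[1:]))

    cand, ref = bigrams(candidate), bigrams(reference)
    total = sum(cand.values())
    if total == 0:
        return 0.0
    # Each candidate bigram is credited at most as often as it occurs in the reference.
    matched = sum(min(count, ref[gram]) for gram, count in cand.items())
    return matched / total


# Hypothetical query translation and a reordering of the same words.
reference = "cheap flights to paris"
reordered = "paris flights cheap to"

same_bag = bag_of_words(reference) == bag_of_words(reordered)
precision_same = bigram_precision(reference, reference)
precision_reordered = bigram_precision(reordered, reference)
```

Under these assumptions, the two strings are indistinguishable to a bag-of-words retrieval model, while the order-sensitive translation metric rates the reordered string far lower than the reference.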
Conventional CLIR systems often employ components for query translation, document indexing, and document retrieval. While the translation is often considered independently from the retrieval component, several attempts have been made to bring them together. For example, a probabilistic model embeds the query translation step into the retrieval model. See, for example, Hiemstra, D. and de Jong, F., “Disambiguation Strategies for Cross-Language Information Retrieval,” ECDL 1999, pp. 274-293. However, this approach requires access to a document index, which is not feasible in the context of a translation service, where the collection to be searched is often unknown.
The exemplary embodiment addresses these problems, and others, by integrating IR metrics into a machine translation system by using a reranking framework.
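By way of illustration only, one generic form of reranking is to rescore an N-best list of candidate translations with a weighted sum of features and to output the highest-scoring candidate. The sketch below assumes this generic form; the feature names (`smt_score`, `ir_feature`), the candidate strings, the feature values, and the weights are hypothetical and are not details of the exemplary embodiment. In practice, the weights would be learned so as to favor candidates that yield better retrieval, rather than set by hand.

```python
def rerank(nbest, weights):
    """Return the candidate with the highest weighted sum of its features."""
    def score(candidate):
        return sum(
            weights.get(name, 0.0) * value
            for name, value in candidate["features"].items()
        )
    return max(nbest, key=score)


# Hypothetical N-best list for one query; all values are invented for illustration.
nbest = [
    {"text": "candidate A", "features": {"smt_score": -1.0, "ir_feature": 0.2}},
    {"text": "candidate B", "features": {"smt_score": -1.5, "ir_feature": 0.9}},
]

# Using the translation model score alone keeps the SMT system's own first choice.
baseline_choice = rerank(nbest, {"smt_score": 1.0, "ir_feature": 0.0})

# Weighting the (assumed) retrieval-oriented feature can promote another candidate.
reranked_choice = rerank(nbest, {"smt_score": 1.0, "ir_feature": 2.0})
```

With the weights shown, the baseline selects candidate A, whereas the reranker selects candidate B, whose lower translation score is outweighed by its higher retrieval-oriented feature.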