The web participation of languages other than English, both in terms of web documents and Internet users, has been rising rapidly, and the task of searching across multiple languages is becoming increasingly demanding. However, multi-lingual information retrieval (MLIR) for web pages remains challenging due to language barriers. One of the dominant approaches is to bridge the gap between the query and the documents by translating the query in the source language into the target language, or by translating the documents in the target language into the source language. To this end, cross-lingual search has been assisted by machine translation, which has seen substantial development in recent years.
In addition to machine translation, ranking the retrieved pages of different languages is another critical issue. In general, search result ranking is one of the central problems in many IR applications. In the past, research attention has focused on monolingual search result ranking, in which the retrieved documents are all written in the same language. As multilingual content and usage on the web grow, ranking for MLIR is becoming increasingly important.
MLIR is defined as the task of retrieving relevant documents in multiple languages and then ranking the retrieved documents by their relevance to the query. Most existing approaches to MLIR perform query translation followed by monolingual IR. Typically, the queries are translated using a bilingual dictionary, machine translation software, or a parallel corpus. One challenge of MLIR is the difficulty of estimating cross-lingual relevance, owing to information loss from imperfect translation. Other factors, such as the need to appropriately compare and merge documents in multiple languages, make MLIR ranking even more difficult.
Ranking has been extensively explored in monolingual IR studies. In particular, machine learning approaches to ranking, known as learning-to-rank, have received intensive attention in recent years. The learning task is to optimize a ranking function given a set of training data consisting of queries, their retrieved documents, and the relevance judgments of those documents made by human assessors. The learned ranking function is then used to predict the order of the retrieved documents for a new query.
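To make the learning task concrete, the following is a minimal sketch of a pairwise learning-to-rank scheme, with hypothetical feature vectors and a simple perceptron-style update; it is an illustration of the general idea rather than any specific published algorithm.

```python
# Minimal pairwise learning-to-rank sketch (hypothetical toy data).
# A linear scoring function w . x is trained so that, within each pair,
# the document judged more relevant receives the higher score.

def train_pairwise(pairs, epochs=50, lr=0.1):
    """pairs: list of (features_hi, features_lo) tuples, where
    features_hi belongs to the more relevant document of the pair."""
    dim = len(pairs[0][0])
    w = [0.0] * dim
    for _ in range(epochs):
        for hi, lo in pairs:
            # Perceptron-style update whenever the pair is mis-ordered.
            if sum(wi * (h - l) for wi, h, l in zip(w, hi, lo)) <= 0:
                for i in range(dim):
                    w[i] += lr * (hi[i] - lo[i])
    return w

def score(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))

# Toy training pairs: feature 0 (e.g. a BM25-like score) correlates
# with relevance, feature 1 is noise.
pairs = [([0.9, 0.1], [0.2, 0.3]), ([0.8, 0.5], [0.1, 0.4])]
w = train_pairwise(pairs)

# Rank new documents for an unseen query by the learned function.
docs = [[0.2, 0.3], [0.9, 0.1]]
ranked = sorted(docs, key=lambda x: score(w, x), reverse=True)
```

The learned weights order the documents with the stronger first feature ahead, mirroring how a learned ranking function predicts the document order for a new query.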
However, the existing methods do not learn a cross-lingual ranking function directly, nor do they directly estimate the contribution of individual ranking features to MLIR relevance. Instead, existing MLIR ranking techniques usually combine translation and monolingual IR to derive relevance scores, which are then converted by normalization methods so that they become comparable for combination and ranking. In MLIR, candidate document retrieval is usually followed by a re-ranking process that merges several ranked document lists together. Existing MLIR ranking algorithms focus on how to compare and combine the ranking scores associated with each document list for the merge. MLIR ranking usually starts with a normalization step that makes the relevance scores of the different lists comparable; typical normalization methods include Min-Max, Z-score, and CORI. The normalized scores are then combined, using either the CombSUM algorithm or a logistic regression model, to generate the final MLIR relevance score. Because these techniques neither learn a ranking function directly nor estimate the contribution of individual ranking features to MLIR relevance, they do not work well for multi-lingual web search, which involves a large number of features.
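The normalize-then-combine pipeline described above can be sketched as follows, using Min-Max normalization and CombSUM; the per-language result lists and their raw scores are hypothetical.

```python
# Sketch of score normalization and CombSUM merging for MLIR
# (hypothetical document IDs and scores).

def min_max(scores):
    """Min-Max normalization: rescale one list's scores to [0, 1]."""
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc: 0.0 for doc in scores}
    return {doc: (s - lo) / (hi - lo) for doc, s in scores.items()}

def comb_sum(result_lists):
    """CombSUM: sum each document's normalized scores across lists."""
    merged = {}
    for lst in result_lists:
        for doc, s in min_max(lst).items():
            merged[doc] = merged.get(doc, 0.0) + s
    return sorted(merged.items(), key=lambda kv: kv[1], reverse=True)

# Two per-language result lists whose raw scores are not directly
# comparable (e.g. produced by different monolingual retrieval runs).
english = {"d1": 12.0, "d2": 7.0, "d3": 5.0}
french = {"d2": 0.9, "d4": 0.3}
ranking = comb_sum([english, french])
```

After Min-Max normalization, a document retrieved in both lists (here "d2") accumulates score from each, which is why such documents tend to rise in the merged ranking under CombSUM.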
Although learning-to-rank has become an active research field, little effort has been devoted to adapting state-of-the-art ranking algorithms to multi-lingual search. Further work in this direction is needed to develop better cross-lingual ranking and re-ranking techniques.