As the number of languages used on the Web grows, both in Web documents and among Internet users, searching for content across multiple languages has become increasingly difficult. Multilingual information retrieval (MLIR) technology for web pages addresses this area, but is complex because of the many language barriers involved, and in particular translation problems.
For example, given a query in one language, one multilingual search approach attempts to locate documents in another, target language by using machine translation to translate the query into that target language. Another approach works in essentially the opposite direction and translates the documents rather than the query.
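The query-translation approach can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the toy lexicon `TOY_EN_TO_ES` stands in for a real machine translation system, and `translate_query`, `retrieve`, and `search` are hypothetical names, not part of any described system. Terms missing from the lexicon are dropped, which mimics the information loss of imperfect translation.

```python
from typing import Dict, List

# Toy stand-in for a machine translation system (hypothetical, illustration only).
TOY_EN_TO_ES: Dict[str, str] = {"cat": "gato", "food": "comida"}

# Tiny target-language (Spanish) collection, invented for this sketch.
SPANISH_DOCS: Dict[str, str] = {
    "d1": "comida para gato",
    "d2": "el gato duerme",
    "d3": "recetas de pasta",
}


def translate_query(query: str, lexicon: Dict[str, str]) -> List[str]:
    """Translate each query term; terms not in the lexicon are dropped."""
    return [lexicon[t] for t in query.lower().split() if t in lexicon]


def retrieve(terms: List[str], docs: Dict[str, str]) -> List[str]:
    """Return ids of documents matching at least one term, best match first."""
    scores: Dict[str, int] = {}
    for doc_id, text in docs.items():
        tokens = set(text.lower().split())
        hits = sum(1 for t in terms if t in tokens)
        if hits:
            scores[doc_id] = hits
    # Sort by descending hit count, breaking ties by document id.
    return sorted(scores, key=lambda d: (-scores[d], d))


def search(query: str) -> List[str]:
    """Translate the source-language query, then search the target collection."""
    return retrieve(translate_query(query, TOY_EN_TO_ES), SPANISH_DOCS)
```

In this sketch the query term "cheap" has no translation and silently disappears, so it cannot influence which documents are retrieved. That is a small-scale version of the translation loss described above.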
However, not only does such machine translation often fail to find relevant documents, generally because of translation issues, but the ranking of the retrieved pages of different languages is also poor. In general, it is very difficult to estimate cross-lingual relevancy because of the information loss due to imperfect translation.
MLIR ranking is even more difficult because documents in multiple languages have to be compared and merged appropriately. In short, there is a lack of suitable ranking algorithms for multilingual web search.
A research field referred to as “learning-to-rank” is directed to learning a unique ranking function for a set of documents; a multilingual version is directed to learning a unique ranking function for documents of different languages. Intuitively, this is done by representing documents of different languages within a unified feature space and then performing a monolingual ranking task. However, the information loss and misinterpretation caused by imperfect query and document translation make multilingual search ranking a very difficult problem, and results heretofore have not been acceptable in many instances.
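The idea of a unified feature space can be sketched as follows. This is only an illustration under stated assumptions: the two features (overlap with the translated query, and a translation-confidence value), the `Doc` type, and the hand-set `WEIGHTS` are all hypothetical. In an actual learning-to-rank setting the weights would be fit from labeled relevance data rather than chosen by hand.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass
class Doc:
    doc_id: str
    lang: str
    # Same feature layout for every language (assumed for this sketch):
    # (overlap with translated query, translation confidence)
    features: Tuple[float, float]


# Hypothetical weights; a learning-to-rank method would learn these from data.
WEIGHTS: Tuple[float, float] = (0.7, 0.3)


def score(doc: Doc, weights: Sequence[float] = WEIGHTS) -> float:
    """Single linear ranking function shared by documents of all languages."""
    return sum(w * x for w, x in zip(weights, doc.features))


def rank(docs: List[Doc], weights: Sequence[float] = WEIGHTS) -> List[Doc]:
    """Merge documents of different languages into one ranked list."""
    return sorted(docs, key=lambda d: score(d, weights), reverse=True)


# Invented candidates in three languages, already mapped to the shared features.
CANDIDATES = [
    Doc("en-1", "en", (0.9, 1.0)),
    Doc("es-1", "es", (0.8, 0.6)),
    Doc("zh-1", "zh", (1.0, 0.2)),
]

ranked_ids = [d.doc_id for d in rank(CANDIDATES)]
```

Because every document is described by the same features regardless of language, one scoring function can compare and merge them directly. The sketch also shows where the difficulty lies: if the translation-derived features are noisy, the merged ranking inherits that noise.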