1. Technical Field
A “Cross-Lingual Unified Relevance Model” provides various techniques for improving the web search quality of a low-resource language/market using the search results from a high-resource language/market.
2. Context
In general, the quality of web search relies on the availability of data resources used to develop search engines. Such resources include, but are not limited to, labeled training data, large amounts of web documents, and large amounts of user feedback data (e.g., user search histories and corresponding click-through logs, etc.). However, such data resources are scarce for many languages/markets. A simple example of this is that data resources for English language user searches in the United States tend to be more complete than similar data resources for Korean language searches in South Korea.
More specifically, modern web search engines generally rely heavily on data-driven approaches that go beyond traditional information retrieval (IR) ranking by incorporating additional features into machine-learned rankers. Typical ranker features include “PageRank”, click-through data, and various query and document classifiers. The “quality” of a learned ranker greatly depends upon the amount of training data available from different resources, such as human relevance judgments and user feedback.
While these resources are available in large quantities for some “high-resource” languages/markets (e.g., English/U.S.), for many other “low-resource” languages/markets, the resources are not available or are very limited. Further, even if expensive human relevance judgments are collected, click-through data may not be plentiful for some smaller markets, while link analysis features may not be as helpful for nascent markets with fewer documents and links. Consequently, rather than annotating data for each low-resource market, several strategies have been applied to exploit existing high-resource rankers. One approach is domain adaptation of machine-learned rankers.
Some conventional work has addressed the language transfer challenge in IR by using English web search results to improve the ranking of non-ambiguous Chinese queries (referred to as linguistically non-local queries). Other research uses English as an assisting language to provide pseudo-relevant terms for queries in different languages. Unfortunately, the generality of these approaches is limited either by the type of queries they handle or by the setting (e.g., traditional TREC style) in which they are explored. For example, one such technique uses training data from a general domain to improve the accuracy of English queries from a Korean market. However, in this particular case both the in-domain and out-of-domain data are in English, and hence the set of features used by the learning algorithm remains the same. Such solutions are not optimal, since, from a local perspective, users generally prefer local language queries that return local language results, e.g., Korean language queries that return Korean language results, rather than English language queries from a Korean market.
With respect to tests that measure the deviation of feature distributions, several measures have been proposed in the domain adaptation literature for computing the distance between the source and target distributions of a feature. However, the literature in this area is mainly directed at deriving theoretical bounds on the performance of adapted classifiers. Consequently, such techniques may not be suitable for IR, because they do not adequately handle query-set variance and query-block correlation, and because such methods may require on-line computations during actual querying that can reduce query performance from a user perspective.
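For illustration only, the following minimal Python sketch shows one way a distance between the source and target distributions of a single ranker feature could be computed. The text above does not name a particular measure; the Jensen-Shannon divergence used here, the function names `histogram` and `js_divergence`, and the binning parameters are all assumptions made for this example. Note that the sketch pools feature values across all queries, and therefore shares the limitation noted above: it does not account for query-set variance or query-block correlation.

```python
import math
from typing import List, Sequence


def histogram(values: Sequence[float], lo: float, hi: float, bins: int) -> List[float]:
    """Normalized histogram of feature values over [lo, hi] (hypothetical helper)."""
    if not values:
        raise ValueError("values must be non-empty")
    if bins <= 0 or hi <= lo:
        raise ValueError("need bins > 0 and hi > lo")
    counts = [0] * bins
    width = (hi - lo) / bins
    for v in values:
        # Clamp out-of-range values into the first or last bin.
        idx = int((v - lo) / width)
        idx = max(0, min(idx, bins - 1))
        counts[idx] += 1
    total = len(values)
    return [c / total for c in counts]


def js_divergence(p: Sequence[float], q: Sequence[float]) -> float:
    """Jensen-Shannon divergence (base 2) between two discrete distributions.

    Returns 0.0 for identical distributions and 1.0 for distributions
    with disjoint support. This is an assumed choice of distance measure.
    """
    if len(p) != len(q):
        raise ValueError("distributions must have the same number of bins")
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(x: Sequence[float], y: Sequence[float]) -> float:
        # Terms with x_i == 0 contribute nothing; y_i > 0 whenever x_i > 0.
        return sum(a * math.log2(a / b) for a, b in zip(x, y) if a > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


# Made-up feature values for a high-resource (source) and a
# low-resource (target) market, purely for demonstration.
source_values = [0.1, 0.2, 0.25, 0.3, 0.8]
target_values = [0.6, 0.7, 0.75, 0.9, 0.95]

p = histogram(source_values, 0.0, 1.0, bins=4)
q = histogram(target_values, 0.0, 1.0, bins=4)
print(js_divergence(p, q))
```

A larger value suggests that the feature behaves differently in the two markets. Because the histograms can be built off-line from logged data, a computation of this kind need not add any cost at query time.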