Query translation is a common technique used by Cross-Language Information Retrieval (CLIR) systems, which are designed to retrieve information written in a language different from the language of the user's query. CLIR systems are incorporated into search engines, online dictionaries, and numerous other applications where translation of terms is desirable. System performance degrades severely for queries that contain out-of-vocabulary (OOV) terms, that is, terms that cannot be translated using a known database of translation pairs. For example, an analysis of a query log for a Chinese search engine reveals that over 80% of the 19,124 most frequently searched terms are not included in a typical Chinese-English dictionary. Because web queries are typically short, often only two or three words, a single OOV term in a query can severely reduce the relevance of the retrieved search results. To deal with the OOV issue, a database of known translations not contained in a typical bilingual dictionary can be built, but new terms that require translation are constantly entering the lexicon. For example, terms corresponding to new products, new movie names, new entertainers, and new slang words are constantly appearing. Manually adding translations for all of these new terms would require an impractically large amount of human effort.
Due in part to a sharp increase in the quantity of multilingual resources, the Internet has shown great promise as a resource for mitigating some of the limitations of CLIR systems. Recent research on automated web mining methods for term translations has primarily focused on mixed-language webpages, in which terms and their translations co-occur on the same page. Such bilingual pages are fairly common on the web for many language pairs, such as Chinese-English, Japanese-English, and Spanish-English.
A first approach to extracting the information contained in these webpages is the search-snippet-based method, which leverages co-occurrence statistics from the search snippets of bilingual webpages. The method involves searching for a foreign term in native language documents and, from the top-n returned snippets of the relevant bilingual pages, selecting as the translation of the foreign term the native language string that has the highest co-occurrence count with the foreign term. This method is based on the assumption that the more frequently a native language string co-occurs with the foreign term in the snippets, the more likely that string is the translation of the foreign term. This approach is effective in mining high frequency term translations, but is ineffective for low frequency term translations because a search engine's relevance ranking algorithm typically is not based on the occurrence of a term's translation, so the top-n snippets for a rarely used term may contain its translation only a few times, if at all. Because low frequency terms make up a significant portion of the bilingual lexicon, this severely limits the effectiveness of the snippet-based mining scheme.
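The snippet-based selection described above can be illustrated with a minimal Python sketch. It is not taken from any particular system: the function names, the `top_n` and `max_len` parameters, and the use of contiguous runs of CJK ideographs as stand-ins for "native language strings" are all assumptions made for illustration, and a real system would obtain the snippets from a search engine and use proper word segmentation.

```python
import re
from collections import Counter

# Assumption: a run of CJK ideographs approximates a native language string.
CJK_RUN = re.compile(r"[\u4e00-\u9fff]+")


def candidate_strings(snippet, max_len=4):
    """Yield every CJK substring of length 2..max_len found in the snippet."""
    for run in CJK_RUN.findall(snippet):
        for n in range(2, max_len + 1):
            for i in range(len(run) - n + 1):
                yield run[i:i + n]


def best_translation(foreign_term, snippets, top_n=10):
    """Return the candidate that co-occurs with foreign_term in the most snippets.

    Only the first top_n snippets are examined. Returns None when no
    snippet contains both the foreign term and a candidate string.
    """
    counts = Counter()
    for snippet in snippets[:top_n]:
        if foreign_term.lower() not in snippet.lower():
            continue  # no co-occurrence in this snippet
        # Count each candidate at most once per snippet.
        counts.update(set(candidate_strings(snippet)))
    if not counts:
        return None
    return counts.most_common(1)[0][0]
```

With a handful of snippets that all mention the same pair, the most frequent co-occurring string wins. When the foreign term is rare and the returned snippets do not contain it alongside its translation, the function returns `None`, which mirrors the low frequency limitation noted above.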
To complement search-snippet-based mining, a second approach can be used to identify term translations using one predefined layout pattern, or a fixed set of such patterns, for translation pairs on a bilingual webpage, e.g., a term followed by its translation in parentheses, such as a Chinese term followed by "(Superman)". This second approach is able to discover low frequency term translation pairs as long as the pairs are captured by the patterns. However, because webpages are created by different people with different layout conventions, it is problematic to assume that a finite set of patterns can cover every, or even most, bilingual webpages.
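As a rough illustration of the pattern-based approach, the following Python sketch hard-codes a single layout pattern: a run of CJK ideographs followed by a Latin-script term in half-width or full-width parentheses. The regular expression and the `extract_pairs` name are illustrative assumptions, not a pattern set drawn from any actual system.

```python
import re

# Assumed layout pattern: CJK term, optional whitespace, then a Latin-script
# term enclosed in half-width "()" or full-width "（）" parentheses.
PAREN_PATTERN = re.compile(
    r"([\u4e00-\u9fff]+)\s*[(（]\s*([A-Za-z][A-Za-z0-9 .'-]*?)\s*[)）]"
)


def extract_pairs(page_text):
    """Return (native term, foreign term) pairs matching the assumed pattern."""
    return [(m.group(1), m.group(2)) for m in PAREN_PATTERN.finditer(page_text)]
```

A pair is found even if it appears only once on a page, which is why this style of mining can reach low frequency terms. A page that writes the same pair with a different layout, for instance separated by a dash rather than parentheses, yields nothing, which illustrates the coverage problem of a fixed pattern set.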
The approaches described in this section are approaches that could be pursued, but not necessarily approaches that have been previously conceived or pursued. Therefore, unless otherwise indicated, it should not be assumed that any of the approaches described in this section qualify as prior art merely by virtue of their inclusion in this section.