Lexical translation is the task of translating individual words or phrases, either directly or as part of a knowledge-based machine translation (MT) system. In contrast with statistical MT, lexical translation does not require aligned corpora as input. Because large aligned corpora do not exist for many language pairs and are very expensive to generate, lexical translation is possible for a much broader set of languages than statistical MT. Generally, the information required for lexical translation is much easier to obtain than aligned corpora.
While lexical translation has a long history, interest in it peaked in the 1990s. Many of the systems from that period used machine-readable dictionaries (MRDs) to assist in the manual creation of lexicons, or used automated acquisition with post-editing. Despite the shift in emphasis towards statistical MT, research on knowledge-based MT, which depends on lexicon acquisition, has continued. The proliferation of MRDs and the rapid growth of multilingual Wiktionaries offer the opportunity to scale lexical translation to an unprecedented number of languages. Moreover, the increasing international adoption of the Web yields opportunities for new applications of lexical translation systems.
Translation lexicons are also a vital resource for cross-lingual information retrieval (CLIR), a subfield prompted in part by the TREC conferences and a series of SIGIR CLIR workshops. Much of the CLIR research has focused on a small number of language pairs, building systems that must be adapted to one language pair at a time. While early CLIR systems typically relied on bilingual dictionaries, corpus-based and hybrid methods soon outstripped purely dictionary-based systems. Some of these methods derive word translations from parallel text. There are also hybrid systems that use corpus-based techniques to disambiguate translations provided by bilingual dictionaries.
The main drawback of using bilingual dictionaries, in past work, has been word-sense ambiguity. A single term in the source language is typically translated into multiple terms in the target language, mixing different word senses. Combining information from multiple bilingual dictionaries only exacerbates this problem: translating from language l1 into l2, and then translating each of the possible l2 translations into a third language l3, quickly leads to an explosion of translations.
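The explosion described above can be illustrated with a minimal sketch. The two toy dictionaries below (English to French, French to Spanish) are hypothetical entries chosen for illustration and are not drawn from any particular dictionary; the function names are likewise assumptions. Chaining them through the intermediate language returns every translation of every intermediate term, regardless of which sense of the original word was intended.

```python
from typing import Dict, Set

# A bilingual dictionary maps a source term to a set of target terms.
BilingualDict = Dict[str, Set[str]]

# Hypothetical toy entries (l1 = English, l2 = French, l3 = Spanish).
# "spring" covers several senses: a season, a coil, a water source.
EN_FR: BilingualDict = {
    "spring": {"printemps", "ressort", "source"},
}
FR_ES: BilingualDict = {
    "printemps": {"primavera"},
    "ressort": {"resorte", "muelle"},
    "source": {"fuente", "manantial", "origen"},
}


def translate(word: str, dictionary: BilingualDict) -> Set[str]:
    """Return all translations listed for a word (empty set if unknown)."""
    return set(dictionary.get(word, set()))


def translate_via_pivot(
    word: str, first: BilingualDict, second: BilingualDict
) -> Set[str]:
    """Naively chain two dictionaries through an intermediate language.

    No sense information is tracked, so translations belonging to
    different senses of the original word are mixed together.
    """
    results: Set[str] = set()
    for pivot in translate(word, first):
        results |= translate(pivot, second)
    return results


direct = translate("spring", EN_FR)
chained = translate_via_pivot("spring", EN_FR, FR_ES)
print(len(direct), sorted(direct))
print(len(chained), sorted(chained))
```

In this toy case, three candidate translations in l2 become six in l3, and nothing in the output indicates which of them correspond to the sense the user had in mind. Each additional dictionary in the chain can multiply the candidates again.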
On the Web, commercial search engines such as Google™, French Yahoo™, and German Yahoo™ offer query translation capability for only a handful of languages. Google™ and other Internet companies have also fielded word translator tools that enable a reader of a Web page to view the translation of particular words, which is helpful if the user is, for example, a Japanese speaker reading an English text who has come across an unfamiliar word. In contrast to the few languages for which translation is currently offered, it would be preferable to translate between a large number of languages while preserving word senses, thereby inferring translations that are not found in any single dictionary. It would also be desirable to provide a translation platform for “plugging in” more and more dictionaries, and for adding increasingly comprehensive Wiktionaries and corpus-based translations, all of which should lead directly to improved use of cross-lingual translations over time.
Lexical translation offers considerable practical utility in several applications. While it does not solve the full machine-translation problem, it is valuable for tasks including the translation of search queries, meta-tags, and individual words or phrases. Another prospective application for lexical translation is in searching for images or other non-text entities. Images are an excellent example of entities that might more easily be found using lexical translations of an input word or phrase, although the same approach might be used to find other types of multimedia files, such as video files. Most search engines on the Internet retrieve images based on the words in the “vicinity” of the images, which limits a conventional search engine to retrieving only a few of the relevant images that might otherwise be found. Although images are universally understood regardless of the language spoken or understood by the searcher, an English-language search will fail to find images tagged with Chinese or other non-English words or phrases. Similarly, a search made using Dutch-language tags will fail to find images tagged in English or other languages.
To address this problem, it would be desirable to provide a cross-lingual image search capability that would enable searchers to translate and disambiguate their queries before sending them to a conventional image search engine, such as Google™. Currently, this approach would require the searcher to manually translate a word or phrase from an initial language that the searcher understands, determine which translations in other languages are appropriate, and enter each resulting word or phrase by hand.