Many multilingual applications, such as machine translation or cross-language information retrieval software, require a bilingual lexicon to produce the desired translation results. However, manually compiled bilingual dictionaries are often inadequate for this purpose because of their limited coverage. For example, machine translation or cross-language information retrieval software may be unable to correctly translate a first term written in a first language into a second term of the same meaning in a second language because the first term is not in the bilingual dictionary in use. Such terms may be referred to as Out-Of-Vocabulary (OOV) terms. These OOV terms may severely degrade the quality of a machine translated document, or drastically hinder the ability of cross-language information retrieval software to retrieve relevant data.
With the sharp increase in bilingual pages (web pages with content in two or more languages), web mining of term translations, that is, of a term in a first language located in proximity to a translation of the term in a second language, can greatly alleviate this problem. Current web mining methods may rely heavily on co-occurrence statistics. However, such methods are often unreliable in extracting low frequency term translations, that is, term translations that occur in only a few web pages on the World Wide Web. This unreliability generally arises because low frequency term translations are often hard to find using search engines and are more likely to be subject to noise during mining. Since the majority of term translations available on the Web are in fact low frequency term translations, current web mining methods are ill suited for large scale mining.
In other instances, some web mining methods may rely on a manually defined set of pattern rules to extract term translations from web pages, because term translations on a single web page tend to follow similar layout patterns. However, a major problem with these methods is that the layout patterns of term translations may vary from web page to web page, so that a fixed set of pattern rules cannot cover all bilingual web pages and often extracts noise from non-bilingual pages.
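The limitation of a fixed rule set can be illustrated with a minimal sketch. The rule below is a hypothetical example and is not taken from any particular mining method: it assumes that a translation appears as a run of Chinese characters followed by an English gloss in parentheses. The names PAREN_RULE and extract_pairs, and the sample page strings, are likewise assumptions made only for illustration.

```python
import re

# Hypothetical fixed layout rule (an assumption for illustration): a run of
# Chinese characters followed by an English gloss in ASCII or full-width
# parentheses, e.g. "机器翻译 (machine translation)".
PAREN_RULE = re.compile(
    r"([\u4e00-\u9fff]+)\s*[\((]\s*([A-Za-z][A-Za-z \-]*?)\s*[\))]"
)

def extract_pairs(text):
    """Return (source_term, candidate_translation) pairs matched by the rule."""
    return [(m.group(1), m.group(2)) for m in PAREN_RULE.finditer(text)]

# A page that happens to follow the assumed layout.
page_a = "机器翻译 (machine translation) 是一个研究领域。"
# A bilingual page that uses a different layout (colon instead of parentheses).
page_b = "信息检索: information retrieval"
# A page where the layout matches but the parenthesized text is not a translation.
page_c = "价格 (USD)"

pairs_a = extract_pairs(page_a)  # the rule finds the intended pair
pairs_b = extract_pairs(page_b)  # the rule misses the pair entirely
pairs_c = extract_pairs(page_c)  # the rule extracts noise
```

In this sketch the same rule succeeds on the first page, finds nothing on the second, and returns a spurious pair on the third, which mirrors the coverage and noise problems described above.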