The present invention relates generally to methods, and apparatus therefor, for extracting bilingual lexicons from comparable corpora using a bilingual dictionary.
Bilingual documents forming comparable corpora contain text written in different languages that, as defined herein, refers to a similar topic, but the documents are not translations of each other (i.e., they are not parallel corpora). That is, while corpora that are comparable “talk about the same thing”, corpora that are parallel are direct or mutual translations that maximize comparability (e.g., as found in a bilingual dictionary). Except for multilingual documents produced by translation for international or government organizations (e.g., parallel product documentation), most multilingual corpora are not parallel but instead comparable (e.g., comparable news stories). However, while few applications produce bilingual lexicons extracted from parallel corpora, many applications require bilingual lexicons.
Thus, due to the limited availability of parallel corpora, methods exist for extracting bilingual lexicons from comparable corpora. Some of these known methods of bilingual lexicon extraction from comparable corpora (BLEFCC) are described, for example, in the following publications, which are incorporated herein by reference and hereinafter referred to as the “BLEFCC References”: Rapp, “Identifying Word Translations In Nonparallel Texts”, in Proceedings of the Annual Meeting of the Association for Computational Linguistics, 1995; Peters et al., “Capturing The Comparable: A System For Querying Comparable Text Corpora”, in JADT'95, 3rd International Conference on Statistical Analysis of Textual Data, pages 255-262, 1995; Tanaka et al., “Extraction Of Lexical Translations From Non-Aligned Corpora”, in International Conference on Computational Linguistics, COLING'96, 1996; Shahzad et al., “Identifying Translations Of Compound Nouns Using Non-Aligned Corpora”, in Proceedings of the Workshop MAL'99, pages 108-113, 1999; Fung et al., “A Statistical View On Bilingual Lexicon Extraction: From Parallel Corpora To Nonparallel Corpora”, in J. Veronis, editor, Parallel Text Processing, Kluwer Academic Publishers, 2000.
Bilingual lexicon extraction from comparable corpora as described in the BLEFCC References relies on the assumption that if two words are mutual translations, then their more frequent collocates, in a broad sense, are likely to be mutual translations as well. Based on this assumption, the approach described in the BLEFCC References for extracting bilingual lexicons from comparable corpora builds context vectors for each source word and each target word, translates the target context vectors using a bilingual dictionary, and compares each translated context vector with the source context vectors.
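By way of illustration only, the known approach summarized above may be sketched as follows. The sketch is a minimal one under stated assumptions and is not taken from the BLEFCC References: the toy sentences, the window size, the raw co-occurrence counts, the cosine measure, and the names `context_vectors`, `translate` and `fr_to_en` are all hypothetical choices made for this example.

```python
import math
from collections import Counter

def context_vectors(sentences, window=2):
    """Map each word to counts of the words seen within `window` positions of it."""
    vectors = {}
    for tokens in sentences:
        for i, word in enumerate(tokens):
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            ctx = vectors.setdefault(word, Counter())
            for j in range(lo, hi):
                if j != i:
                    ctx[tokens[j]] += 1
    return vectors

def translate(vector, dictionary):
    """Re-express a target context vector on source-language axes.

    Context words missing from the dictionary are dropped (the coverage problem).
    """
    out = Counter()
    for word, weight in vector.items():
        for source_word in dictionary.get(word, ()):
            out[source_word] += weight
    return out

def cosine(a, b):
    dot = sum(a[k] * b[k] for k in a if k in b)
    norm = math.sqrt(sum(x * x for x in a.values())) * math.sqrt(sum(x * x for x in b.values()))
    return dot / norm if norm else 0.0

# Hypothetical toy comparable corpora (assumption: one tokenized sentence per list).
english = [["the", "bank", "keeps", "money"], ["the", "river", "floods", "the", "plain"]]
french = [["la", "banque", "garde", "argent"], ["le", "fleuve", "inonde", "la", "plaine"]]
# Hypothetical seed dictionary, target (French) word -> source (English) words.
fr_to_en = {"garde": ["keeps"], "argent": ["money"], "inonde": ["floods"], "plaine": ["plain"]}

source_vectors = context_vectors(english)
target_vectors = context_vectors(french)

# Compare the source word "bank" with each translated target context vector.
scores = {
    t: cosine(source_vectors["bank"], translate(target_vectors[t], fr_to_en))
    for t in ("banque", "fleuve")
}
best = max(scores, key=scores.get)
```

In this toy setting “banque” is ranked above “fleuve” as a candidate translation of “bank”, because only the context of “banque” translates into words that also surround “bank”.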
With these known methods, two problems of “coverage” and “polysemy/synonymy” are known to exist when using a bilingual dictionary to extract bilingual lexicons from comparable corpora. The first problem of coverage occurs when too few corpus words are covered by the bilingual dictionary. However, if the context considered around each word is large enough, the method described in the BLEFCC References should cope with the problem of coverage for frequently used words (as opposed to rare words), since it is likely that some context words will belong to the general language and thus be covered by the dictionary.
The second problem of polysemy/synonymy associated with the use of bilingual dictionaries to extract bilingual lexicons from comparable corpora arises when several dictionary entries have the same meaning (i.e., synonymy) or when a single entry has several meanings (i.e., polysemy), which becomes more significant when only one meaning is represented in the corpus. The method described in the BLEFCC References does not contend well with the polysemy/synonymy problem because all entries on either side of the bilingual dictionary are treated as orthogonal.
Similarities with respect to synonyms should preferably not be independent, which the method described in the BLEFCC References does not take into account. As will be seen in more detail below in one geometric embodiment, axes corresponding to two synonyms si and sj are orthogonal, so that projections of a context vector on synonyms si and sj will in general be uncorrelated. Consequently, this geometric embodiment provides that a context vector that is similar to synonym si may not necessarily be similar to synonym sj.
As with synonymous entries in a bilingual dictionary, similarity with respect to polysemous entries should not be considered independent. For example, the word “bank” in English means either a “financial institution” (having French translation “banque”) or the “ground near a river” (having French translation “berge”). If only the English/French pair bank/banque were to appear in a bilingual dictionary, the method described in the BLEFCC References would consider the English word “river”, which co-occurs with “bank”, similar to the French word “argent” (meaning “money”), which co-occurs with “banque” (meaning “financial institution”).
As the availability of comparable corpora increases (e.g., newspaper articles or patent documents published in different languages), it would be advantageous to provide improved methods for extracting bilingual lexicons from comparable corpora that solve these and other problems. Such improved methods would advantageously optimize the use of existing bilingual dictionaries by, for example, augmenting existing bilingual dictionaries when extracting bilingual lexicons from comparable corpora.
Accordingly, embodiments of a method, system and article of manufacture therefor, for extracting bilingual lexicons from comparable corpora using a bilingual dictionary are disclosed herein. The embodiments consider the vector space formed by all of the source bilingual dictionary entries (i.e., source words present in a bilingual dictionary), as well as the vector space formed by all of the target bilingual dictionary entries.
In accordance with one aspect of the embodiments, one or more embodiments address the coverage problem described above as it applies to rarer words by bootstrapping the bilingual dictionary by iteratively augmenting it with the most probable translations found in the comparable corpora or corpus.
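The bootstrapping idea described above may be sketched, purely as an illustration, as a loop that re-scores candidate pairs after each augmentation of the dictionary. Everything specific in the sketch is an assumption made for this example rather than a detail of the embodiments: the function names `bootstrap_dictionary` and `overlap_score`, the parameters `rounds`, `per_round` and `threshold`, the set-overlap score, and the toy context sets.

```python
def bootstrap_dictionary(seed, source_words, target_words, score,
                         rounds=3, per_round=1, threshold=0.5):
    """Iteratively add the best-scoring untranslated pairs to a copy of `seed`.

    `score(source, target, dictionary)` is a hypothetical callback that returns
    a similarity computed with the dictionary as it currently stands.
    """
    dictionary = dict(seed)
    for _ in range(rounds):
        known_targets = set(dictionary.values())
        scored = [
            (score(s, t, dictionary), s, t)
            for s in source_words if s not in dictionary
            for t in target_words if t not in known_targets
        ]
        scored.sort(reverse=True)
        added = 0
        for value, s, t in scored:
            if added == per_round or value < threshold:
                break
            if s in dictionary or t in dictionary.values():
                continue
            dictionary[s] = t
            added += 1
        if added == 0:  # nothing confident enough was found; stop early
            break
    return dictionary

# Hypothetical toy contexts: the set of words co-occurring with each word.
source_ctx = {"bank": {"money"}, "loan": {"bank", "money"}}
target_ctx = {"banque": {"argent"}, "emprunt": {"banque", "argent"}}

def overlap_score(s, t, dictionary):
    """Share of context words that match once the source context is translated."""
    translated = {dictionary[w] for w in source_ctx[s] if w in dictionary}
    return len(translated & target_ctx[t]) / max(len(source_ctx[s]), len(target_ctx[t]))

seed = {"money": "argent"}
result = bootstrap_dictionary(seed, ["bank", "loan"], ["banque", "emprunt"],
                              overlap_score, rounds=3, per_round=1, threshold=0.6)
```

With the seed dictionary alone, the pair loan/emprunt scores below the threshold; once bank/banque has been added in the first round, the same pair scores higher in the second round and is added in turn.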
In accordance with another aspect of the embodiments, one or more embodiments address the polysemy/synonymy problem described above by recognizing that context vectors of bilingual dictionary entries provide some additional information with respect to synonymy and polysemy. For synonymy, for example, one or more embodiments recognize that it is likely that the synonyms si and sj are similar while at the same time context vectors {right arrow over (si)} and {right arrow over (sj)} are similar. For polysemy, for example, one or more embodiments recognize that if the context vectors {right arrow over (banque)} and {right arrow over (bank)} have few translation pairs in common, it is likely that “banque” and “bank” are used with somewhat different meanings.
In accordance with the embodiments disclosed herein, there is provided a method, apparatus and article of manufacture therefor, for identifying bilingual pairs in comparable corpora using a bilingual dictionary. The method includes: (a) using the comparable corpora to build source context vectors and target context vectors; (b) defining: (i) a source word space with the source context vectors, and (ii) a target word space with the target context vectors; (c) using the bilingual dictionary to project: (i) the source context vectors from the source word space to a source dictionary space, and (ii) the target context vectors from the target word space to a target dictionary space; (d) using the source and target context vectors from the dictionary spaces to identify source/target context vector pairs in a bilingual space; and (e) computing a similarity measure for the source/target context vector pairs identified in the bilingual space to identify a bilingual pair.
In accordance with additional embodiments disclosed herein, there is provided a method, apparatus and article of manufacture therefor, for identifying bilingual pairs in comparable corpora using a bilingual dictionary. The method does not assume the dictionary space is orthogonal and includes: (a) using the comparable corpora for building for each source word v a context vector {right arrow over (v)} and for each target word w a context vector {right arrow over (w)}; (b) computing source and target context vectors {right arrow over (v)}′ and {right arrow over (w)}′ projected into a sub-space formed by source {right arrow over (s)} and target {right arrow over (t)} bilingual dictionary entry context vectors (where the projected source and target context vectors {right arrow over (v)}′ and {right arrow over (w)}′ include words in the bilingual dictionary that are not present in their corresponding context vectors {right arrow over (v)} and {right arrow over (w)}); and (c) computing a similarity measure between pairs of source words v and target words w using their context vectors {right arrow over (v)}′ and {right arrow over (w)}′ projected to a bilingual space to identify bilingual pairs.
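One possible reading of the projection described above may be sketched as follows. The sketch rests on assumptions that are not stated in this summary: each coordinate of a projected vector is taken to be the cosine between the word's context vector and the context vector of one dictionary entry, the final similarity is a cosine between projected vectors, and all words, counts and names (`project`, `pairs`, `source_ctx`, `target_ctx`) are hypothetical.

```python
import math

def cosine(a, b):
    dot = sum(w * b.get(k, 0.0) for k, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def project(vector, entry_vectors):
    """Coordinate i is the similarity to the context vector of dictionary entry i."""
    return {i: cosine(vector, e) for i, e in enumerate(entry_vectors)}

# Hypothetical dictionary pairs; "car"/"automobile" and "voiture"/"auto" act as synonyms.
pairs = [("car", "voiture"), ("automobile", "auto")]
source_ctx = {
    "car": {"road": 2, "engine": 1},
    "automobile": {"road": 1, "engine": 2},
    "driver": {"car": 1, "road": 1},        # never co-occurs with "automobile"
}
target_ctx = {
    "voiture": {"route": 2, "moteur": 1},
    "auto": {"route": 1, "moteur": 2},
    "conducteur": {"auto": 1, "route": 1},  # never co-occurs with "voiture"
    "banquier": {"banque": 1, "argent": 1},
}

# Orthogonal-axes baseline: translate the target context word by word.
fr_to_en = {t: s for s, t in pairs}
translated = {fr_to_en[k]: w for k, w in target_ctx["conducteur"].items() if k in fr_to_en}
standard = cosine(source_ctx["driver"], translated)

# Projection onto the dictionary entries' own context vectors.
source_entries = [source_ctx[s] for s, _ in pairs]
target_entries = [target_ctx[t] for _, t in pairs]
v_proj = project(source_ctx["driver"], source_entries)
projected = {
    w: cosine(v_proj, project(target_ctx[w], target_entries))
    for w in ("conducteur", "banquier")
}
```

In this toy setting the baseline similarity between “driver” and “conducteur” is zero, because “car” and “auto” fall on different dictionary axes, whereas the projected vectors agree and the unrelated word “banquier” still scores zero. The projected vector for “driver” also has a non-zero coordinate on “automobile”, a dictionary word absent from its original context vector.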
In accordance with yet additional embodiments disclosed herein, there is provided a method, apparatus and article of manufacture therefor, for identifying bilingual pairs in comparable corpora using canonical correlation analysis (CCA) and a bilingual dictionary. The method includes: (a) building context vectors for source and target words and a number of bilingual dictionary entries using the comparable corpora; (b) computing canonical axes associated with the context vectors of bilingual pairs for defining a mapping from the comparable corpora to canonical subspaces defined by the canonical axes; and (c) computing a similarity measure between source and target context vector pairs in a bilingual space defined by their canonical subspaces.
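For illustration, canonical axes of the kind referred to above may be computed with a textbook regularized CCA, as in the following minimal sketch. It assumes that NumPy is available and that row i of `X` and row i of `Y` hold the context vectors of the two sides of dictionary pair i; here both matrices are synthetic stand-ins generated at random. The function name `fit_cca`, the ridge term `reg` and the number of axes `k` are assumptions of this example, not details of the embodiments.

```python
import numpy as np

def fit_cca(X, Y, k, reg=1e-3):
    """Regularized CCA on paired rows; returns means, projections and correlations."""
    mx, my = X.mean(axis=0), Y.mean(axis=0)
    Xc, Yc = X - mx, Y - my
    n = X.shape[0]
    Cxx = Xc.T @ Xc / n + reg * np.eye(X.shape[1])
    Cyy = Yc.T @ Yc / n + reg * np.eye(Y.shape[1])
    Cxy = Xc.T @ Yc / n
    # Whiten with Cholesky factors: Cxx = Lx Lx^T, Cyy = Ly Ly^T.
    Lx = np.linalg.cholesky(Cxx)
    Ly = np.linalg.cholesky(Cyy)
    # M = Lx^{-1} Cxy Ly^{-T}; its singular values are the canonical correlations.
    M = np.linalg.solve(Ly, np.linalg.solve(Lx, Cxy).T).T
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    A = np.linalg.solve(Lx.T, U[:, :k])   # source-side canonical axes
    B = np.linalg.solve(Ly.T, Vt[:k].T)   # target-side canonical axes
    return mx, my, A, B, s[:k]

def cosine(a, b):
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom else 0.0

# Synthetic stand-ins for the context vectors of 200 dictionary pairs (assumption).
rng = np.random.default_rng(0)
R = rng.normal(size=(4, 3))                       # hidden source-to-target relation
X = rng.normal(size=(200, 4))                     # source-side vectors
Y = X @ R + 0.01 * rng.normal(size=(200, 3))      # target-side vectors
mx, my, A, B, corr = fit_cca(X, Y, k=3)

# Compare an unseen source vector with its true counterpart and with a distractor.
v = rng.normal(size=4)
w_true = v @ R
w_other = rng.normal(size=4) @ R
sim_true = cosine((v - mx) @ A, (w_true - my) @ B)
sim_other = cosine((v - mx) @ A, (w_other - my) @ B)
```

Candidate translations would then be ranked by their similarity in the canonical subspace. The Cholesky-plus-SVD formulation is one standard way to solve CCA; the ridge term keeps both covariance matrices invertible when the dictionary is small.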
In accordance with yet further additional embodiments disclosed herein, there is provided a method, apparatus and article of manufacture therefor, for identifying bilingual pairs in comparable corpora using latent semantic analysis (LSA) and a bilingual dictionary. The method includes: (a) building context vectors for source and target words and a number of bilingual dictionary entries using the comparable corpora; (b) computing non-linear latent semantic projections of the context vectors of bilingual pairs for defining a mapping from the comparable corpora to latent semantic subspaces associated with the projections; and (c) computing a similarity measure between source and target context vector pairs in a bilingual space defined by their latent semantic subspaces.
In accordance with the various embodiments, the bilingual lexicons extracted from comparable corpora using the embodiments described herein may be used, for example, in applications such as cross-language information retrieval, cross-language categorization, bilingual resource development, trans-lingual inference for resource development, and data mining tools that track and translate new terms. Advantageously, such applications would not rely solely on the use of parallel corpora, which involves accepting a translation bias inherent to parallel corpora.