Various embodiments of the present invention relate to learning relationships among words. More specifically, various embodiments relate to a statistical approach for learning translation relationships among words in different languages.
Machine translation systems receive a textual input in one language, translate it to a second language, and provide a textual output in the second language. In doing this, such systems typically use a translation lexicon to obtain correspondences, or translation relationships, between content words; these correspondences are learned during training.
A common approach to deriving translation lexicons from empirical data involves choosing a measure of a degree of association between words in a first language, L1, and words in a second language, L2, in aligned sentences of a parallel bilingual corpus. Word pairs (consisting of a word from L1 and a word from L2) are then ordered by rank according to the measure of association chosen. A threshold is chosen and the translation lexicon is formed of all pairs of words whose degree of association is above the threshold.
For example, in one prior art approach, the similarity metric (the measure of degree of association between words) is based on how often words co-occur in corresponding regions (e.g., sentences) of an aligned parallel text corpus. The association scores for the different pairs of words are computed and those word pairs are sorted in descending order of their association score. Again, a threshold is chosen and the word pairs whose association score exceeds the threshold become entries in the translation lexicon.
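The ranking-and-threshold procedure described above can be sketched as follows. This is only an illustration under stated assumptions: the Dice coefficient stands in for the unspecified association measure, and the function name `build_lexicon`, the toy corpus, and the threshold value are hypothetical and are not drawn from any particular prior art system.

```python
from collections import Counter
from itertools import product


def build_lexicon(aligned_pairs, threshold):
    """Rank word pairs by a co-occurrence score and keep those above a threshold.

    The Dice coefficient is an assumed stand-in for the association measure.
    """
    src_count = Counter()   # number of aligned segments containing each L1 word
    tgt_count = Counter()   # number of aligned segments containing each L2 word
    pair_count = Counter()  # number of aligned segments containing both words

    for src_sentence, tgt_sentence in aligned_pairs:
        src_words = set(src_sentence.split())
        tgt_words = set(tgt_sentence.split())
        src_count.update(src_words)
        tgt_count.update(tgt_words)
        pair_count.update(product(src_words, tgt_words))

    # Each score is computed independently of every other score.
    scores = {
        (v, w): 2 * c / (src_count[v] + tgt_count[w])
        for (v, w), c in pair_count.items()
    }

    # Sort in descending order of score, breaking ties alphabetically.
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return [(pair, score) for pair, score in ranked if score > threshold]


# Hypothetical three-sentence aligned corpus.
corpus = [
    ("le chat dort", "the cat sleeps"),
    ("le chien dort", "the dog sleeps"),
    ("le chat mange", "the cat eats"),
]
lexicon = build_lexicon(corpus, threshold=0.9)
for pair, score in lexicon:
    print(pair, round(score, 2))
```

On this toy corpus only the five correct pairs exceed the threshold. Because every score is computed in isolation, the same procedure has no way to recognize a pair whose score is high only because of a second, stronger association.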
This type of method, however, has disadvantages. One problem is that the association scores are typically computed independently of one another. For example, assume the words in language L1 are represented by the symbol Vk, where k is an integer representing different words in L1; and words in language L2 are represented by Wk, where k is an integer representing different words in L2. Thus, sequences of the V's and W's represent two aligned text segments. If Wk and Vk occur in similar bilingual contexts (e.g., in the aligned sentences), then any reasonable similarity metric will produce a high association score between them, reflecting the interdependence of their distributions.
However, assume that Vk and Vk+1 also appear in similar contexts (e.g., in the same sentence). That being the case, there is also a strong interdependence between the distributions of Vk and Vk+1. The resulting problem is that if Wk and Vk appear in similar contexts, and Vk and Vk+1 appear in similar contexts, then Wk and Vk+1 will also appear in similar contexts. This is known as an indirect association because it arises only by virtue of the associations between Wk and Vk and between Vk+1 and Vk. Prior methods that compute association scores independently of each other cannot distinguish between a direct association (e.g., that between Wk and Vk) and an indirect association (e.g., that between Wk and Vk+1). Not surprisingly, this produces translation lexicons replete with indirect associations, which are likely to be incorrect.
As a concrete example of an indirect association, consider a parallel French-English corpus consisting primarily of translated computer software manuals. In this corpus, the English terms “file system” and “system files” occur very often. Similarly, the corresponding French terms “système de fichiers” and “fichiers système” also appear together very often. Because these monolingual collocations are common, the spurious translation pairs fichier/system and système/file also receive rather high association scores. These scores may be higher, in fact, than the scores for many true translation pairs.
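The effect can be reproduced on a very small scale. The sketch below assumes a Dice coefficient as the association measure and uses a hypothetical five-segment corpus of lemmatized words; the function name `dice_scores` and the threshold of 0.7 are illustrative choices only.

```python
from collections import Counter
from itertools import product


def dice_scores(aligned_pairs):
    """Score every co-occurring word pair independently (Dice is an assumption)."""
    src_count, tgt_count, pair_count = Counter(), Counter(), Counter()
    for src_sentence, tgt_sentence in aligned_pairs:
        src_words = set(src_sentence.split())
        tgt_words = set(tgt_sentence.split())
        src_count.update(src_words)
        tgt_count.update(tgt_words)
        pair_count.update(product(src_words, tgt_words))
    return {
        (v, w): 2 * c / (src_count[v] + tgt_count[w])
        for (v, w), c in pair_count.items()
    }


# Hypothetical corpus in which "fichier" and "système" usually appear together.
corpus = [
    ("fichier système", "file system"),
    ("système fichier", "system file"),
    ("fichier système", "file system"),
    ("fichier", "file"),
    ("système", "system"),
]

scores = dice_scores(corpus)
threshold = 0.7  # illustrative cutoff
lexicon = sorted(pair for pair, s in scores.items() if s > threshold)

for pair in lexicon:
    print(pair, scores[pair])
```

Here the direct pairs fichier/file and système/system score 1.0, while the indirect pairs fichier/system and système/file score 0.75 and still pass the cutoff, so all four pairs enter the lexicon.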
This deficiency has been addressed by some prior techniques. For example, Melamed, Automatic Construction of Clean Broad-Coverage Translation Lexicons, Second Conference of the Association for Machine Translation in the Americas (AMTA 1996), Montreal, Canada, is directed to this problem.
Melamed addresses this problem by disregarding highly associated word pairs as translations if they are derived from aligned sentences in which there are even more highly associated pairs involving one or both of the same words. In other words, it is assumed that stronger associations are also more reliable, and thus that direct associations are stronger than indirect associations. Therefore, if a segment (or sentence) containing V is aligned with a segment (or sentence) containing both W and W′, the entries (V,W) and (V,W′) should not both appear in the translation lexicon. If they do, then at least one is likely incorrect. Since direct associations are assumed to tend to be stronger than indirect associations, the entry with the highest association score is the one chosen as the correct association.
In the example discussed above, in parallel English and French sentences containing “fichier” and “système” on the French side and “file” and “system” on the English side, the associations of fichier/system and système/file will be discounted, because the degree of association for “fichier/file” and “système/system” will likely be much higher in the same aligned sentences.
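The discounting idea described above can be illustrated with a simplified filter. This is not Melamed's algorithm, only a minimal sketch of the stated principle: within one aligned sentence pair, a candidate pair is dropped when a rival pair sharing one of its words has a strictly higher score. The function name `filter_indirect` and the score values are hypothetical.

```python
def filter_indirect(aligned_pairs, scores):
    """Keep a pair only if it is unbeaten by a rival in at least one sentence pair.

    A rival is another pair in the same aligned sentences that shares the
    L1 word or the L2 word. Simplified illustration, not a published method.
    """
    kept = set()
    for src_sentence, tgt_sentence in aligned_pairs:
        src_words = set(src_sentence.split())
        tgt_words = set(tgt_sentence.split())
        for v in src_words:
            for w in tgt_words:
                s = scores.get((v, w), 0.0)
                # Scores of pairs that compete for the same L1 or L2 word.
                rivals = [scores.get((v, w2), 0.0) for w2 in tgt_words if w2 != w]
                rivals += [scores.get((v2, w), 0.0) for v2 in src_words if v2 != v]
                if all(s >= r for r in rivals):
                    kept.add((v, w))
    return kept


# Hypothetical association scores for the four candidate pairs.
scores = {
    ("fichier", "file"): 1.0,
    ("système", "system"): 1.0,
    ("fichier", "system"): 0.75,
    ("système", "file"): 0.75,
}
corpus = [("fichier système", "file system")]

lexicon = filter_indirect(corpus, scores)
print(sorted(lexicon))
```

With these assumed scores, fichier/system loses to fichier/file and système/file loses to système/system, so only the two direct associations remain. Even this reduced version must compare every candidate against all of its rivals in every aligned sentence pair, which gives some sense of the added cost.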
While this approach is reported to extend high-accuracy output to much higher coverage levels than previously reported, it does have disadvantages. For example, it is quite complex and cumbersome to implement, and it is believed to be quite time-consuming to run.
Another difficulty encountered in learning translation relationships among words involves compounds (or multi-word sequences which are taken together to form compounds). Such compounds may translate to a single word in the other language, or to multiple words in the other language. Prior techniques assumed that lexical translation relationships involved only single words. Of course, as shown by the following list of compounds, this is manifestly untrue:
Base_de_donnees/database
Mot_de_passe/password
Sauvegarder/back_up
Annuler/roll_back
Ouvrir_session/log_on
In the first four pairs listed above, a compound in one language is translated as a single word in the other language. However, in the last example, a compound in one language is translated as a compound in the other language, and each of the individual components of the compound cannot be translated in any meaningful way into one of the individual components in the other compound. For example, “ouvrir”, which is typically translated as “open”, cannot be reasonably translated as either “log” or “on”. Similarly, “session”, which is typically translated as “session”, also cannot be reasonably translated as either “log” or “on”.
One prior attempt to address this problem is also discussed by Melamed, Automatic Discovery of Non-Compositional Compounds in Parallel Data, Conference on Empirical Methods in Natural Language Processing (EMNLP 97), Providence, R.I. (1997). Melamed induces two translation models: a trial translation model that involves a candidate compound and a base translation model that does not. If the value of Melamed's objective function is higher in the trial model than in the base model, then the compound is deemed valid. Otherwise, the candidate compound is deemed invalid. However, the method Melamed uses to select potential compounds is quite complex and computationally expensive, as is his method of verification by construction of a trial translation model.