In most machine translation systems, a linguist assists in the writing of a series of rules which relate to the grammar of the source language (the language to be translated from) and the target language (the language to be translated to) and transfer rules for transferring data corresponding to the source text into data corresponding to the target text. In the classical “transfer” architecture, the source grammar rules are first applied to remove the syntactic dependence of the source language and arrive at something closer to the semantics (the meaning) of the text, which is then transferred to the target language, at which point the grammar rules of the target language are applied to generate syntactically correct target language text.
However, hand-crafting rules for such systems is expensive, time-consuming and error-prone. One approach to reducing these problems is to take examples of source language texts and their translations into target languages, and to attempt to extract suitable rules from them. In one such approach, the source and target language example texts are manually marked up to indicate correspondences.
Prior work in this field is described in, for example, Brown P F, Cocke J, della Pietra S A, della Pietra V J, Jelinek F, Lafferty J D, Mercer R L and Roossin P S 1990, ‘A Statistical Approach to Machine Translation’, Computational Linguistics, 16 2 pp. 79-85; Berger A, Brown P, della Pietra S A, della Pietra V J, Gillett J, Lafferty J, Mercer R, Printz H and Ures L 1994, ‘Candide System for Machine Translation’, in Human Language Technology: Proceedings of the ARPA Workshop on Speech and Natural Language; Sato S and Nagao M 1990, ‘Towards Memory-based Translation.’, in COLING '90; Sato S 1995, ‘MBT2: A Method for Combining Fragments of Examples in Example-based Translation’, Artificial Intelligence, 75 1 pp. 31-49; Güvenir H A and Cicekli I 1998, ‘Learning Translation Templates from Examples’, Information Systems, 23 6 pp. 353-363; Watanabe H 1995, ‘A Model of a Bi-Directional Transfer Mechanism Using Rule Combinations’, Machine Translation, 10 4 pp. 269-291; Al-Adhaileh M H and Kong T E, ‘A Flexible Example-based Parser based on the SSTC’, in Proceedings of COLING-ACL '98, pp. 687-693.
Sato and Nagao developed a system which represents the source and target texts as planar dependency trees. A dependency tree is a particular type of dependency graph. In a dependency graph, the words of the text correspond to nodes, and a word which depends on another (i.e. modifies the meaning of, or is in some relationship with, another) is linked to it by a directional dependency relationship. A dependency graph is a tree if each node, other than one unique “root” node, depends on precisely one other node (although one node may have several nodes depending from it; in other words, it may dominate several others). A planar tree is a tree in which, when the words are arranged in their original sequence, the “projection constraint” is satisfied: every word within the span of each node is dominated by that node, so that, graphically, no dependency line crosses another. Planar trees are particularly efficient to process computationally, and it is therefore advantageous to use them.
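The definitions of a dependency tree and of the projection constraint can be illustrated with a minimal sketch. It is not taken from Sato and Nagao's system; the representation (a list `heads` in which `heads[i]` is the position of the word on which word `i` depends, or `None` for the root) and the function names `is_tree`, `dominates` and `is_planar` are assumptions made purely for illustration. The planarity test uses one common formulation of the projection constraint: for every dependency link, each word lying between the two linked words must be dominated by the head of that link.

```python
from typing import List, Optional

# Hypothetical representation: heads[i] is the index of the word that word i
# depends on, or None if word i is the root.
Heads = List[Optional[int]]


def is_tree(heads: Heads) -> bool:
    """True if there is exactly one root and every word reaches it without a cycle."""
    roots = [i for i, h in enumerate(heads) if h is None]
    if len(roots) != 1:
        return False
    for start in range(len(heads)):
        seen = set()
        node = start
        while heads[node] is not None:
            if node in seen:  # revisited a node, so the links form a cycle
                return False
            seen.add(node)
            node = heads[node]
    return True


def dominates(heads: Heads, ancestor: int, node: Optional[int]) -> bool:
    """True if `ancestor` is `node` itself or lies on the path from `node` to the root."""
    while node is not None:
        if node == ancestor:
            return True
        node = heads[node]
    return False


def is_planar(heads: Heads) -> bool:
    """True if the graph is a tree satisfying the projection constraint."""
    if not is_tree(heads):
        return False
    for dep, head in enumerate(heads):
        if head is None:
            continue
        lo, hi = sorted((dep, head))
        # Every word strictly between the two linked words must be dominated
        # by the head of the link; otherwise a dependency line is crossed.
        for between in range(lo + 1, hi):
            if not dominates(heads, head, between):
                return False
    return True


# "the big dog barked": "the" and "big" depend on "dog", "dog" on "barked".
planar_example = [2, 2, 3, None]
# Links 0-2 and 1-3 cross, so the projection constraint fails.
crossing_example = [2, 3, None, 2]

print(is_planar(planar_example))
print(is_planar(crossing_example))
```

In the second example, word 2 is the root, yet it lies between words 1 and 3, which are linked to each other; it is therefore not dominated by word 3, and the tree is not planar.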
However, planar trees are only readily useful when the words which depend on each other in the source and target texts are contiguous; in other words, when contiguous sequences of words in the source text are translated by contiguous sequences of words in the target text. At sentence level, this is likely to be true, but it would be desirable to reduce the maximum size of the translation units (i.e. parts of the sentences, such as phrases) which could be translated, since shorter phrases are more generally applicable, and hence allow more translation coverage from a smaller number of examples. Different approaches to this problem have been taken in the prior art. Because of this problem, it has not been possible simply to align source and target language phrases by storing connection data connecting the head words of the phrases.
Sato's MBT2 method analyses both the source and target texts as simple planar trees, and uses simple tree alignment to express the relationship between the trees in the source and target languages. The trees found in the examples are generalised by allowing some specific transforms, such as adding and deleting nodes; and the translations produced by Sato's system are ranked using a measure of similarity with existing translations.