The present invention deals with machine translation. More specifically, the present invention deals with a machine translation system that uses syntactic dependency treelets.
Machine translation involves the process of receiving an input text fragment in a source language and translating it, automatically through the use of a computing device, to a corresponding text fragment in a target language. Machine translation has typically been attempted using one of two different approaches. The first is a knowledge engineered approach, typically using a linguistic parser and hand-crafted transfer rules. Almost all commercial translation systems (such as Systran) are of this type. The second is a corpus motivated approach, typically either example-based machine translation (EBMT) or statistical machine translation (SMT). However, SMT appears more promising in current research, so this discussion will focus primarily on SMT and not EBMT. Typically the transfer-based systems incorporate linguistic information using a parser, and the SMT systems do not. Both approaches have strengths and weaknesses.
SMT systems perform well in learning translations of domain-specific terminology and fixed phrases, but simple grammatical generalizations are poorly captured and often confused during the translation process. Transfer-based systems, by contrast, often succeed in producing grammatical and fluent translations, but are highly time consuming to develop. Also, they often fail in exactly the area where SMT succeeds: domain-specificity.
Attempts have also been made to combine different aspects of the two types of machine translation systems into a single, hybrid system. However, these attempts have still suffered from disadvantages. Let us briefly survey the state-of-the-art in SMT as well as some prior art attempts to combine syntax and SMT.
Statistical machine translation initially attempted to model translation as a series of separate translation decisions, one for each word. However, the sheer computational complexity of the problem was a difficult obstacle to overcome, and it proved difficult to capture local context in a word-to-word statistical model. Thus the resulting systems were often rather slow and produced only moderate quality translations. Recently, however, statistical machine translation has shown new promise with the incorporation of techniques for performing phrasal translations. Instead of attempting to model the translation of each word independently, phrasal statistical machine translation attempts to model how chunks of words translate together. This captures an important intuition of foreign language learning—that is, small idioms and common phrases are both idiosyncratic and important for both fluency and fidelity.
Current phrasal statistical machine translation systems are conceptually simple. Beginning with a word alignment, all contiguous source and target word sequences (contiguous on the surface strings) are gathered as possible phrase translation pairs or alignment templates. These pairs are collected into a single translation repository. Then, a translation probability is associated with each distinct pair by using a maximum likelihood estimation model such as that set out in Vogel et al., THE CMU STATISTICAL MACHINE TRANSLATION SYSTEM, Proceedings of the MT Summit (2003). Other probability models can be used as well. The specific translation model set out in Vogel is used in combination with at least a target language model to form a classic noisy channel model. The best scoring translation is found by a simple search: a monotone decoder assumes that source phrase order is preserved and uses Viterbi decoding to find the best path through the translation lattice. In some systems, a small amount of phrase reordering is allowed where the phrasal movement is modeled in terms of offsets.
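The collection and estimation steps described above can be sketched as follows. This is a minimal illustration and not the method of any cited system: it assumes the common convention that a phrase pair is kept only when no alignment link crosses its boundary, it keeps only the tightest target span for each source span, and it uses relative frequency as the maximum likelihood estimate. The function names, the `max_len` limit and the toy sentence pairs are hypothetical.

```python
from collections import Counter


def extract_phrase_pairs(src, tgt, alignment, max_len=3):
    """Gather contiguous (source, target) phrase pairs from one aligned sentence pair.

    alignment is a set of (i, j) links between src[i] and tgt[j].
    Assumption: a pair is kept only if no link leaves the pair's boundary.
    """
    pairs = []
    for i1 in range(len(src)):
        for i2 in range(i1, min(i1 + max_len, len(src))):
            # Target positions linked to the source span [i1, i2].
            linked = [j for (i, j) in alignment if i1 <= i <= i2]
            if not linked:
                continue
            j1, j2 = min(linked), max(linked)
            if j2 - j1 + 1 > max_len:
                continue
            # Reject the pair if a target word in [j1, j2] links outside [i1, i2].
            if any(j1 <= j <= j2 and not (i1 <= i <= i2) for (i, j) in alignment):
                continue
            pairs.append((" ".join(src[i1:i2 + 1]), " ".join(tgt[j1:j2 + 1])))
    return pairs


def estimate_translation_probs(phrase_pairs):
    """Relative-frequency estimate p(target | source) over the collected pairs."""
    pair_counts = Counter(phrase_pairs)
    src_counts = Counter(s for s, _ in phrase_pairs)
    return {(s, t): c / src_counts[s] for (s, t), c in pair_counts.items()}


# Toy word-aligned corpus (hypothetical data, for illustration only).
corpus = [
    (["the", "red", "house"], ["la", "maison", "rouge"], {(0, 0), (1, 2), (2, 1)}),
    (["the", "book"], ["le", "livre"], {(0, 0), (1, 1)}),
]

repository = []
for src, tgt, alignment in corpus:
    repository.extend(extract_phrase_pairs(src, tgt, alignment))

probs = estimate_translation_probs(repository)
print(probs[("the", "la")])  # 0.5: "the" was seen once with "la" and once with "le"
```

In this sketch "the red" yields no pair, because its linked target words "la" and "rouge" surround "maison", which is aligned to a word outside the source span.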
While this type of system is an improvement over other types of systems where no reordering is allowed, the reordering model used in this type of system is limited in terms of linguistic generalizations. For instance, when translating English to Japanese, the English subject-verb-object clauses generally become Japanese subject-object-verb clauses, and English post-modifying prepositional phrases become Japanese pre-modifying postpositional phrases. While the phrasal reordering model above might learn that reorderings are more common in English-Japanese than in English-French, it does not learn that the subject is likely to stay in place while the object is likely to move before the verb; nor does it learn any generalization regarding prepositional/postpositional phrase movement. Instead, a phrase-based decoder in accordance with the prior art acts at the mercy of rote-memorized phrases and a target language model bias towards fluency, not necessarily accuracy.
In addition, as mentioned above, prior art phrasal statistical machine translation systems are currently limited to phrases that are contiguous. By this it is meant that the phrases are contiguous in both the source and target surface strings. This limitation means that even something as simple as “not”→“ne . . . pas” cannot be learned. Using extremely large data sets for training can partially compensate for this, by simply memorizing a wide variety of possibilities. However, less common discontiguous “phrases” will be nearly impossible to learn, given practical limits on the size of the training data set.
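The contiguity limitation can be illustrated with a short sketch. The sentence pair, the word alignment and the function name below are assumptions chosen for illustration; only the “not”→“ne . . . pas” correspondence comes from the discussion above.

```python
def contiguous_translation(src_index, tgt, alignment):
    """Return the target words linked to src_index if they form one contiguous run, else None."""
    linked = sorted(j for (i, j) in alignment if i == src_index)
    if not linked:
        return None
    is_contiguous = linked == list(range(linked[0], linked[-1] + 1))
    return [tgt[j] for j in linked] if is_contiguous else None


# Hypothetical aligned sentence pair; "do" is left unaligned.
src = ["I", "do", "not", "speak"]
tgt = ["je", "ne", "parle", "pas"]
alignment = {(0, 0), (2, 1), (2, 3), (3, 2)}

print(contiguous_translation(3, tgt, alignment))  # ['parle']
print(contiguous_translation(2, tgt, alignment))  # None: "ne" and "pas" are split by "parle"
```

A contiguous-phrase system can therefore capture “ne . . . pas” only inside a longer memorized phrase that also fixes the intervening verb.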
For these reasons, and others, some researchers have attempted to incorporate syntactic information into statistical machine translation processes. One very simple method of doing this is by reranking. In other words, a baseline SMT system is used to produce an N-best list of translations, and then a group of models, possibly including syntactic models, is used to rerank the output. One such system is described in Och et al., A SMORGASBORD OF FEATURES FOR STATISTICAL MACHINE TRANSLATION, Proceedings of the Joint HLT/NAACL Conference (2004). This has proven to be a rather tenuous means of introducing syntactic information, for two reasons. First, an N-best list of even 16,000 translations captures only a very small fraction of the translation possibilities for a 20 word sentence. Second, post-facto reranking provides the syntactic model no opportunity to boost or prune large sections of that search space within the baseline decoder.
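A reranking pass of this kind can be sketched as below. The sketch is a simplified assumption and not the feature set or training procedure of the cited system: the hypotheses, scores, weight and the toy "syntactic" feature are all hypothetical. It also works through the coverage arithmetic under a deliberately conservative assumption of only two candidate translations per word and no reordering.

```python
def rerank(nbest, feature_fns, weights):
    """Re-sort (hypothesis, baseline_score) pairs by baseline score plus weighted extra features."""
    def total(item):
        hyp, baseline_score = item
        return baseline_score + sum(w * f(hyp) for f, w in zip(feature_fns, weights))
    return sorted(nbest, key=total, reverse=True)


def starts_with_determiner(hyp):
    """Toy stand-in for a syntactic model: reward hypotheses that begin with 'the'."""
    return 1.0 if hyp.split()[0] == "the" else 0.0


# Hypothetical 2-best list with baseline log-scores (higher is better).
nbest = [("house the red", -1.0), ("the red house", -1.2)]
reranked = rerank(nbest, [starts_with_determiner], [0.5])
best = reranked[0][0]
print(best)  # the red house

# Coverage arithmetic: 20 words, 2 candidate translations each, monotone order.
candidates = 2 ** 20            # 1,048,576 possible translations
coverage = 16_000 / candidates  # fraction of that space held by a 16,000-best list
print(f"{coverage:.2%}")        # 1.53%
```

The reranker can only choose among hypotheses the baseline decoder already produced; a translation absent from the list cannot be recovered, whatever the syntactic model prefers.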
Inversion transduction grammars (ITGs) are used in another prior art attempt to incorporate a notion of constituency into statistical machine translation. The basic idea is to consider alignment and translation as simultaneous parses of the source and target language. Two types of binary branching rules are allowed. Either the source and target constituents are produced in the same order, or the source and target constituents are produced in reverse order. Some such systems are described in Wu, STOCHASTIC INVERSION TRANSDUCTION GRAMMARS AND BILINGUAL PARSING OF PARALLEL CORPORA, Computational Linguistics, 23(3):377-403 (1997); Wu and Wong, MACHINE TRANSLATION WITH A STOCHASTIC GRAMMATICAL CHANNEL, Proceedings of the ACL (1998); Zens and Ney, A COMPARATIVE STUDY ON REORDERING CONSTRAINTS IN STATISTICAL MACHINE TRANSLATION, Proceedings of the ACL (2003); and Zens et al., REORDERING CONSTRAINTS FOR PHRASE-BASED STATISTICAL MACHINE TRANSLATION, Proceedings of COLING (2004). These grammars are theoretically interesting. However, in order to make these types of processes computationally efficient, a number of severely limiting simplifying assumptions must be made. This significantly reduces the modeling power of such systems. In addition, this type of translation model acts only at the level of a single lexical item at a time (i.e., at the word level) and phrasal combinations are not modeled directly. This is a rather severe limitation. The demonstrated translation quality of these systems has not been on par with the best SMT systems.
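The two binary branching rules can be illustrated by enumerating the target orders they permit. The sketch below covers only the reordering constraint; it does not model the stochastic grammar, the lexical translations or the bilingual parsing of the cited work, and the function name is hypothetical.

```python
from functools import lru_cache
from itertools import permutations


def itg_orders(n):
    """Target orders of n source items reachable with straight and inverted binary rules."""
    @lru_cache(maxsize=None)
    def build(lo, hi):
        # All reachable orders for the source span [lo, hi).
        if hi - lo == 1:
            return frozenset({(lo,)})
        out = set()
        for mid in range(lo + 1, hi):
            for left in build(lo, mid):
                for right in build(mid, hi):
                    out.add(left + right)  # straight rule: same order
                    out.add(right + left)  # inverted rule: reverse order
        return frozenset(out)
    return set(build(0, n))


print(len(itg_orders(3)))  # 6: every order of three items is reachable
print(len(itg_orders(4)))  # 22 of the 24 orders of four items
print(sorted(set(permutations(range(4))) - itg_orders(4)))
# [(1, 3, 0, 2), (2, 0, 3, 1)]
```

The two unreachable orders of four items cannot be split into two adjacent source spans that remain adjacent blocks on the target side, which is the structural restriction the binary rules impose.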
A more recent theoretical approach has been presented using multi-text grammars and generalized multi-text grammars. It attempts to generalize the inversion transduction grammar approach by allowing non-contiguous translations and loosening the reordering constraints. While this theory has been proposed, no details are presented on parameter estimation, there is no description of how decoding in this framework is to incorporate phrasal information, no actual system has been built, and no translation quality numbers have been presented. This theory is described in greater detail in Melamed and Wang, STATISTICAL MACHINE TRANSLATION BY PARSING, Technical Report 04-024 Proteus Project (2004).
Another prior art approach related to inversion transduction grammars uses head transducers to produce a translation by simultaneously parsing the source sentence and transducing a target dependency tree using a collection of transducers that apply independently to each level of a source dependency tree. These transducers are limited in scope. They rely only on very local context, such that the end result is a fundamentally word-based (as opposed to phrase-based) decoder. The transducer induction process is also likely complicated by data sparsity problems. Instead of factoring the translation modeling into several different components (such as lexical selection, ordering, etc.), only a single transducer is trained. One such system is set out in Alshawi et al., LEARNING DEPENDENCY TRANSLATION MODELS AS COLLECTIONS OF FINITE-STATE HEAD TRANSDUCERS, Computational Linguistics, 26(1):45-60 (2000).
A tangential line of research has formed at the confluence of dependency transducers and multi-text grammars. This line of research deals with synchronous dependency insertion grammars and is described in more detail in Ding and Palmer, SYNCHRONOUS DEPENDENCY INSERTION GRAMMARS: A GRAMMAR FORMALISM FOR SYNTAX BASED STATISTICAL MT, In COLING 2004: Workshop on Recent Advances in Dependency Grammars (2004).
In yet another prior art attempt, in order to address the problems with fluency in an SMT system, a parser has been employed in the target language. By applying a parser to the training data, one can learn probabilities for a set of operations to convert a target language tree to a source language string. These operations can be combined with a tree-based language model to produce a noisy channel translation search. One such system is set out in Yamada and Knight, A SYNTAX-BASED STATISTICAL TRANSLATION MODEL, Proceedings of the ACL (2001). This type of system does have some positive impact on fluency, but does not improve overall translation quality as compared to a non-syntactic SMT system.
Another prior art approach for employing dependency information in translation is by translating via paths in the dependency tree. One such system is described in Lin, A PATH-BASED TRANSFER MODEL FOR MACHINE TRANSLATION, Proceedings of COLING (2004). This is believed to be the only prior art system to apply a separate dependency parser to the source sentence before attempting translation. While this type of system does appear to incorporate larger memorized patterns (like phrasal SMT) in combination with a dependency analysis, the statistical modeling in the system is extremely limited. Only a direct maximum likelihood estimation translation model is used. The decoding process thus does not balance fidelity against fluency using, for example, a target language model, nor does it benefit from the host of other statistical models that give SMT systems their power. The paths are combined in an arbitrary order. Finally, the restriction imposed by this approach that the “phrases” extracted from the dependency trees be linear paths is quite detrimental. Not only does it lose promising treelet translations in a non-linear branching configuration, but it also cannot model certain common phrases that are contiguous in the surface string but non-linear in the dependency tree. Thus, while the resulting translations seem to benefit somewhat from the use of dependency paths, the overall approach does not come close to the translation quality of a phrasal SMT decoder.
From the above discussion, it can be seen that the vast majority of syntactic statistical machine translation approaches have focused on word-to-word translation, instead of phrasal statistical machine translation, and have treated parsing and translation as a joint problem rather than employing a separate parser prior to translation. The one approach that uses a separate parser is very limited in scope, combines paths in an arbitrary order, and has not employed a combination of statistical models, which severely limits possible translation quality.