Modern machine translation systems use word-to-word and phrase-to-phrase probabilistic channel models as well as probabilistic n-gram language models.
A conventional way of translating using machine translation is illustrated in FIG. 1. FIG. 1 uses Chinese and English as the language pair, but it should be understood that any other language pair may alternatively be used.
Training is shown as 150, where a training corpus 153 is used. The corpus includes an English string 151 and a Chinese string 152. An existing technique may be used to align the words in the training corpus at a word level. The aligned words are input to a training module 155, which is used to form probabilities 165 based on the training corpus. A decoding module 167 is used that finds argmax_e P(e)*P(f|e), that is, the English string e that maximizes the product of the language model probability P(e) and the channel model probability P(f|e), where e and f are words or phrases in the respective languages of the training corpus. By Bayes' rule, this is the same e that maximizes P(e|f), the probability of e given f. The decoding module 167 may simply be a module within the same unit as the training module. The decoder thus takes a new Chinese string such as 160, and uses the probabilities 165 along with a language model 161, which may be an n-gram language model. The decoder outputs the English strings that receive the highest scores based on the probabilities and the language model.
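The decoding step described above can be sketched in a few lines of Python. This is a minimal illustration only, not the decoder of FIG. 1: it assumes a monotone, word-for-word translation with no reordering, and it enumerates every candidate exhaustively. The tables CHANNEL and BIGRAM, the FLOOR constant, the romanized source tokens, and all probability values are hypothetical and invented for the example.

```python
import itertools
import math

# Hypothetical channel model: CHANNEL[(f, e)] stands in for P(f | e).
# All numbers are invented for illustration.
CHANNEL = {
    ("wo", "I"): 0.5,
    ("wo", "me"): 0.5,
    ("he", "drink"): 0.8,
    ("he", "and"): 0.9,
    ("cha", "tea"): 0.9,
}

# Hypothetical bigram language model: BIGRAM[(prev, word)] stands in for
# P(word | prev), with "<s>" marking the start of the sentence.
BIGRAM = {
    ("<s>", "I"): 0.4,
    ("<s>", "me"): 0.05,
    ("I", "drink"): 0.2,
    ("I", "and"): 0.01,
    ("me", "drink"): 0.01,
    ("me", "and"): 0.05,
    ("drink", "tea"): 0.3,
    ("and", "tea"): 0.02,
}

FLOOR = 1e-6  # assumed probability for events missing from either table


def lm_logprob(words):
    """Return log P(e) under the toy bigram language model."""
    prev = "<s>"
    total = 0.0
    for word in words:
        total += math.log(BIGRAM.get((prev, word), FLOOR))
        prev = word
    return total


def channel_logprob(source, target):
    """Return log P(f | e), assuming a one-to-one monotone word alignment."""
    return sum(
        math.log(CHANNEL.get((f, e), FLOOR)) for f, e in zip(source, target)
    )


def decode(source):
    """Return the candidate e that maximizes log P(e) + log P(f | e)."""
    # For each source word, collect the target words the channel table knows;
    # an unknown source word is passed through unchanged.
    options = [
        [e for (f, e) in CHANNEL if f == src] or [src] for src in source
    ]
    best = max(
        itertools.product(*options),
        key=lambda cand: lm_logprob(cand) + channel_logprob(source, cand),
    )
    return list(best)


if __name__ == "__main__":
    print(decode(["wo", "he", "cha"]))
```

With these toy numbers the channel model alone slightly prefers "and" as the translation of the second word (0.9 against 0.8), but the language model assigns "I drink tea" a far higher probability than "I and tea", so the combined score selects "I drink tea". A practical decoder would search the candidate space with pruning rather than enumerating it.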
Phrase-based systems may sometimes yield the most accurate translations. However, these systems are often too weak to encourage long-distance constituent reordering when translating source sentences into a target language, and they do not control for globally grammatical output.
Other systems may attempt to solve these problems using syntax. For example, certain reorderings can be carried out for certain language pairs. One study has shown that many common translation patterns fall outside the scope of the child-reordering model of Yamada & Knight, even for similar language pairs such as English/French. This led to different possible alternatives. One suggestion was to abandon syntax on the grounds that syntax was a poor fit for the data. Another possibility was to maintain valid English syntax while investigating alternative transformation models.