Statistical machine translation automatically learns how to translate from a training corpus. The information learned during training can then be used to translate another, “unknown” text.
However, current statistical machine translation models are typically not suited for certain types of expressions, e.g., those where statistical substitution is not possible or feasible. For example, current statistical machine translation systems cannot translate a Chinese number into English until that number has been seen in training and its correct translation has been learned. Similar issues may exist for translations of dates and of proper nouns such as names.
In addition, it may be desirable to conform a machine translation output to certain formats. The most desirable format may be different from the one used in the training corpus, or the training corpus may itself be inconsistent. As an example, Chinese names may be present in a training corpus with the family name first, followed by the given name. However, it is more conventional to print the translation in English with the given name first. This may make it desirable to change the output so that it deviates from what was seen in the parallel training data.
Certain modern statistical machine translation systems have integrated a rule-based translation component for expressions such as numbers and dates. There have also been attempts to combine statistical translation with other full-sentence machine translation systems by performing an independent translation with each of the different systems and deciding which system provides the better translation.
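To illustrate why numbers lend themselves to rules rather than to learned substitution, the following is a minimal sketch of a rule-based converter for Chinese numerals, written in Python. It is not taken from any particular system: the function names `chinese_to_int` and `render_english`, the character tables, and the choice to render the result as digits with thousands separators are all assumptions made for illustration. The sketch handles the units 十, 百, 千, 万, and 亿, but not nested large units such as 万亿.

```python
# Illustrative sketch only: names and tables are hypothetical, not from any real system.
DIGITS = {"零": 0, "〇": 0, "一": 1, "二": 2, "两": 2, "三": 3, "四": 4,
          "五": 5, "六": 6, "七": 7, "八": 8, "九": 9}
SMALL_UNITS = {"十": 10, "百": 100, "千": 1000}
LARGE_UNITS = {"万": 10**4, "亿": 10**8}


def chinese_to_int(text: str) -> int:
    """Convert a Chinese numeral string to an integer using fixed rules."""
    total = 0    # value of sections already closed by 万 or 亿
    section = 0  # value accumulated since the last large unit
    digit = 0    # most recent digit not yet attached to a unit
    for ch in text:
        if ch in DIGITS:
            digit = DIGITS[ch]
        elif ch in SMALL_UNITS:
            # A bare unit such as 十 counts as one of that unit (十五 = 15).
            section += (digit or 1) * SMALL_UNITS[ch]
            digit = 0
        elif ch in LARGE_UNITS:
            # Close the current section and scale it by the large unit.
            section += digit
            total += section * LARGE_UNITS[ch]
            section = 0
            digit = 0
        else:
            raise ValueError(f"not a Chinese numeral character: {ch!r}")
    return total + section + digit


def render_english(text: str) -> str:
    """Render the numeral in a common English format, e.g. '23,000'."""
    return f"{chinese_to_int(text):,}"


if __name__ == "__main__":
    for sample in ["十五", "三百二十一", "一千零五", "两万三千"]:
        print(sample, "->", render_english(sample))
```

Because the rules compose the value from individual characters, a numeral that never occurred in the training corpus can still be converted, which is the gap that a purely statistical model leaves open.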