Statistical machine translation (SMT) is a machine translation technique in which translations are generated on the basis of statistical models. The models' parameters are derived from the analysis of one or more bilingual text corpora, where a text corpus is a large and structured set of texts, usually stored and processed electronically. The statistical approach contrasts with the rule-based approach to machine translation (RBMT) as well as with example-based machine translation (EBMT).
The ideas behind SMT systems come from information theory. Essentially, a document is translated according to the probability that a string in a native language (e.g., English) is a translation of a string in a foreign language (e.g., German). Benefits of SMT over other techniques include better use of resources and more natural translations. With respect to resources, a great deal of natural language is already in machine-readable format, and SMT systems are not limited to any specific pair of languages, whereas RBMT systems require manual development of linguistic rules, which can be costly and which often do not generalize to other languages.
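The probabilistic selection described above can be illustrated with a minimal sketch. It assumes the common noisy-channel decomposition, in which the native string e is chosen to maximize P(e) multiplied by P(f | e) for a foreign string f; the source text does not state this decomposition, and the candidate strings, the probability values, and the name best_translation are all hypothetical and invented for illustration.

```python
import math

# Hypothetical toy scores; the numbers are invented for illustration only.
# translation_model[e] stands in for P(f | e) for one fixed foreign sentence f.
translation_model = {
    "the house is small": 0.20,
    "small is the house": 0.20,
    "the home is little": 0.05,
}

# language_model[e] stands in for P(e), the fluency of the native string e.
language_model = {
    "the house is small": 0.010,
    "small is the house": 0.001,
    "the home is little": 0.004,
}


def best_translation(tm, lm):
    """Return the candidate e that maximizes log P(e) + log P(f | e)."""
    return max(tm, key=lambda e: math.log(lm[e]) + math.log(tm[e]))


best = best_translation(translation_model, language_model)
print(best)
```

In this toy setting the first two candidates are equally likely as translations, so the language model score decides between them. A real system would estimate both models from corpora and search a far larger candidate space.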
In word-based SMT, the translated elements are words. The number of words in a sentence and in its translation often differs because of compound words, morphology and idioms. Simple word-based translation cannot handle language pairs with fertility rates different from one (i.e., where one word corresponds to more or fewer than one word in the other language) unless a single word in the foreign language is mapped to multiple words in the native language. However, such a mapping typically does not work in the reverse translation direction.
As a result, phrase-based translation systems were developed to overcome this limitation by translating sequences of words to sequences of words, where the source and target sequences can differ in length. The sequences of words are called, for example, blocks or phrases. These phrases are typically not linguistic phrases; they are found using statistical methods from the corpus, because the use of linguistic phrases has been shown to decrease translation quality.
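As a rough illustration of translating word sequences to word sequences of differing lengths, the following sketch looks up phrases in a small table using greedy longest-match, left-to-right segmentation. This is a simplification chosen for brevity and is not the search procedure of any particular decoder: it uses no probabilities and performs no reordering. The table entries and the names PHRASE_TABLE and translate_monotone are hypothetical.

```python
# Hypothetical toy phrase table; the entries are invented for illustration.
# Keys are source word sequences, values are target word sequences.
PHRASE_TABLE = {
    ("das", "haus"): "the house",   # two words -> two words
    ("ist",): "is",                 # one word -> one word
    ("klein",): "small",
    ("natuerlich",): "of course",   # one word -> two words
    ("guten", "tag"): "hello",      # two words -> one word
}
MAX_PHRASE_LEN = max(len(key) for key in PHRASE_TABLE)


def translate_monotone(sentence):
    """Translate left to right, always taking the longest matching phrase."""
    words = sentence.lower().split()
    output = []
    i = 0
    while i < len(words):
        # Try the longest candidate phrase first, then shorter ones.
        for n in range(min(MAX_PHRASE_LEN, len(words) - i), 0, -1):
            source = tuple(words[i:i + n])
            if source in PHRASE_TABLE:
                output.append(PHRASE_TABLE[source])
                i += n
                break
        else:
            # No phrase matched: pass the unknown word through unchanged.
            output.append(words[i])
            i += 1
    return " ".join(output)


result = translate_monotone("Natuerlich ist das Haus klein")
print(result)
```

Because the sketch never reorders phrases, its output keeps the source word order ("of course is the house small"). A statistical decoder would instead score alternative segmentations and orderings and select the most probable one.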
Statistical machine translation systems are widely advocated as a promising approach to achieving translation quality at least comparable to the best RBMT systems, with greatly reduced effort to adapt to new language pairs and new domains, provided that sufficient parallel training data is available. One such system is the widely used Pharaoh phrasal SMT decoder (hereinafter Pharaoh or Pharaoh Decoder). However, to date, SMT systems have been much slower than the best RBMT systems. For example, LANGUAGE WEAVER, currently the only commercial provider of SMT systems, claims to translate 5,000 words per minute per CPU, while SYSTRAN, the market leader in commercial RBMT, claims to translate up to 450 words per second (27,000 words per minute) per CPU, which is more than five times as fast.
As a result, there is a desire to increase the speed and computational efficiency of SMT algorithms while preserving the advantages of SMT over other techniques (e.g., high translation quality and efficient adaptability to new language pairs).