Existing statistical machine translation systems require the availability of bilingual parallel or comparable corpora for a given source and target language, and of text corpora in the given target language. They do not, however, benefit from the availability of text corpora in the given source language.
Let S represent a sentence in the source language (the language from which it is desired to translate) and T represent its translation in the target language. According to Bayes's Theorem, it can be shown for fixed S that the conditional probability of the target sentence T given the source, P(T|S), is proportional to P(S|T)*P(T). Thus, the earliest statistical machine translation systems (those implemented at IBM in the 1990s) sought to find a target-language sentence T that maximizes the product P(S|T)*P(T), where P(T) is the “language model”, a statistical estimate of the probability of a given sequence of words in the target language. The parameters of the language model are estimated from large text corpora written in the target language. The parameters of the target-to-source translation model P(S|T) are estimated from a parallel bilingual corpus, in which each sentence expressed in the source language is aligned with its translation in the target language.
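For reference, the step from Bayes's Theorem to the maximization criterion described above can be written out as follows. The symbol \hat{T}, denoting the translation that is selected, is introduced here for illustration only. Because S is fixed, the denominator P(S) is the same for every candidate T and can be dropped without changing which T attains the maximum.

```latex
\begin{align*}
\hat{T} &= \arg\max_{T} P(T \mid S) \\
        &= \arg\max_{T} \frac{P(S \mid T)\, P(T)}{P(S)} \\
        &= \arg\max_{T} P(S \mid T)\, P(T)
\end{align*}
```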
There also exist methods which explore bilingual comparable corpora. Such comparable corpora are collections of documents in both the source language S and the target language T, where it is known or suspected that the documents discuss the same or similar subjects, using roughly the same level of formality, technicality, etc., without necessarily being translations of each other. The existing methods identify parallel sentences in the comparable corpora and extract these as parallel bilingual data. These methods are imperfect and prone to errors. Moreover, they presuppose that the comparable corpora actually contain sentences which are parallel.
Today's statistical machine translation (SMT) systems do not function in a fundamentally different way from these 1990s IBM systems, although the details of the P(S|T) model are often somewhat different, and other sources of information are often combined with the information from P(S|T) and P(T) in what is called a loglinear combination. This means that instead of finding a T that maximizes P(S|T)*P(T), these systems search for a T that maximizes a function of the form P(S|T)^α1 * P(T)^α2 * g1(S,T)^β1 * g2(S,T)^β2 * ... * gK(S,T)^βK * h1(T)^δ1 * h2(T)^δ2 * ... * hL(T)^δL, where the functions gi( ) generate a score based on both source sentence S and each target hypothesis T, and functions hj( ) assess the quality of each T based on unilingual target-language information. Just as was done in the 1990s IBM systems, the parameters of P(S|T) and P(T) are typically estimated from bilingual parallel corpora and unilingual target-language text respectively. The parameters for functions gi( ) are sometimes estimated from bilingual parallel corpora and sometimes set by a human designer; the functions hj( ) are sometimes estimated from target-language corpora and sometimes set by a human designer (and of course, a mixture of all these strategies is possible). Both of these functions gi( ) and hj( ) might also explore additional sources of information, such as part of speech or syntactic annotation. This annotation is sometimes given for both source and target language and sometimes for only one of the two.
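The loglinear combination described above may be illustrated by the following minimal sketch, which is not taken from any particular SMT system. It computes the logarithm of the weighted product, which is a weighted sum of log feature values, and selects the highest-scoring hypothesis. The lookup tables, the length-based feature, the weights and all function names are hypothetical and chosen only for illustration; a real system would estimate the models from corpora and search a far larger hypothesis space.

```python
import math

# Hypothetical toy models; the probabilities below are made up for illustration.
TOY_TM = {  # stands in for P(S|T), keyed by (source, target)
    ("chat noir", "black cat"): 0.6,
    ("chat noir", "cat black"): 0.7,
    ("chat noir", "dark cat"): 0.2,
}
TOY_LM = {  # stands in for P(T)
    "black cat": 0.05,
    "cat black": 0.001,
    "dark cat": 0.01,
}

FEATURES = {
    # Translation model P(S|T); unseen pairs get a small floor value.
    "tm": lambda s, t: TOY_TM.get((s, t), 1e-9),
    # Language model P(T), a unilingual target-language component.
    "lm": lambda s, t: TOY_LM.get(t, 1e-9),
    # An assumed bilingual feature g1(S,T) penalising length mismatch.
    "length": lambda s, t: math.exp(-abs(len(s.split()) - len(t.split()))),
}

# Exponents of the loglinear combination (the alpha and beta weights).
WEIGHTS = {"tm": 1.0, "lm": 1.0, "length": 0.5}


def loglinear_score(source, target, features=FEATURES, weights=WEIGHTS):
    """Return the log of the weighted product of all feature values."""
    total = 0.0
    for name, feature in features.items():
        value = feature(source, target)
        if value <= 0.0:
            # A zero-valued feature drives the whole product to zero.
            return float("-inf")
        total += weights[name] * math.log(value)
    return total


def best_hypothesis(source, hypotheses, features=FEATURES, weights=WEIGHTS):
    """Pick the target hypothesis T with the highest loglinear score."""
    return max(
        hypotheses,
        key=lambda t: loglinear_score(source, t, features, weights),
    )


if __name__ == "__main__":
    candidates = ["black cat", "cat black", "dark cat"]
    print(best_hypothesis("chat noir", candidates))
```

In this sketch, setting the weights of P(S|T) and P(T) to 1 and all other weights to 0 reduces the score to log(P(S|T)*P(T)), i.e. the criterion of the 1990s systems. Working in the log domain is a common design choice because it avoids numerical underflow when many small probabilities are multiplied.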
Thus, we see that today's statistical machine translation systems benefit from the availability of bilingual parallel or bilingual comparable corpora for the two relevant languages S and T, since such corpora may be useful in estimating the parameters of the translation model P(S|T) and also, possibly, some bilingual components gi( ). Such SMT systems also benefit from the availability of text corpora in the target language T, for estimating the parameters of the language model P(T) and possibly other unilingual target-language components hj( ). Some SMT systems also benefit from additional information contained in annotated text.
However, acquiring unilingual text corpora in the source language S is not presently useful in improving an SMT system. To give an example, suppose one has a system for translating Chinese sentences into English sentences, and a huge collection of Chinese-only documents (with no accompanying English translations) becomes available. Such a collection is not presently useful in improving the quality of Chinese-to-English translations produced by the system.