Consider a text in a source language that needs to be translated into a target language. Current statistical MT systems operate on a sentence-by-sentence or line-by-line basis, with each source-language sentence or line being translated without consideration of the other sentences or lines in the surrounding text. Thus, if we extract a single sentence or line from the source text and put it into the MT system, that sentence will be translated in the same way as it would be if the whole source text were put into the system.
However, the meaning of a word sequence, and thus the way in which it may be correctly translated, typically depends on the context provided by surrounding text. Consider two hypothetical English paragraphs. The first paragraph begins: “After the auditor's report was released, doubt was cast on the institution's financial stability. Mr. Jones maintained that the bank was solid.” The second paragraph begins: “After the flood, the townspeople asked the hydraulic engineer about future problems with the river bank. Mr. Jones maintained that the bank was solid.” Since most languages do not share with English the convention that “bank” can mean both a financial institution and the slope leading to a body of water, nor the convention that “solid” can refer both to financial integrity and physical firmness, a good human translator would typically translate the sentence “Mr. Jones maintained that the bank was solid” into quite different target-language sentences, depending on which paragraph it was in. By contrast, a conventional statistical MT system would translate this sentence in the same way in both cases. This is shown in the top part of FIG. 1, where each source sentence or translation unit is passed individually through the “global translation model” mapping source S onto target T. For the conventional system, information outside translation unit i does not affect the translation of translation unit i into the target language.
Each human language is made up of “sublanguages” or “discourse domains” within each of which words and phrases tend to have a single, unambiguous meaning. We can think of these sublanguages as regions located in a multidimensional sublanguage space. Of course, this space is continuous, and regions have fuzzy boundaries. That is, sublanguages overlap and blend into each other, forming a multidimensional continuum; they should not be thought of as discrete bubbles with clearly defined boundaries.
Most sublanguages will have equivalents in other human languages. Given a parallel text corpus consisting of documents belonging to a given sublanguage expressed in both a source and a target language, one can use well-understood techniques from the field of statistical machine translation to train a model for translating new documents written in the source language and belonging to that sublanguage to the target language. For instance, one might use a corpus of financial news stories in which each story is expressed in both English and French to train a system for translating financial news from English to French. Similarly, a corpus of parallel English and French articles about geography could be used to train a system for translating English texts about geography into French. The former system would tend to translate the English word “bank” into “banque” (the financial institution) while the latter would tend to translate it into “rive” (the geographical feature). Of course, the problem is that although today's techniques show how to create a specialized MT system for translating documents within a certain domain or sublanguage, many—perhaps most—applications of MT require a system that can produce satisfactory translations of documents that come from a varying, unpredictable mix of domains. Today's techniques are unable to create such a system.
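The effect of such domain-specific training can be sketched with hypothetical lexical translation tables. The tables and probabilities below are invented for illustration; real tables would be estimated from the aligned parallel corpora described above.

```python
# Hypothetical lexical translation tables, standing in for models trained
# on a financial corpus and a geographical corpus respectively.
# All probabilities here are invented for illustration.
FINANCE_TABLE = {"bank": {"banque": 0.95, "rive": 0.05}}
GEOGRAPHY_TABLE = {"bank": {"banque": 0.10, "rive": 0.90}}

def translate_word(word, table):
    """Pick the highest-probability target word for a source word."""
    candidates = table.get(word)
    if candidates is None:
        return word  # pass unknown words through unchanged
    return max(candidates, key=candidates.get)
```

Under these tables, the finance-trained model renders “bank” as “banque” while the geography-trained model renders it as “rive”, mirroring the behavior described above.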
A defender of current statistical machine translation systems might say that these systems can already be adapted to a new domain, by virtue of the fact that they tend to work well for the type of document they have been trained on. For instance, one may have trained such a system on a group of news stories about events of general interest. Then, a client requests a customized version of the system that will produce good translations for specialized articles about finance. One then “adapts” the system to this domain by retraining it on a bilingual, parallel corpus of specialized articles about finance. For instance, a system trained on financial articles will know only the financial meaning of the English word “bank”, and thus accurately translate this word when new sentences from financial articles are given to it.
However, training a current statistical machine translation system is a slow off-line process, requiring hundreds or thousands of times more computation than the process of translation itself. Currently, retraining a state-of-the-art system for a new domain may take several days. One might try to anticipate future needs by training a current system on a mix of types of documents, but that will not yield good results: it just increases the amount of ambiguity. For instance, a current system trained on a mixed set of documents might translate “bank” half the time with the financial sense and half the time with the geographical sense, irrespective of the type of document it is given to translate, thus yielding poor overall performance. Thus, the defender of current statistical machine translation systems would be wrong—these systems cannot be adapted quickly.
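The flattening effect of mixed-domain training can be seen in a toy relative-frequency estimate. The data below are invented: equal amounts of financial and geographical aligned text, which is exactly the situation that leaves the mixed model maximally ambiguous.

```python
from collections import Counter

def estimate_lexical_probs(aligned_pairs):
    """Relative-frequency estimate of P(target | source) from word-aligned pairs."""
    counts = Counter(aligned_pairs)
    totals = Counter(src for src, _ in aligned_pairs)
    return {(s, t): c / totals[s] for (s, t), c in counts.items()}

# Invented toy data: equal amounts of financial and geographical text.
finance = [("bank", "banque")] * 50
geography = [("bank", "rive")] * 50

mixed = estimate_lexical_probs(finance + geography)
# The mixed model is maximally ambiguous: P(banque|bank) = P(rive|bank) = 0.5,
# so roughly half of its translations of "bank" will be wrong whatever the
# type of document being translated.
```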
There is some background prior art concerned with statistical machine translation.
U.S. Pat. No. 5,805,832 by Brown et al. discusses a means of translating text from a source language into a target language. The system assigns probabilities and scores to various target-language translations and makes available the translations with the highest scores and probabilities.
U.S. Pat. No. 6,182,026 by Tillmann et al. describes translating a source text into a target text by mapping source words onto target words using both a translation model and a language model.
And the published application US 2004/0030551 by Marcu et al. discloses the use of a phrase-based joint probability model, in which the model learns phrase-to-phrase alignments from word-to-word alignments generated by a machine translation system.
None of this prior art teaches the construction of a bilingual sublanguage space for adaptive statistical machine translation, nor the use of extra-sentential context in the source text for adaptive statistical machine translation.
A group of researchers at Carnegie Mellon University (CMU) has been exploring a form of adaptation of a statistical MT system in which information from the source sentence currently being translated is used to adapt the target language model; however, this approach does not take the surrounding context into consideration. This is done by feeding one or more initial translation hypotheses for the source sentence into an information retrieval system, which then locates documents in the target language that are used to retrain the language model. The source sentence is then retranslated, using the new language model. For this CMU approach, see “Language Model Adaptation for Statistical Machine Translation with Structured Query Models” by B. Zhao, M. Eck, and S. Vogel (in COLING 2004, Geneva, Switzerland).
It will be seen that neither the construction of a bilingual sublanguage space nor the use of extra-sentential information is involved in this CMU approach. Furthermore, the CMU approach is extremely inefficient in computational terms. For each new source sentence, one or more initial translations must be carried out, a large set of documents queried, the language model rebuilt, and the translation carried out again with the new language model.
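The per-sentence cost of that loop can be made explicit in a schematic sketch using trivial stand-in components. The class and function names below are placeholders for illustration, not the published CMU implementation.

```python
class StubSystem:
    """Stand-in for a statistical MT decoder."""
    def translate(self, sentence, language_model=None):
        prefix = "adapted: " if language_model else "initial: "
        return prefix + sentence

class StubIndex:
    """Stand-in for an IR index over target-language documents."""
    def query(self, hypothesis):
        return ["document mentioning " + w for w in hypothesis.split()]

def train_language_model(documents):
    """Stand-in for rebuilding a target language model from retrieved documents."""
    return {"num_docs": len(documents)}

def cmu_style_translate(sentence, system, index):
    hypothesis = system.translate(sentence)       # 1. initial translation pass
    documents = index.query(hypothesis)           # 2. IR query with the hypothesis
    adapted_lm = train_language_model(documents)  # 3. rebuild the language model
    return system.translate(sentence, language_model=adapted_lm)  # 4. retranslate
```

Note that every sentence is translated twice and a language model is rebuilt in between, which is the source of the inefficiency described above.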
Another attempt to achieve adaptation of a statistical machine translation system was presented by A. Lagarda and A. Juan at the 7-8 Nov. 2002 meeting of the “Transtype 2” European project in Valencia, Spain. Using a “bag of words” representation for sentences, these researchers clustered a set of training sentences to create a sentence mixture model. This model was used to help decode new sentences. The experimental results showed no improvement over the original system. The system decoded sentences individually, without considering their context in the source text.
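The “bag of words” representation underlying such clustering can be sketched as follows. The example sentences and cluster centroids are invented for illustration; this is not the Lagarda and Juan system itself.

```python
from collections import Counter
import math

def bag_of_words(sentence):
    """Represent a sentence as an unordered word-count vector."""
    return Counter(sentence.lower().split())

def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(count * b.get(word, 0) for word, count in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def nearest_cluster(sentence, centroids):
    """Assign a sentence to its most similar cluster centroid."""
    vec = bag_of_words(sentence)
    return max(centroids, key=lambda name: cosine(vec, centroids[name]))

# Invented centroids, standing in for clusters learned from training sentences.
centroids = {
    "finance": bag_of_words("bank loan interest deposit account"),
    "geography": bag_of_words("river bank flood water slope"),
}
```

Because each sentence is represented and clustered in isolation, the representation carries no information about the surrounding source text, consistent with the limitation noted above.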