The exemplary embodiment relates to a machine translation system and method. It finds particular application in the context of preserving the place of intra-sentence separators, such as parentheses, brackets, quotes, tags, and the like during a machine translation process.
Statistical machine translation systems are often developed by inputting parallel data for the pair of languages for which the translation system is built. For example, mutual translations of source and target language texts and text fragments (parallel corpora) are used as data to feed a learning engine, which builds a statistical language model that can then be used by a translation tool to translate any sentence of the source language into the target language. Parallel text discovery systems have been developed for identifying pairs of sentences or text fragments which are translations of one another, starting from collections of parallel or comparable documents.
One problem in statistical machine translation is that the order of the words in the translated sentences is automatically learned by the system. In the learning process, the language model is trained based on n-grams (text fragments n words in length). The language model thus captures the order of the words commonly used in a given language. For example, given the words "black" and "cat", the system knows that "black" should be placed before "cat" because in the training corpus this combination is more frequent than the reverse order. The system may also capture statistically the information that, in English, the adjective is usually positioned before the noun. In other words, the system is able to learn the relative positions of words during the training process.
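The frequency argument above can be illustrated with a minimal sketch that counts bigrams (n-grams with n = 2) and picks the more frequent ordering of two words. The toy corpus and the function names `bigram_counts` and `preferred_order` are assumptions made for illustration only; an actual statistical language model would use smoothed probabilities over a far larger corpus.

```python
from collections import Counter


def bigram_counts(sentences):
    """Count adjacent word pairs (n-grams with n = 2) in whitespace-tokenized text."""
    counts = Counter()
    for sentence in sentences:
        words = sentence.lower().split()
        # zip(words, words[1:]) yields each word paired with its successor.
        counts.update(zip(words, words[1:]))
    return counts


def preferred_order(counts, a, b):
    """Return whichever ordering of a and b occurs more often in the counts."""
    return (a, b) if counts[(a, b)] >= counts[(b, a)] else (b, a)


# Toy corpus, invented for illustration only.
corpus = [
    "the black cat sat on the mat",
    "a black cat crossed the road",
    "the cat black as night slept",
]

counts = bigram_counts(corpus)
print(counts[("black", "cat")], counts[("cat", "black")])
print(preferred_order(counts, "cat", "black"))
```

In this toy corpus the pair ("black", "cat") is seen more often than ("cat", "black"), so the sketch places "black" first. Nothing in these counts says where a parenthesis belongs relative to the surrounding words.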
When the system is faced with separators, such as parentheses, quotes, brackets, tags and the like, the statistical language model usually does not provide sufficient information to place them correctly. For example, consider the sentence:
    The number of living languages (in 2007 about 6000, by most estimates) is decreasing rapidly.
A machine translation system is liable to align the words incorrectly. For example, a French translation produced by a statistical machine translation system may be:
    Le nombre de langues (en 2007, vivantes environ 6000, par la plupart) est en baisse des estimations rapidement.
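Here, two groups of words, "vivantes" and "des estimations", are not correctly placed by the system: the first has been moved inside the parentheses and the second has been moved outside them.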
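The misplacement can be made visible with a small sketch that separates the words inside the parentheses from those outside, for both the English sentence and the French output quoted above. The helper `split_by_parentheses` is a hypothetical name introduced here; it assumes round, non-nested parentheses and is not part of the exemplary system.

```python
import re


def split_by_parentheses(sentence):
    """Return (words_inside, words_outside) for round, non-nested parentheses."""
    inside_text = " ".join(re.findall(r"\(([^()]*)\)", sentence))
    outside_text = re.sub(r"\([^()]*\)", " ", sentence)

    def tokenize(text):
        # Lowercased word tokens; punctuation is discarded.
        return re.findall(r"\w+", text.lower())

    return tokenize(inside_text), tokenize(outside_text)


source = ("The number of living languages (in 2007 about 6000, "
          "by most estimates) is decreasing rapidly.")
output = ("Le nombre de langues (en 2007, vivantes environ 6000, "
          "par la plupart) est en baisse des estimations rapidement.")

src_inside, src_outside = split_by_parentheses(source)
out_inside, out_outside = split_by_parentheses(output)

# "living" is outside the parentheses in the source sentence,
# but its translation "vivantes" is inside them in the output.
print("living" in src_outside, "vivantes" in out_inside)

# "estimates" is inside the parentheses in the source sentence,
# but its translation "estimations" is outside them in the output.
print("estimates" in src_inside, "estimations" in out_outside)
```

The sketch only inspects the two example sentences; deciding which target words correspond to which source words would, in practice, require a word alignment rather than the hand-picked pairs used here.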
The exemplary embodiment provides a system and method for statistical machine translation which takes separators, such as parentheses, into consideration.