The field of statistical machine translation (SMT) has been evolving. At an early stage, many translation systems were developed using a word-based approach. That is, the translation system processes words as the basic translation elements and replaces each source language word with a target language word to form a translated sentence. The probability that a sentence is the optimal translation result is approximated by multiplying probabilities indicating whether each target language word is an optimal translation for the corresponding source language word, together with a language model probability for the sentence in the target language. For example, a Markov chain (for example, N-gram) language model is used for determining the language model probability.
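The word-based scoring described above can be sketched as follows. This is a minimal illustration, not an actual SMT system: the lexical translation probabilities and the bigram language model below are hypothetical values invented for the example.

```python
import math

# Hypothetical lexical translation probabilities p(target | source) and a
# hypothetical bigram language model; the entries are illustrative only.
lex_prob = {
    ("blanca", "white"): 0.9,
    ("casa", "house"): 0.8,
}
bigram_prob = {
    ("<s>", "white"): 0.2,
    ("white", "house"): 0.3,
    ("house", "</s>"): 0.4,
}

def word_based_score(source_words, target_words):
    """Approximate the log-probability of a translation as the sum of
    per-word lexical log-probabilities plus bigram language model
    log-probabilities over the target sentence."""
    score = 0.0
    for f, e in zip(source_words, target_words):
        score += math.log(lex_prob[(f, e)])
    padded = ["<s>"] + target_words + ["</s>"]
    for prev, cur in zip(padded, padded[1:]):
        score += math.log(bigram_prob[(prev, cur)])
    return score

s = word_based_score(["blanca", "casa"], ["white", "house"])
```

The decoder would search over candidate target sentences for the one maximizing this score; here only the scoring of a single fixed candidate is shown.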
Recently, with the advent of phrase-based statistical machine translation (phrase-based SMT), significant progress has been achieved. By expanding the basic translation unit from words to phrases, the search space may be effectively reduced in phrase-based statistical machine translation. Here, a phrase refers to a partial character string of several consecutive words.
A general phrase-based statistical machine translation system still has some disadvantages. For example, while the general system can reliably perform a translation process of re-arranging several consecutive words recognized through training, most general translation systems cannot describe a long-distance dependency relationship between words. Further, some approaches to machine translation use a hierarchical phrase structure in the translation process. For example, synchronous context-free grammars are used in both the source language and the target language. Due to errors in segmentation for the translation and errors in the phrase and word alignments used for training translation rules, such approaches suffer from deteriorating translation accuracy when an accurate translation rule cannot be applied.
The tokenization process plays an important role in statistical machine translation because tokenizing a source sentence determines the basic translation unit in the statistical machine translation system.
FIG. 1 is a conceptual diagram illustrating a tokenization process and a translation process in a conventional statistical machine translation system.
As shown in FIG. 1, the conventional statistical machine translation system includes a tokenization unit 110 and a decoder 120. The tokenization unit 110 performs a tokenization process in a pre-processing process. The tokenization unit 110 receives an untokenized character string and then generates a tokenized character string. Further, the decoder 120 receives the tokenized character string from the tokenization unit 110 and finds an optimal translation for the received character string.
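The two-stage pipeline of FIG. 1 can be sketched as below. The tokenization stand-in and the phrase table are hypothetical; the point is only that the decoder operates on whatever single tokenization the pre-processing stage hands it.

```python
# Minimal sketch of the pipeline in FIG. 1: a tokenization unit (110) that
# pre-processes the untokenized character string, followed by a decoder (120)
# that translates the resulting tokens. The phrase table is illustrative only.

def tokenization_unit(raw: str) -> list[str]:
    # Stand-in for unit 110: produces one (1-best) tokenized character string.
    return raw.split()

PHRASE_TABLE = {"duo": "gain", "fen": "a point"}  # hypothetical entries

def decoder(tokens: list[str]) -> str:
    # Stand-in for unit 120: greedy, monotone token-by-token replacement.
    return " ".join(PHRASE_TABLE.get(t, t) for t in tokens)

out = decoder(tokenization_unit("duo fen"))
```

Because the decoder only ever sees the single tokenization emitted by the pre-processing stage, any tokenization error propagates unrecoverably into the translation, which motivates the problems discussed next.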
Since tokenization is generally ambiguous in various languages, a statistical machine translation system in which the tokenization process is separated from the translation process may often generate translation errors due to errors in the tokenization process. In particular, the method of segmenting sentences into proper words directly affects the translation performance for languages such as Chinese, which have no spaces in their writing systems. Further, in agglutinative languages such as Korean, one word may include a plurality of morphemes, so that a serious data sparseness problem may occur when the word itself is used as training data.
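The ambiguity of segmenting unspaced text can be made concrete with a toy example: a single character string may admit several dictionary-consistent segmentations, and a 1-best tokenizer must commit to one of them before decoding. The vocabulary below is hypothetical.

```python
# Enumerate every segmentation of an unspaced string into in-vocabulary
# tokens, illustrating why tokenization is inherently ambiguous.

VOCAB = {"a", "b", "c", "ab", "bc"}  # toy dictionary

def segmentations(s):
    """Return all ways to split s into tokens drawn from VOCAB."""
    if not s:
        return [[]]
    results = []
    for i in range(1, len(s) + 1):
        prefix = s[:i]
        if prefix in VOCAB:
            for rest in segmentations(s[i:]):
                results.append([prefix] + rest)
    return results

segs = segmentations("abc")
```

For the string "abc" the toy dictionary licenses three segmentations, and a tokenizer separated from the decoder must pick one without knowing which best serves the translation.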
Segmenting words into morpheme units can effectively improve the translation performance by taking the morpheme, which is a minimal meaningful unit in various languages, as the basic translation unit. However, even when a tokenization unit having an excellent performance is used, there is a limit to improving the translation quality because the accuracy of the tokenization unit cannot be 100%. Accordingly, a more proper tokenization method that reduces errors in the tokenization is required in statistical machine translation systems.
A lattice structure-based translation method for improving the translation performance replaces the 1-best tokenization with an n-best tokenization. However, a word lattice-based translation method still searches for a corresponding word in a limited search space. That is, since the tokenization process is separated from the decoding and the constructed token lattice relies on tokens filtered and pre-processed before the decoding process, the search space is still limited by the pre-processing process.
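The construction of such a word lattice from n-best tokenizations can be sketched as follows: each token becomes an edge between character offsets of the original sentence, so the decoder may choose among the alternatives. The sentence and tokenizations are illustrative.

```python
# Build a word lattice from the n-best tokenizations of one sentence.
# Each edge spans the character offsets covered by a token.

def build_lattice(tokenizations):
    """Return {(start, end): set(tokens)} over character offsets."""
    edges = {}
    for tokens in tokenizations:
        pos = 0
        for tok in tokens:
            span = (pos, pos + len(tok))
            edges.setdefault(span, set()).add(tok)
            pos += len(tok)
    return edges

# Hypothetical n-best tokenizations of the string "taofeike"
nbest = [["tao", "feike"], ["taofeike"], ["tao", "fei", "ke"]]
lattice = build_lattice(nbest)
```

Note that the lattice contains only the edges present in the n-best list produced before decoding; any tokenization outside that list remains unreachable, which is exactly the search-space limitation described above.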
The general statistical machine translation systems always separate the tokenization process from the decoding process as a pre-processing process, and conduct the decoding process separately. Such a process is not optimized for statistical machine translation. First, the optimal translation granularity for a given translation language pair is ambiguous. While the general statistical machine translation system may face serious data sparseness when using a large granularity, it may lose much useful information when using a small granularity.
For example, consider Chinese “duo fen” and English “gain a point”: since Chinese “duo fen” aligns word-by-word with English “gain a point”, it is preferable to split “duo fen” into the two words “duo” and “fen”. On the contrary, although Chinese “you” and “wang” together align with English “will have the chance to”, it is not preferable to split “you wang” into two words. This is because the decoder 120 tends to translate Chinese “wang” into the English verb “look” without the context information “you”.
Second, there may be an error in the tokenization process. For example, Chinese “tao fei ke” may be recognized as a Chinese name of which the family name is “tao” and the first name is “fei-ke”. However, considering the context, the full character string “tao fei ke” should be translated as an Indonesian badminton player's name.
Meanwhile, replacing the 1-best tokenization result with a lattice constructed using a plurality of tokens, which correspond to the tokenization results of one or more segmenters, helps to improve the translation performance. However, the search space is still limited by the lattice constructed with the tokens segmented before the decoding, which is problematic.