Translating long sentences has been a severe problem in Statistical Machine Translation (SMT). An SMT system often fails to produce a correct translation when a sentence is too long, and sometimes fails to process the sentence at all.
To avoid the difficulty of translating a long sentence, a common approach is to split it into shorter sub-sentences and translate those instead. Previous research has shown this to be an effective method. Better performance can be achieved simply by translating each sub-sentence in turn and concatenating the results in order, especially for spoken-language sentences, which tend to have simple structures.
To split a long input sentence, the first problem to be solved is defining reasonable splitting criteria, that is, determining the right splitting positions. A corpus-based SMT system includes a large-scale parallel bilingual corpus for model training. The source side of this corpus can be used to learn the splitting positions. However, the corpus usually contains a number of long bilingual sentence pairs, which causes the following problems: first, source-side sentences that are too long cannot provide sufficient information for splitting; second, bilingual sentence pairs that are too long usually cause more word alignment errors, which directly harm translation quality.
In general, punctuation marks provide useful splitting information. However, it is difficult to obtain satisfactory results by using punctuation directly or by supplementing it with only simple hand-written rules. Moreover, because syntactic systems differ between languages, splitting from a monolingual point of view alone may yield sub-sentences whose translations are no longer relatively independent sentences, or may change the word order. Therefore, the parallel corpus needs to be split from a bilingual point of view.
After a proper training corpus has been acquired, another problem to be solved is how to split a long input sentence into a plurality of sub-sentences. Splitting a long sentence can be viewed as a sequence labeling task: each word in the word sequence of the long sentence is assigned a label from a given label set, and splitting is then performed according to the labeling results.
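The last step of the sequence labeling view, turning per-word labels into sub-sentences, can be sketched as follows. This is a minimal illustration only: the two-label set ("B" for a word that ends a sub-sentence, "N" otherwise) and the function name are assumptions made for this example, not labels defined in the text.

```python
from typing import List

BOUNDARY = "B"      # assumed label: a split follows this word
NO_BOUNDARY = "N"   # assumed label: no split after this word


def split_by_labels(words: List[str], labels: List[str]) -> List[List[str]]:
    """Cut a word sequence after every word labeled BOUNDARY."""
    if len(words) != len(labels):
        raise ValueError("each word needs exactly one label")
    sub_sentences: List[List[str]] = []
    current: List[str] = []
    for word, label in zip(words, labels):
        current.append(word)
        if label == BOUNDARY:
            sub_sentences.append(current)
            current = []
    if current:  # trailing words after the last boundary
        sub_sentences.append(current)
    return sub_sentences
```

How the labels themselves are predicted is the actual modeling problem; this sketch only shows how a label sequence determines the split.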
In summary, the following two problems need to be solved to improve the translation quality of long input sentences in an SMT system:
(1) How to split the parallel bilingual corpus in the training phase;
(2) How to split long input sentences in the decoding phase.
As to the first problem, “splitting parallel bilingual corpus in training phase”, previous research has used a “modified IBM-1 translation model” to find an optimal splitting point in a bilingual sentence pair and split the pair into two parts; the method is then applied recursively to the resulting sub-sentence pairs until the length of each new sub-sentence is smaller than a pre-set threshold. However, this splitting method is relatively complicated.
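The recursive control flow of such a method can be sketched as below. The split-point finder is passed in as a hypothetical callback (`find_split`), because the modified IBM-1 scoring itself is not described here; the sketch also assumes, for simplicity, that the two halves keep the same order on both sides.

```python
from typing import Callable, List, Optional, Tuple

Pair = Tuple[List[str], List[str]]
# Hypothetical callback: returns (source cut, target cut) or None if no split is found.
SplitFinder = Callable[[List[str], List[str]], Optional[Tuple[int, int]]]


def split_recursively(src: List[str], tgt: List[str],
                      find_split: SplitFinder, max_len: int) -> List[Pair]:
    """Split a sentence pair until both sides are shorter than max_len."""
    if len(src) < max_len and len(tgt) < max_len:
        return [(src, tgt)]
    point = find_split(src, tgt)
    if point is None:
        return [(src, tgt)]
    i, j = point
    # Reject cuts that would leave an empty side, so recursion always shrinks.
    if not (0 < i < len(src) and 0 < j < len(tgt)):
        return [(src, tgt)]
    left = split_recursively(src[:i], tgt[:j], find_split, max_len)
    right = split_recursively(src[i:], tgt[j:], find_split, max_len)
    return left + right


def midpoint_split(src: List[str], tgt: List[str]) -> Tuple[int, int]:
    """Toy stand-in for a model-based finder: cut both sides in the middle."""
    return len(src) // 2, len(tgt) // 2
```

For example, `split_recursively(src, tgt, midpoint_split, max_len=3)` keeps halving an eight-word pair until every piece has fewer than three words per side. A real finder would score candidate cuts with a translation model instead of cutting at the midpoint.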
In addition, previous research has used the results of automatic word alignment to split a bilingual sentence pair. That method looks for an optimal splitting point at punctuation marks according to some rules of thumb, and splits the bilingual sentence pair into two shorter sub-sentence pairs at that point. The resulting sub-sentence pairs are then split again recursively until no splitting point remains. This method takes the influence of alignment errors into account only roughly. It aims to shorten sentences in order to reduce the search space of the parse trees corresponding to the sentence, not to improve the quality of word alignment.
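One way to picture an alignment-based choice of splitting point is sketched below. The rules of thumb used in the cited work are not specified here, so this example substitutes a simple assumed criterion: among cuts placed right after a punctuation token on each side, pick the pair of cuts crossed by the fewest alignment links. The punctuation set and function names are likewise assumptions for illustration.

```python
from typing import List, Optional, Set, Tuple

PUNCTUATION = {",", ";", ":", ".", "?", "!"}  # assumed punctuation set


def count_crossing_links(links: Set[Tuple[int, int]], i: int, j: int) -> int:
    """Count links joining a word before one cut to a word after the other."""
    return sum(1 for s, t in links if (s < i) != (t < j))


def best_punctuation_split(src: List[str], tgt: List[str],
                           links: Set[Tuple[int, int]]) -> Optional[Tuple[int, int]]:
    """Return the (source cut, target cut) with the fewest crossing links."""
    # A cut at position k means "split before token k"; the final token is
    # skipped so that neither half can be empty.
    src_cuts = [k + 1 for k, w in enumerate(src[:-1]) if w in PUNCTUATION]
    tgt_cuts = [k + 1 for k, w in enumerate(tgt[:-1]) if w in PUNCTUATION]
    best: Optional[Tuple[int, int]] = None
    best_cost: Optional[int] = None
    for i in src_cuts:
        for j in tgt_cuts:
            cost = count_crossing_links(links, i, j)
            if best_cost is None or cost < best_cost:
                best, best_cost = (i, j), cost
    return best
```

The function returns `None` when either side has no internal punctuation, which corresponds to the stopping condition "no splitting point remains" in the recursive procedure.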
As to the second problem, “splitting long input sentences in decoding phase”, one commonly used solution is an N-gram language model based on the Hidden Markov Model (HMM). An example is the “hidden-ngram” command in the SRILM toolkit, which uses an N-gram model to label a word sequence with hidden events occurring between words (here, the hidden events are ‘boundary’ and ‘no-boundary’). Applied to splitting a long sentence, this approach performs sentence boundary labeling on each word in the long sentence, computes probability scores according to the N-gram language model, finds the most probable combination of the given word sequence and a label sequence, and splits the sentence according to the resulting labels.
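The idea of scoring hidden boundary events with an N-gram model can be illustrated with a deliberately reduced bigram sketch. It is not the SRILM implementation: the boundary token name, the dictionary-based model, and the floor value for unseen bigrams are all assumptions. With a bigram model each gap between two words can be decided on its own, by comparing the score of the word pair with and without a boundary token in between; higher-order models need a search over whole label sequences, which this sketch does not implement.

```python
import math
from typing import Dict, List, Tuple

BOUNDARY_TOKEN = "<B>"             # assumed token standing for the 'boundary' event
UNSEEN_LOGPROB = math.log(1e-6)    # assumed floor for bigrams missing from the model

BigramModel = Dict[Tuple[str, str], float]  # (previous, next) -> log-probability


def bigram_logprob(lm: BigramModel, prev: str, word: str) -> float:
    """Look up log P(word | prev), falling back to the assumed floor."""
    return lm.get((prev, word), UNSEEN_LOGPROB)


def label_boundaries(words: List[str], lm: BigramModel) -> List[bool]:
    """Return one flag per gap between adjacent words; True means 'boundary'."""
    flags: List[bool] = []
    for prev, nxt in zip(words, words[1:]):
        # 'no-boundary': the next word directly follows the previous one.
        no_boundary = bigram_logprob(lm, prev, nxt)
        # 'boundary': a boundary token sits between the two words.
        boundary = (bigram_logprob(lm, prev, BOUNDARY_TOKEN)
                    + bigram_logprob(lm, BOUNDARY_TOKEN, nxt))
        flags.append(boundary > no_boundary)
    return flags
```

In this sketch the model would be estimated from training text in which boundary tokens have been inserted at known sentence boundaries; the probabilities used in any example call are made up for illustration.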
However, the most significant shortcoming of the HMM is that it relies on the output independence assumption, which prevents it from taking context information into account.