Exemplary embodiments of the present invention relate to a method for establishing paraphrasing data for a machine translation system, and more particularly, to such a method that can improve the performance of machine translation by automatically establishing paraphrasing data of a source language.
In general, machine translation refers to a technology that automatically converts one language into another language using natural language processing techniques, in order to solve communication problems caused by language barriers.
Among the several methods for machine translation, statistical machine translation (SMT), which learns the parameters of a model through statistical analysis of a bilingual corpus and translates an input sentence based on that model, has been actively researched.
Further, the statistical models used in statistical machine translation have gradually become more sophisticated, and paraphrasing methods have been researched for the effective translation of idiomatic expressions.
In order to use such a paraphrasing method, it is important to establish paraphrasing data of a source language. Methods for establishing paraphrasing data in the related art may be classified into a method using a bilingual corpus and a method of manually establishing paraphrasing data from source language sentences.
First, the method using a bilingual corpus compares all pairs of source language sentences and object language sentences in the bilingual corpus, treats all source language sentences that have the same object language sentence as paraphrases of one another, and extracts sentence-level paraphrasing data from each such set of source language sentences.
However, this method has the problem that it cannot be applied when no bilingual corpus is available. Further, sentence-level paraphrasing has a narrow range of application, and the method cannot be properly applied when paraphrasing at the level of a word or a syntactic phrase is required.
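The related-art bilingual-corpus method described above can be illustrated with a minimal sketch. This is not an implementation taken from any particular system; the function names `extract_sentence_paraphrases` and `paraphrase_pairs`, the representation of the corpus as (source sentence, object language sentence) tuples, and the toy English and French sentences are all assumptions made for illustration.

```python
from collections import defaultdict
from itertools import combinations


def extract_sentence_paraphrases(bilingual_corpus):
    """Group source sentences that share an identical object language sentence.

    bilingual_corpus: iterable of (source_sentence, object_sentence) pairs
    (a hypothetical in-memory representation of the bilingual corpus).
    Returns a list of sorted lists, each holding two or more distinct source
    sentences that were paired with the same object language sentence.
    """
    sources_by_object = defaultdict(set)
    for source, obj in bilingual_corpus:
        # Exact string match after trimming whitespace; no fuzzy matching.
        sources_by_object[obj.strip()].add(source.strip())
    # A set with a single source sentence yields no paraphrasing data.
    return [sorted(group) for group in sources_by_object.values() if len(group) > 1]


def paraphrase_pairs(paraphrase_sets):
    """Yield every unordered pair of source sentences within each set."""
    for group in paraphrase_sets:
        yield from combinations(group, 2)


# Toy corpus, invented for illustration only.
corpus = [
    ("Thank you very much.", "Merci beaucoup."),
    ("Thanks a lot.", "Merci beaucoup."),
    ("Good morning.", "Bonjour."),
]
paraphrase_sets = extract_sentence_paraphrases(corpus)
pairs = list(paraphrase_pairs(paraphrase_sets))
```

In this sketch the two sentences paired with "Merci beaucoup." form one paraphrase set, while "Good morning." yields nothing. The sketch also makes the stated limitations visible: it requires a bilingual corpus as input, and it can only relate whole sentences, not words or phrases within them.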
On the other hand, the method of manually establishing paraphrasing data from source language sentences has the following three problems.
First, since it is not easy to define accurately to what level the paraphrasing is to be performed, it is difficult to establish the paraphrasing data.
Second, since the paraphrasing data is established manually, the resulting data lacks consistency.
That is, since individuals differ in how they determine the level of the paraphrasing, both the decision whether to establish paraphrasing data for a given sentence and the resulting paraphrases of that sentence may differ from person to person, which causes a lack of consistency.
Last, this method has the problem that paraphrasing data may be established that does not contribute to improving the machine translation performance.
That is, since the result of paraphrasing for another purpose, such as language education, may differ from the result of paraphrasing for machine translation, paraphrasing data established for language education may be of no use in improving the machine translation performance.