Machine translation refers to the use of an electronic computer to automatically translate a text in one natural language (a source language) into a text in another natural language (a target language). A software component that implements this process is called a machine translation system. With the development and popularization of electronic computers and the Internet, cultural exchanges between nations have become more frequent, and the problem of language barriers has emerged again in a new era. Machine translation is therefore needed more urgently than ever before.
Machine translation methods may be classified into rule-based and corpus-based types. The former constructs a knowledge source from a dictionary and a rule base. The latter constructs a knowledge source from a classified and tagged corpus; it needs no dictionary or rules and relies mainly on statistical regularities. Corpus-based methods may be further classified into statistics-based methods and example-based methods. These machine translation methods are briefly described as follows.
1) Rule-based Machine Translation Method
This method is generally performed with the aid of dictionaries, templates and manually organized rules. The original text in the source language is analyzed, its meaning is represented, and an equivalent translated text in the target language is then generated. A good rule-based machine translation apparatus needs a sufficient number of translation rules with sufficiently broad coverage, and must also be able to resolve conflicts among the rules effectively. Because the rules generally need to be compiled manually, the labor cost is high, and it is difficult to obtain a large number of translation rules with comprehensive coverage. Furthermore, translation rules provided by different people are likely to conflict with one another.
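The analysis, transfer and generation steps described above can be illustrated with a minimal sketch. The dictionary, the part-of-speech table and the single reordering rule below are hypothetical, hand-written toy resources for English to French; they are assumptions made for illustration and do not come from any particular rule-based system.

```python
# Hypothetical hand-written resources (assumptions for illustration only).
DICTIONARY = {"the": "le", "black": "noir", "cat": "chat"}
POS_TAGS = {"the": "DET", "black": "ADJ", "cat": "NOUN"}


def swap_adjective_noun(tagged):
    """Reordering rule: ADJ NOUN -> NOUN ADJ."""
    result = list(tagged)
    i = 0
    while i < len(result) - 1:
        if result[i][1] == "ADJ" and result[i + 1][1] == "NOUN":
            result[i], result[i + 1] = result[i + 1], result[i]
            i += 2  # skip past the pair that was just reordered
        else:
            i += 1
    return result


def translate_rule_based(tokens):
    # Analysis: attach a part-of-speech tag to every source token.
    tagged = [(tok, POS_TAGS.get(tok, "UNK")) for tok in tokens]
    # Transfer: apply the manually written reordering rule.
    reordered = swap_adjective_noun(tagged)
    # Generation: look up each token in the dictionary; unknown tokens pass through.
    return [DICTIONARY.get(tok, tok) for tok, _ in reordered]


translation = translate_rule_based(["the", "black", "cat"])
```

Even in this toy form, every new word or construction requires another manually written dictionary entry or rule, which reflects the labor cost and coverage limits noted above.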
2) Example-based Machine Translation Method
This method is based on examples, and mainly uses pre-processed bilingual corpora and translation dictionaries to perform translation. During translation, segments that match segments of the original text are first retrieved from a translation example base, the corresponding translated text segments are determined, and the translated text segments are then recombined to obtain the final translated text. As can be seen, the coverage of the translation examples and the way in which they are stored directly affect the translation quality and speed of this type of translation technology.
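The match-and-recombine procedure described above can be sketched as follows. The example base and the greedy longest-match strategy are assumptions chosen for brevity; an actual example-based system may use fuzzy matching and a more elaborate recombination step.

```python
# Hypothetical translation example base: source segments -> target segments.
EXAMPLE_BASE = {
    ("good", "morning"): ["bonjour"],
    ("thank", "you"): ["merci"],
    ("my", "friend"): ["mon", "ami"],
}


def translate_by_example(tokens, example_base):
    """Greedy longest-match lookup followed by left-to-right recombination."""
    max_len = max(len(segment) for segment in example_base)
    output = []
    i = 0
    while i < len(tokens):
        # Try the longest candidate segment first, then shorter ones.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            segment = tuple(tokens[i:i + n])
            if segment in example_base:
                output.extend(example_base[segment])
                i += n
                break
        else:
            # No stored example covers this token: pass it through unchanged.
            output.append(tokens[i])
            i += 1
    return output


translation = translate_by_example("good morning my friend".split(), EXAMPLE_BASE)
```

Tokens that no stored example covers are simply copied through here, which shows how directly the coverage of the example base limits translation quality.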
3) Statistics-based Machine Translation Method
The basic idea of this method is to gather statistics over a large parallel corpus in order to construct a statistical translation model, and to perform translation using that model. Early word-based machine translation has given way to phrase-based machine translation, and syntactic information is being integrated to further improve translation accuracy.
The method is based on a bilingual corpus. Translation knowledge in the bilingual corpus is represented as a statistical model through a machine learning method, translation rules are extracted, and original texts to be translated are then translated into texts of a target language according to the translation rules. The statistics-based machine translation method requires less manual processing and has a high processing speed; it is independent of specific examples and is not limited to particular application fields. This method therefore has clear advantages over the other two machine translation technologies, and performs relatively well among existing unrestricted-domain machine translation technologies.
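As a minimal sketch of how translation knowledge can be represented as a statistical model, the following code estimates translation probabilities by relative frequency. It assumes that phrase pairs have already been aligned and extracted from a parallel corpus; the sample pairs and function names are hypothetical, and a real system combines such a table with further models such as a language model.

```python
from collections import Counter, defaultdict


def estimate_phrase_table(aligned_pairs):
    """Relative-frequency estimate of P(target | source) from aligned phrase pairs."""
    pair_counts = Counter(aligned_pairs)
    source_counts = Counter(src for src, _ in aligned_pairs)
    table = defaultdict(dict)
    for (src, tgt), count in pair_counts.items():
        table[src][tgt] = count / source_counts[src]
    return table


def best_translation(table, src):
    """Return the most probable target phrase, or None if the source is unseen."""
    candidates = table.get(src)
    if not candidates:
        return None
    return max(candidates, key=candidates.get)


# Hypothetical aligned phrase pairs (assumed output of an alignment step).
PAIRS = [
    ("thank you", "merci"),
    ("thank you", "merci"),
    ("thank you", "merci"),
    ("thank you", "merci beaucoup"),
]

phrase_table = estimate_phrase_table(PAIRS)
choice = best_translation(phrase_table, "thank you")
```

No rule is written by hand in this sketch: the probabilities follow entirely from counts over the corpus, which is why the approach requires comparatively little manual processing.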
In view of the above, the statistics-based machine translation method is currently more commonly used than the other two methods. Since the 1990s, the statistics-based machine translation method has developed rapidly and has gradually become the core of research in the field of machine translation. During that period, scholars have proposed a number of statistics-based machine translation methods, including word-based, phrase-based, hierarchical phrase-based, syntax-based and semantics-based statistical machine translation methods.
The existing semantics-based statistical machine translation method is a statistical machine translation method based entirely on semantics, and its defects are apparent. First, the form of semantic expression used in this type of translation method is overly complicated and is not general enough (that is, the expression forms of the same semantic meaning differ greatly between languages). Furthermore, establishing a semantic analyzer for a specific language is extremely difficult. It is therefore difficult to use such an expression structure as an “intermediate language” for the translation method. Second, the semantic translation rules obtained by a statistical machine translation system based entirely on semantics are generally highly redundant. As a result, this type of translation method currently remains at the theoretical and experimental stage, and cannot be deployed at scale in industry.
In other existing statistics-based machine translation methods, the semantic level of the natural language is not thoroughly analyzed when the machine translation model is constructed, which leads to deviations between the semantic meaning of the generated translated text and the semantic meaning of the original text. Semantic consistency therefore cannot be achieved, which severely reduces the quality of the machine translation. For example, the word “apple” in the English source phrase “the apple product” expresses the semantic meaning of “Apple Company”. If this word is translated as “apple” in the sense of food, a semantic deviation results, which severely degrades the user experience.
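The “apple” example can be reproduced with a small sketch of a context-free, word-level lookup. The probability values and sense labels below are invented for illustration; they are assumptions and not figures from any corpus.

```python
# Hypothetical context-free probabilities P(target sense | source word).
WORD_TABLE = {
    "the": {"the": 1.0},
    "apple": {"apple_fruit": 0.8, "Apple_company": 0.2},
    "product": {"product": 1.0},
}


def translate_word_by_word(tokens, table):
    """Pick the most probable target for each word, ignoring its context."""
    output = []
    for tok in tokens:
        candidates = table.get(tok, {tok: 1.0})  # unknown words pass through
        output.append(max(candidates, key=candidates.get))
    return output


result = translate_word_by_word(["the", "apple", "product"], WORD_TABLE)
```

Because the choice for “apple” never takes the neighboring word “product” into account, the more frequent food sense is selected, which is the kind of semantic deviation described above.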
In view of the above, the statistical machine translation method based entirely on semantics needs to use a complicated structure of semantic expressions, and its practicability is therefore poor. Other statistics-based machine translation methods do not take semantic information into account, so a semantic inconsistency problem may occur when languages with large syntactic and semantic differences are processed, resulting in an unreadable translation even though “each word is correct”.
Generally, in a statistics-based machine translation method, the deviation of the semantic meaning of a translated text from that of the original text may be alleviated by obtaining a high-quality, large-scale bilingual parallel corpus. However, such a corpus is difficult to obtain for many languages. Relying on a high-quality, large-scale bilingual parallel corpus is therefore not an effective way to alleviate the semantic difference between a translated text and an original text in a statistics-based machine translation method.
In short, a problem of semantic inconsistency between an original text and a translated text exists when an existing statistics-based machine translation method is used for translation.