International e-commerce involves many of the world's major languages, especially those with relatively broad coverage. Machine translation technologies are needed to overcome the language barriers that arise in information acquisition, searches, transactions, and similar activities.
A number of widely used languages, such as German, Finnish, Japanese and Arabic, can be classified as agglutinative languages. Because word roots and affixes combine flexibly in these languages, they contain a large number of single words that are spliced together from smaller units. Such words generally cannot be covered by a training corpus and are therefore unlisted (out-of-vocabulary) words. As a result, a valid translation may not be obtained when they are decoded by a machine translation system, which severely affects the readability of the translated text. Take the German word “Leserkommentarspaltenhöllenlärm” as an example. It is a synthetic word formed by combining three semantic items, and as such it is an unlisted word.
However, when this German word is partitioned and paraphrased into “Leserkommentar Spalten Höllenlärm”, a suitable translation can be obtained. Evidently, though, simple partitioning cannot fully resolve the problem. For example, a Chinese word may be divided into two parts that are each translated into English separately, and the translated text obtained in this way may be “middle country”. Moreover, partitioning an unlisted word may generate more unlisted words. For example, “x1x2x3x4” may be divided into “x1x2 x3x4”. This not only increases the number of unlisted words, but may also scatter the two resulting unlisted words across the translated text when word order is adjusted, further weakening the readability of the translated text.
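The two effects described above can be illustrated with a minimal sketch. It assumes a toy vocabulary standing in for the words covered by a training corpus. The names `VOCAB`, `split_compound` and `count_oov` are hypothetical and introduced only for illustration. The splitter is a plain dictionary-based dynamic program, not the method of any particular translation system.

```python
from typing import List, Optional, Set

# Toy vocabulary standing in for words covered by a training corpus (assumption).
VOCAB: Set[str] = {"leserkommentar", "spalten", "höllenlärm"}


def split_compound(word: str, vocab: Set[str]) -> Optional[List[str]]:
    """Split `word` into vocabulary entries using the fewest parts.

    Returns None when no segmentation into known entries exists.
    """
    word = word.lower()
    n = len(word)
    # best[i] is the fewest-part segmentation of word[:i], or None if there is none.
    best: List[Optional[List[str]]] = [None] * (n + 1)
    best[0] = []
    for end in range(1, n + 1):
        for start in range(end):
            prefix = best[start]
            piece = word[start:end]
            if prefix is None or piece not in vocab:
                continue
            candidate = prefix + [piece]
            current = best[end]
            if current is None or len(candidate) < len(current):
                best[end] = candidate
    return best[n]


def count_oov(tokens: List[str], vocab: Set[str]) -> int:
    """Count tokens that are not in the vocabulary (unlisted words)."""
    return sum(1 for token in tokens if token.lower() not in vocab)


# The compound itself is unlisted, but it splits into three listed parts.
compound = "Leserkommentarspaltenhöllenlärm"
parts = split_compound(compound, VOCAB)

# A naive split of an unlisted word whose parts are also unlisted
# turns one unlisted word into two.
before = count_oov(["x1x2x3x4"], VOCAB)
after = count_oov(["x1x2", "x3x4"], VOCAB)
```

Under these assumptions, `parts` contains the three known components of the German compound, while `before` and `after` show the count of unlisted words rising from one to two after the naive split.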
Existing partition and paraphrase methods mainly rely on training a model on a corpus that linguists have produced through manual word segmentation and grammatical labeling, and then performing word segmentation using the trained model. If the text to be translated differs substantially from the text on which the model was trained, more resources are needed to update the existing trained model.
In view of the above, existing word segmentation technologies have the problem that more resources are needed to update an existing trained model when the text to be translated differs significantly from that model.