With the development of the Internet, the demands placed on network search keep rising. Consequently, more keywords and corpora need to be stored on a cloud corpus server for Internet users to draw on when searching the network.
The parallel corpora in common use today are bilingual or multilingual corpora consisting of an original text and its parallel translation, and they may be aligned at the word level, the sentence level, the paragraph level, or the article level. According to the translation direction, parallel corpora take three forms: uni-directional parallel corpora, bi-directional parallel corpora, and multi-directional parallel corpora.
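To make the alignment levels and translation directions concrete, the following is a minimal sketch of how a parallel corpus entry might be represented. The names `AlignmentLevel`, `AlignedUnit`, and `classify_direction` are hypothetical, and the classification rule (one language pair in one direction, one pair in both directions, or more than two languages) is an assumed reading of the three forms rather than a definition taken from the text.

```python
from dataclasses import dataclass
from enum import Enum


class AlignmentLevel(Enum):
    """The four alignment granularities named in the text."""
    WORD = "word"
    SENTENCE = "sentence"
    PARAGRAPH = "paragraph"
    ARTICLE = "article"


@dataclass(frozen=True)
class AlignedUnit:
    """One aligned entry: the same content in two or more languages."""
    level: AlignmentLevel
    texts: tuple  # tuple of (language_code, text) pairs


def classify_direction(pairs):
    """Classify a corpus by its set of (source_lang, target_lang) pairs.

    Assumed rule, for illustration only:
      - more than two languages involved -> multi-directional
      - two languages, both directions   -> bi-directional
      - two languages, one direction     -> uni-directional
    """
    pairs = set(pairs)
    if not pairs:
        raise ValueError("at least one translation direction is required")
    if any(src == tgt for src, tgt in pairs):
        raise ValueError("source and target language must differ")
    languages = {lang for pair in pairs for lang in pair}
    if len(languages) > 2:
        return "multi-directional"
    if len(pairs) == 2:
        return "bi-directional"
    return "uni-directional"


unit = AlignedUnit(
    level=AlignmentLevel.SENTENCE,
    texts=(("en", "How are you?"), ("zh", "你好吗？")),
)
```

A corpus holding only English-to-Chinese translations would be classified as uni-directional under this rule, while adding Chinese-to-English material would make it bi-directional.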
At present, establishing a parallel corpus relies on auxiliary processing, which generally comprises the steps of denoising, segmentation, punctuation processing, adding alignment marks, parallel alignment, and the like. Establishing a parallel corpus in this way consumes a great deal of manpower and material resources, and the corpus is not updated promptly, so its timeliness cannot be guaranteed.
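The preparation steps named above can be sketched as a small pipeline, taking the first step to mean noise removal. This is an illustrative simplification, not the procedure of any particular system: noise removal is assumed to mean stripping markup and extra whitespace, punctuation processing is assumed to mean mapping full-width marks to ASCII, segmentation is a naive split on sentence-final punctuation, and alignment is purely positional. All function names are hypothetical.

```python
import re

# Assumed punctuation processing: map common full-width marks to ASCII.
_PUNCT_MAP = str.maketrans(
    {"，": ",", "。": ".", "！": "!", "？": "?", "；": ";", "：": ":"}
)


def denoise(text):
    """Strip markup tags and collapse runs of whitespace."""
    text = re.sub(r"<[^>]+>", " ", text)
    return re.sub(r"\s+", " ", text).strip()


def normalize_punctuation(text):
    """Replace full-width punctuation with ASCII equivalents."""
    return text.translate(_PUNCT_MAP)


def segment(text):
    """Naively split into sentences after '.', '!' or '?'."""
    parts = re.split(r"(?<=[.!?])\s*", text)
    return [p.strip() for p in parts if p.strip()]


def preprocess(text):
    return segment(normalize_punctuation(denoise(text)))


def align(source_text, target_text):
    """Pair sentences by position and attach an alignment mark to each pair."""
    src = preprocess(source_text)
    tgt = preprocess(target_text)
    if len(src) != len(tgt):
        raise ValueError("sentence counts differ; positional alignment fails")
    return [(f"s{i}", s, t) for i, (s, t) in enumerate(zip(src, tgt), start=1)]


aligned = align("<p>Hello world.  How are you?</p>", "你好，世界。你好吗？")
```

Even this toy version shows why the manual effort is high: real texts rarely split into the same number of sentences on both sides, so the positional pairing used here would have to be replaced by human checking or a statistical aligner.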
Existing technologies also adopt an edit distance method to expand a corpus through operations such as deleting, shifting, and inserting, but the process is cumbersome in practice.
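For reference, the distance underlying such methods can be computed as below. This sketch uses the standard Levenshtein formulation (insertion, deletion, and substitution, each with cost 1), which is an assumption: the text does not specify the exact operation set or costs, and the "shifting" operation it mentions may differ from substitution. The function name `edit_distance` is hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance between two sequences (strings or word lists).

    Counts the minimum number of single-element insertions, deletions,
    and substitutions needed to turn `a` into `b`.
    """
    # prev[j] holds the distance between the first i-1 items of a
    # and the first j items of b.
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            curr.append(
                min(
                    prev[j] + 1,             # delete x
                    curr[j - 1] + 1,         # insert y
                    prev[j - 1] + (x != y),  # substitute x with y (free if equal)
                )
            )
        prev = curr
    return prev[-1]


# Works on characters or on word lists alike.
char_distance = edit_distance("kitten", "sitting")
word_distance = edit_distance(
    ["the", "cat", "sat"], ["the", "dog", "sat", "down"]
)
```

Applied at the word level, a small distance between a new sentence and an existing corpus entry indicates that the new sentence can be derived from the entry with only a few operations.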
Languages are rich and varied in their modes of expression, and a sentence can be formed merely by combining a few words arbitrarily. If every corpus entry had to be acquired and entered one by one, a great deal of effort would be required and omissions would easily occur. Furthermore, establishing a parallel corpus at present requires human participation, and a large number of industry professionals and translators must spend a great deal of time and energy on it.