Current automatic translation systems can be divided into at least two modes of operation.
In a first mode, the translation systems rely on a morphosyntactic analysis of the sentences to be translated, followed by a transfer step and the generation of the translated sentences. The only semantics used generally relate to the usage restrictions incorporated in the bilingual dictionaries stored in databases. To obtain an understandable translation in a given field, time must be spent adding the specialist vocabulary and the corresponding restrictions of meaning to the dictionaries. The resulting translations make the texts broadly comprehensible but cannot be used without extensive editing. Furthermore, these systems take little account of the language usage that makes one sentence acceptable whereas another, although syntactically and semantically correct, is not.
In a second mode, the systems use translation memories, which translate a sentence by relying on its resemblance to an already translated sentence. This requires numerous translated texts whose sentences, translations of one another, are aligned using an alignment algorithm. Monolingual information retrieval techniques, in the source language, are then used to find the stored sentence that is closest to the sentence to be translated. The translation is then obtained through the previously established alignment of the sentences in the already translated texts. These systems are notably used to translate technical product documentation, since the texts vary little from one version to the next.
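The lookup step of a translation memory can be illustrated with a minimal sketch. It assumes that the sentence pairs have already been aligned, and it uses word-level similarity from Python's standard `difflib` module as a stand-in for the monolingual search technique, which the text does not specify. The memory contents, the function name `closest_translation` and the `threshold` value are hypothetical and chosen for illustration only.

```python
from difflib import SequenceMatcher

# Hypothetical toy memory of already aligned (source, target) sentence pairs.
TRANSLATION_MEMORY = [
    ("Press the power button to start the device.",
     "Appuyez sur le bouton d'alimentation pour démarrer l'appareil."),
    ("Remove the battery before cleaning.",
     "Retirez la batterie avant le nettoyage."),
]


def closest_translation(sentence, memory, threshold=0.7):
    """Return (source, target, score) for the closest stored source sentence,
    or None when no stored sentence reaches the similarity threshold."""
    query_tokens = sentence.lower().split()
    best_score = 0.0
    best_pair = None
    for source, target in memory:
        # Word-level similarity between the query and the stored source sentence.
        score = SequenceMatcher(None, query_tokens, source.lower().split()).ratio()
        if score > best_score:
            best_score, best_pair = score, (source, target)
    if best_pair is None or best_score < threshold:
        return None
    return best_pair[0], best_pair[1], best_score


match = closest_translation("Press the power button to stop the device.",
                            TRANSLATION_MEMORY)
no_match = closest_translation("The weather is nice today.", TRANSLATION_MEMORY)
```

In this sketch the first query differs from a stored sentence by a single word, so the stored translation is proposed for editing, whereas the second query has no sufficiently close neighbor and returns nothing. This mirrors why such systems suit documents that change little between versions.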
Moreover, building on technologies used for speech recognition, an automatic translation approach inspired by translation memories has been developed. It uses statistical methods over successions of n words, normally three, to exploit the translated texts and calculate the probability of their translations being found in the other half of the bilingual text. These techniques have proved better than conventional automatic translation systems. However, while sufficient corpora, including bilingual ones, exist in the field of news for example, in particular between English and the most economically important languages, in all other fields these techniques lack sufficient data to be operational. Furthermore, because they rely on N-grams, that is, successions of N words used to build language models, they do not exploit all of the knowledge contained in the rare bilingual texts that exist. The succession of 2, 3 or N words must be identical to what is to be translated, which in practice is highly restrictive. For example, having translated “il mange souvent du chocolat” from French to English as “he often eats chocolate” does not make it possible to translate “il mange du chocolat”. Similarly, the words must be strictly identical. Thus, “Le gâteau est bon”, which is translated as “The cake is good”, does not make it possible to translate “Les gâteaux sont bons”. This is notably because such systems are purely statistical and generally perform no linguistic processing.
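The restrictiveness of exact N-gram matching can be checked on the two examples above with a short sketch. It assumes plain whitespace tokenization with lowercasing and no linguistic processing; the helper names `ngrams` and `coverage` are hypothetical and do not correspond to any particular statistical translation system.

```python
def ngrams(sentence, n):
    """Return the successions of n consecutive words in a sentence."""
    tokens = sentence.lower().split()
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def coverage(new_sentence, seen_sentences, n):
    """Fraction of the n-grams of new_sentence already observed in seen_sentences."""
    seen = set()
    for sentence in seen_sentences:
        seen.update(ngrams(sentence, n))
    candidates = ngrams(new_sentence, n)
    if not candidates:
        return 0.0
    return sum(gram in seen for gram in candidates) / len(candidates)


# Strict succession: dropping "souvent" leaves no trigram in common.
trigram_cover = coverage("il mange du chocolat",
                         ["il mange souvent du chocolat"], 3)

# Strict identity: the plural forms share no word with the singular sentence.
word_cover = coverage("Les gâteaux sont bons", ["Le gâteau est bon"], 1)
```

Both values are 0.0 here: neither the shortened sentence nor the plural sentence shares any unit with the text already seen, even though a human reader sees them as near-identical. A system with even minimal linguistic processing, such as lemmatization, would relate “gâteaux” to “gâteau”; that step is deliberately absent from this sketch.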
Moreover, another major drawback stems from the large volume of data required to learn the translation. In practice, the quantity of available bilingual texts is far smaller than that of the texts available in the target language alone, that is to say, in the language into which the text is to be translated. This is all the more significant when translating, for example, from a rare language to a common language such as French or English, a case where bilingual texts are rare or even nonexistent on computer media.