1. Field
Embodiments of the invention generally relate to the field of automated translation of natural-language texts using linguistic descriptions and various applications in such areas as automated abstracting, machine translation, natural language processing, control systems, information search (including on the Internet), semantic Web, computer-aided learning, expert systems, speech recognition/synthesis and others.
2. Related Art
The ability to understand, speak and write one or more languages is an integral part of human development, enabling people to interact and communicate within a society. Various language analysis/synthesis approaches have been used to dissect a given language and analyze its linguistic structure in order to understand the meaning of a word or a sentence in that language, extract information from the word or sentence, and, if necessary, translate it into another language or synthesize another sentence that expresses the same semantic meaning in some natural or artificial language.
Prior machine translation (MT) systems differ in the approaches and methods that they use and also in their abilities to recognize various complex language constructs and produce quality translation of texts from one language into another. According to their core principles, these systems can be divided into the following groups.
One of the traditional approaches is based on translation rules or transformation rules and is called Rule-Based MT (RBMT). This approach, however, is rather limited when it comes to working with complex language phenomena. In recent years, no significant breakthroughs have been achieved within this field. The best known systems of this type are SYSTRAN (SYSTRAN S. A., Paris, France), PROMT (PROMT OOO, Sankt Petersburg, Russian Federation) and ETAP-3 (Institute For Information Transmission Problems, Moscow, Russian Federation). The known RBMT systems, however, usually possess restricted syntactic models and simplified dictionary descriptions in which language ambiguities are artificially removed.
The rule-based concept has evolved into Model-Based MT (MBMT), which is based on linguistic models. Implementing an MBMT system that produces quality translation demands considerable effort to create linguistic models and corresponding descriptions for specific languages. The evolution of MBMT systems is connected with developing complex language models on all levels of language description. Today's world requires translation between many different languages. Creating such MBMT systems is only possible within a large-scale project that integrates the results of engineering and linguistic research.
Another traditional approach is Knowledge-Based MT (KBMT), which uses semantic descriptions. While the MBMT approach is based on knowledge about a language, the KBMT approach considers translation as a process of understanding based on real knowledge about the world. Presently, interest in KBMT has been waning.
Example-Based MT (EBMT) relates to machine translation systems that use automated analysis of “examples”, and is very similar to Statistics-Based MT (SBMT). The best known systems of this type are Google-translator (Google, Inc., Mountain View, Calif., USA), as well as translation engines with language-specific rule-based elements, such as Microsoft Bing Translator (Microsoft, Inc., Redmond, Wash., USA) and Yahoo Babelfish (Yahoo! Inc., Sunnyvale, Calif., USA). In recent years, the SBMT approach has received a strong impetus from two factors: the appearance of Translation Memory (TM) systems, and the availability of powerful and relatively affordable bilingual electronic resources, such as TM databases created by corporations and translation agencies, electronic libraries, and specialized Internet corpora. The TM systems have demonstrated their practical efficiency when translating recurrent text fragments on the basis of minimal knowledge about languages, which has encouraged researchers and developers to try to create advanced and relatively exhaustive SBMT and Hybrid-Based MT (HBMT) systems.
Most machine translation systems, both rule-based and statistics-based, concentrate on proper transfer of language information directly between a source sentence and an output sentence and usually do not require any full-fledged intermediary data structures to explicate the meaning of the sentence being translated. For example, a system based on linguistic models would know how to build thousands of syntactic variants of verb-phrase constituents. A system based on a purely statistical approach would not know anything about the connections between these variants and would not be able to obtain a correct translation of one phrase on the basis of another. In addition, the most widely used probabilistic (statistical) approaches and statistics-based systems share a common drawback: they take no account of semantics. As a result, there is no guarantee that the translated (or generated) sentence has the same meaning as the original sentence.
Thus, even though some linguistic approaches have been proposed, most of them have not resulted in any useful algorithms or industrial applications because of poor performance in translating complete sentences. Complex sentences, which may express different shades of meaning or the author's attitude, may differ in style or genre, or may be very long and contain various punctuation marks and other special symbols, have not been successfully generated or translated by prior art systems, language generation programs, or machine translation systems. It is especially difficult to translate or generate complex sentences such as those found in technical texts, documentation, internet articles, journals, and the like, and this has yet to be accomplished.
Accordingly, there are many ways to improve the methods and systems for translating natural language sentences between languages.