One of the primary goals of machine translation is to provide a high-quality result. Different technologies use various methods and approaches. For example, deep analysis of a source sentence based on exhaustive linguistic descriptions of the language of the source sentence, and an independent semantic structure that reveals the meaning of the source sentence, can each be used to translate the source sentence into a target language. These also can be used to synthesize a translation that conveys the meaning of the source sentence in the target language using universal semantic descriptions and linguistic descriptions of the target language. The analysis and/or synthesis may make use of various types of statistics and ratings that are produced during analysis of parallel corpora to improve the accuracy of the translation. For example, the Natural Language Compiler (NLC) system uses linguistic descriptions (including, but not limited to, morphological, syntactic, semantic, and pragmatic descriptions) in constructing an independent syntactic structure for a sentence and in synthesizing the corresponding sentence in the language into which it is translated. The NLC system is based on technology for universal language modeling and production of language structures (a model-based machine translation system), and is described in U.S. patent application Ser. No. 11/548,214, which was filed Oct. 10, 2006, and also in U.S. patent application Ser. Nos. 11/690,099, 11/690,102, and 11/690,104, each filed Mar. 22, 2007. In general, the more precise the lexical description of the language is, the higher the probability of obtaining an accurate translation.
The ambiguity of natural language, homonymy, and the presence of multiple lexical variants and differing syntactic models can give rise to a large number of possible parsing and translation variants of a source sentence during a machine translation process. A person can intuitively handle this type of issue using knowledge of the world and the context of the source sentence; however, this issue can be problematic for a machine translation system. To achieve better results for a machine translation (e.g., as close as possible to a human translation), a number of ratings of links found during a system training stage are used by the NLC system for analyzing natural language. Many different types of ratings can be used, such as ratings of lexical meanings (frequency and probability of use) of words, ratings of additional surface slots and deep slots, ratings of a semantic description, and so forth. As such, the quality of the translation depends on the accuracy of the selected ratings, as the NLC system selects the best syntactic structure for the source sentence based on the ratings. The general syntactic structure of the sentence is a structure that describes the construction of the sentence, and indicates the main words, modifiers, and links therebetween.
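By way of illustration only, the rating-based selection of a syntactic structure described above can be sketched in Python as follows. The sketch is not the NLC implementation. The class names, the slot labels, the numeric rating values, and in particular the choice to combine ratings by simple summation are all hypothetical assumptions made for the example; the manner in which ratings are actually combined is not specified here.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Link:
    """A link between a main word and a modifier (slot label is hypothetical)."""
    head: str
    modifier: str
    slot: str


@dataclass
class CandidateStructure:
    """One possible syntactic structure for a source sentence."""
    name: str
    lexical_meanings: List[Tuple[str, str]]  # (word, chosen lexical meaning)
    links: List[Link]


def score(candidate: CandidateStructure,
          lexical_ratings: Dict[Tuple[str, str], float],
          link_ratings: Dict[Tuple[str, str, str], float],
          default: float = 0.0) -> float:
    """Sum the ratings of the lexical meanings and links (assumed combination rule)."""
    total = 0.0
    for word_meaning in candidate.lexical_meanings:
        total += lexical_ratings.get(word_meaning, default)
    for link in candidate.links:
        total += link_ratings.get((link.head, link.modifier, link.slot), default)
    return total


def select_best(candidates: List[CandidateStructure],
                lexical_ratings: Dict[Tuple[str, str], float],
                link_ratings: Dict[Tuple[str, str, str], float]) -> CandidateStructure:
    """Return the candidate structure with the highest total rating."""
    return max(candidates, key=lambda c: score(c, lexical_ratings, link_ratings))


# Made-up ratings for the ambiguous sentence "I saw her duck".
lexical_ratings = {
    ("duck", "duck:BIRD"): 0.7,
    ("duck", "duck:TO_LOWER"): 0.3,
}
link_ratings = {
    ("saw", "duck", "Object"): 0.6,
    ("duck", "her", "Possessor"): 0.5,
    ("saw", "her", "Object"): 0.4,
    ("duck", "her", "Subject"): 0.2,
}

noun_reading = CandidateStructure(
    name="noun_reading",
    lexical_meanings=[("duck", "duck:BIRD")],
    links=[Link("saw", "duck", "Object"), Link("duck", "her", "Possessor")],
)
verb_reading = CandidateStructure(
    name="verb_reading",
    lexical_meanings=[("duck", "duck:TO_LOWER")],
    links=[Link("saw", "her", "Object"), Link("duck", "her", "Subject")],
)

best = select_best([noun_reading, verb_reading], lexical_ratings, link_ratings)
print(best.name)
```

In this sketch, changing any single rating can change which structure is selected, which illustrates why the quality of the translation depends on the accuracy of the ratings.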
The selection of ratings is currently done manually using parallel analysis of tagged text corpora. A manually-tagged corpus is a corpus in which a set of grammatical values is assigned to each word, and where each sentence has a single syntactic structure. A grammatical value is an element of a grammatical category, which is a closed system that partitions a comprehensive set of word forms into non-intersecting classes, the substantive differences between which are expressed by the grammatical value. Examples of grammatical categories include gender, number, case, animacy, transitivity, and so forth.
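For illustration, one possible in-memory representation of an entry in such a tagged corpus is sketched below in Python. The sketch is an assumption made for the example and does not reflect any particular corpus format: the names `TaggedWord`, `TaggedSentence`, and `validate`, the listed categories and values, and the encoding of the syntactic structure as pairs of word indices are all hypothetical. Storing the grammatical values as a mapping from category to value means that a word form holds at most one value per category, which mirrors the partition into non-intersecting classes.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

# Illustrative closed categories; the values of each category are mutually exclusive.
GRAMMATICAL_CATEGORIES = {
    "number": {"singular", "plural"},
    "case": {"nominative", "accusative", "genitive"},
    "gender": {"masculine", "feminine", "neuter"},
}


@dataclass
class TaggedWord:
    form: str
    values: Dict[str, str]  # grammatical category -> grammatical value


@dataclass
class TaggedSentence:
    words: List[TaggedWord]
    # The single syntactic structure, encoded as (head_index, modifier_index) links.
    structure: List[Tuple[int, int]]


def validate(sentence: TaggedSentence) -> List[str]:
    """Return a list of problems found in one tagged sentence (empty if none)."""
    errors = []
    for word in sentence.words:
        for category, value in word.values.items():
            allowed = GRAMMATICAL_CATEGORIES.get(category)
            if allowed is None:
                errors.append(f"{word.form}: unknown category {category!r}")
            elif value not in allowed:
                errors.append(f"{word.form}: {value!r} is not a value of {category!r}")
    count = len(sentence.words)
    for head, modifier in sentence.structure:
        if not (0 <= head < count and 0 <= modifier < count):
            errors.append(f"link ({head}, {modifier}) refers to a missing word")
    return errors


good = TaggedSentence(
    words=[TaggedWord("Dogs", {"number": "plural"}),
           TaggedWord("bark", {"number": "plural"})],
    structure=[(1, 0)],  # "bark" is the main word, "Dogs" its modifier
)
bad = TaggedSentence(
    words=[TaggedWord("Dogs", {"number": "dual"})],
    structure=[(0, 3)],
)

print(validate(good))
print(validate(bad))
```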