Grammatical elements include function words, which are words that carry little or no lexical meaning of their own in a text fragment. Japanese case markers are one example of such function words: they indicate the grammatical relation (such as subject, object, or location) of a complement noun phrase to the predicate. Other grammatical elements include inflections, such as those that indicate number, tense, or gender. For instance, the ending of the verb “come” changes with the person and number of its subject (e.g., “I come” versus “she comes”).
Generation of grammatical elements using natural language processing technology has become increasingly important, particularly in the context of machine translation. In an English-to-Japanese machine translation system, for example, Japanese case markers are among the most difficult elements to generate appropriately. This is because the case markers often do not correspond to any word in the source language (here, English), since English expresses many grammatical relations through word order.
Generating Japanese case markers is also difficult because the mapping between the case markers and the grammatical relations they express is very complex. For the same reasons, case markers are challenging for foreign language learners to produce.
Machine translation is the process by which a computer receives a text fragment in a source language and translates it into a corresponding text fragment in a target language. Generation of grammatical elements has become an important component of this process.
Statistical machine translation systems, however, have not yet successfully incorporated components that generate grammatical elements in the target language. State-of-the-art statistical machine translation systems treat grammatical elements in exactly the same way as content words, and thus rely on phrasal translations and target language models to generate them. However, because grammatical elements in the target language often involve long-range dependencies, have no corresponding word in the source language, or both, the output of a statistical machine translation system is often not grammatically correct.
For example, Table 1 below shows an output from an English-to-Japanese statistical machine translation system on a sentence from a computer domain. The source sentence is labeled “S” and reads “The patch replaces the .dll file.” The output is labeled “O” and includes three lines. The first line shows the Japanese characters, the second line is the phonetic spelling of the Japanese characters using the English alphabet, and the third line is the English translation. The correct translation is labeled “C” and includes the same three lines.
The conventional statistical machine translation system, trained on this domain, produces a natural lexical translation for the English word “patch” as “correction program”, and translates “replace” into passive voice, which is more appropriate in Japanese. However, as can be seen from Table 1, the case marker assignment is problematic. The accusative marker “wo”, which was output by the machine translation system, is completely inappropriate when the main verb is passive.
TABLE 1
S: The patch replaces the dll file.
O: shuusei puroguramu-wo dll fairu-ga okikae-raremasu
   correction program-ACC dll file-NOM replace-PASS
C: shuusei puroguramu-de .dll fairu-ga okikae-raremasu
   correction program-with dll file-NOM replace-PASS
This example illustrates only some of the difficulties in predicting Japanese case markers. Similar problems exist in generating other grammatical elements in machine translation.
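The case marker conflict described for Table 1 can be made concrete with a small sketch. The code below is a toy illustration only. It assumes the gloss notation used in Table 1 (tags such as ACC, NOM, and PASS attached after a hyphen), and the function names `parse_gloss` and `has_case_conflict` are hypothetical. The single rule it applies, that an accusative marker should not co-occur with a passive main verb, is a deliberate simplification of the observation made above and is not a case marker generation component.

```python
# Toy illustration only: the tag names come from the glosses in Table 1,
# and the single rule below is a simplifying assumption, not a real
# case-marker generation method.

def parse_gloss(gloss):
    """Return the set of grammatical tags that follow a hyphen in a gloss line."""
    tags = set()
    for token in gloss.split():
        if "-" in token:
            # Keep only the part after the last hyphen, e.g. "program-ACC" -> "ACC".
            tags.add(token.rsplit("-", 1)[1])
    return tags


def has_case_conflict(gloss):
    """Flag a gloss in which an accusative marker co-occurs with a passive verb."""
    tags = parse_gloss(gloss)
    return "ACC" in tags and "PASS" in tags


# Gloss lines copied from the "O" (system output) and "C" (correct) rows of Table 1.
output_gloss = "correction program-ACC dll file-NOM replace-PASS"
correct_gloss = "correction program-with dll file-NOM replace-PASS"

print(has_case_conflict(output_gloss))   # True
print(has_case_conflict(correct_gloss))  # False
```

A surface check of this kind only detects the one conflict shown in Table 1. It does not choose the correct marker, which is the harder problem the discussion above describes.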
The discussion above is merely provided for general background information and is not intended to be used as an aid in determining the scope of the claimed subject matter.