1. Field of the Art
The present invention relates generally to machine translation, and more particularly to capitalizing machine translated text.
2. Description of Related Art
Capitalization is the process of recovering case information for texts in lowercase. Generally, capitalization improves the legibility of texts but does not affect the word choice or order. In natural language processing, a good capitalization model has been shown to be useful for named entity recognition, automatic content extraction, speech recognition, modern word processors, and automatic translation systems (sometimes referred to as machine translation systems or MT systems). Capitalizing the output of an automatic translation system makes the automatically translated text in a target language easier to comprehend.
Capitalization of automatically translated text may be characterized as a sequence labeling process. The input to such a labeling process is a lowercase sentence, and the output is a capitalization tag sequence with one tag per word. Unfortunately, associating capitalization tags with lowercase words can result in capitalization ambiguities (i.e., each lowercase word can have more than one tag).
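By way of illustration only, the sequence labeling view can be sketched in Python as follows. The tag names (LOWER, INIT_UPPER, ALL_UPPER, MIXED) and the helper functions are hypothetical and are chosen for this sketch; they are not a tag set defined by the description above.

```python
from typing import List

# Hypothetical tag set, assumed for illustration only.
LOWER, INIT_UPPER, ALL_UPPER, MIXED = "LOWER", "INIT_UPPER", "ALL_UPPER", "MIXED"


def tag_of(word: str) -> str:
    """Return the capitalization tag that describes a cased word."""
    if word == word.lower():
        return LOWER
    if len(word) > 1 and word == word.upper():
        return ALL_UPPER
    if word[0].isupper() and word[1:] == word[1:].lower():
        return INIT_UPPER
    return MIXED


def apply_tag(word: str, tag: str) -> str:
    """Apply a capitalization tag to a lowercase word."""
    if tag == INIT_UPPER:
        return word[:1].upper() + word[1:]
    if tag == ALL_UPPER:
        return word.upper()
    # LOWER leaves the word unchanged; a MIXED form cannot be
    # reconstructed from the tag alone, so the word is left as-is.
    return word


def apply_tags(words: List[str], tags: List[str]) -> List[str]:
    """Input: a lowercase sentence. Output: the sentence with its tags applied."""
    return [apply_tag(w, t) for w, t in zip(words, tags)]
```

In this sketch the ambiguity is visible directly: the lowercase word "apple" may plausibly carry either LOWER (the fruit) or INIT_UPPER (the company), and nothing in the word itself selects between the two tags.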
One solution for resolving capitalization ambiguities in automatically translated text is a 1-gram tagger model, where the case of a word is estimated from a target language corpus with case information. Other solutions for capitalizing automatically translated text treat capitalization as a lexical ambiguity resolution problem. Still other solutions for resolving capitalization ambiguities include applying a maximum entropy Markov model (MEMM) and/or combining features of words, cases, and context (i.e., tag transitions) of the target language.
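By way of illustration only, a 1-gram tagger of the kind mentioned above can be sketched in Python as follows. The function names and the toy corpus are hypothetical. As a simplifying assumption, the sketch stores the most frequent cased surface form of each word rather than an abstract tag.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable


def train_unigram(cased_sentences: Iterable[str]) -> Dict[str, str]:
    """Map each lowercase word to its most frequent cased form in the corpus."""
    counts = defaultdict(Counter)
    for sentence in cased_sentences:
        for token in sentence.split():
            counts[token.lower()][token] += 1
    return {lw: forms.most_common(1)[0][0] for lw, forms in counts.items()}


def capitalize(lowercase_sentence: str, model: Dict[str, str]) -> str:
    """Restore case word by word; unseen words are left in lowercase."""
    return " ".join(model.get(w, w) for w in lowercase_sentence.split())


# Toy target-language corpus with case information (assumed for illustration).
corpus = ["I visited Paris", "Paris is in France", "We like paris"]
model = train_unigram(corpus)
restored = capitalize("i visited paris in france", model)
```

Because each word is decided in isolation, a sketch of this kind ignores context entirely, and sentence-initial capitals in the training corpus are counted the same as any other occurrence.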
These solutions are monolingual because they are estimated only from the target (monolingual) text. Unfortunately, such monolingual solutions may not always perform well on badly translated text and/or source text that includes capitalization based on special use.