The present invention relates to automatic text correction. In particular, the present invention relates to automatic capitalization.
Text generated from user input often includes capitalization errors. This is especially common in text generated by speech recognition systems. Although such recognition systems typically include simple rules for capitalizing the first word of each sentence and a small set of known names, they consistently fail to capitalize many other words in the text. As a result, the capitalization error rate for speech recognition systems is around 5%, which represents a significant share of the errors in the text that such systems produce.
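By way of illustration only, the simple rules mentioned above (capitalizing the first word of each sentence and a small set of known names) might be sketched as follows. The name list, the function name, and the whitespace-based tokenization are assumptions made for this example and are not taken from any particular recognition system.

```python
# Hypothetical, deliberately small list of known names (an assumption for this sketch).
KNOWN_NAMES = {"seattle": "Seattle", "microsoft": "Microsoft"}

SENTENCE_END = (".", "!", "?")


def baseline_capitalize(text):
    """Capitalize sentence-initial words and words found in KNOWN_NAMES."""
    out = []
    sentence_start = True
    for token in text.split():
        # Separate trailing punctuation so the lookup sees only the word itself.
        core = token.rstrip(".!?,;:")
        suffix = token[len(core):]
        lower = core.lower()
        if lower in KNOWN_NAMES:
            word = KNOWN_NAMES[lower]
        elif sentence_start and core:
            word = core[0].upper() + core[1:]
        else:
            word = core
        out.append(word + suffix)
        # The next token starts a sentence if this one ends with terminal punctuation.
        sentence_start = token.endswith(SENTENCE_END)
    return " ".join(out)
```

In this sketch, a word such as "paris" that is neither sentence-initial nor in the name list is left in lower case, which is the kind of missed capitalization described above.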
Automatic capitalization systems have been developed in the past. However, these past systems have been less than ideal.
Under one such system, the capitalization rules are developed based on a large corpus of documents. A large corpus is used because it is thought to provide better coverage of possible capitalization forms and thus to yield a more accurate capitalization system.
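One way to picture corpus-derived capitalization rules is as a table that maps each word to the form in which it most often appears in the corpus. The sketch below is a simplified assumption made for illustration, not the method of any particular prior system; the function name and the punctuation handling are hypothetical.

```python
from collections import Counter, defaultdict


def build_capitalization_table(documents):
    """Map each lowercased word to its most frequent surface form in the corpus."""
    counts = defaultdict(Counter)
    for doc in documents:
        for token in doc.split():
            # Strip surrounding punctuation; this is a crude tokenizer for the sketch.
            word = token.strip(".,;:!?\"'()")
            if word:
                counts[word.lower()][word] += 1
    # One entry per distinct word, so the table grows with the corpus vocabulary.
    return {lower: forms.most_common(1)[0][0] for lower, forms in counts.items()}
```

Because the table holds one entry for every distinct word observed, its size tracks the vocabulary of the corpus from which it is built.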
However, such systems have numerous deficiencies. First, because a large corpus is used, the set of capitalization rules itself becomes very large, making it inefficient to search through the rules for each word in the text. In addition, because the rules are derived from a large corpus, they are typically derived once, before the model is shipped, and are not updated afterward. As a result, the model does not adapt to new capitalization forms. Furthermore, a particular user may capitalize words differently from the unknown authors of the documents in the large corpus. As a result, the model may not behave in the way the user expects.
In other systems, a list of acceptable capitalizations is generated by a linguistic expert. While this list is more compact than the rules derived from a large corpus, it is expensive to produce because it requires an expert's involvement.