Proper capitalization in text is a useful and often mandatory characteristic. Many text processing techniques rely on the text being properly capitalized, and many people can more easily read mixed-case text than monocase text (i.e., all lowercase or all uppercase). However, proper capitalization is often missing from many text sources, including automatic speech recognition output and closed captioned text. As may be appreciated, the value of these sources of text can be greatly enhanced when properly capitalized.
The presence of proper and correct capitalization is also becoming important due to the widespread use of Named Entity recognizers in various types of automatic document processing systems. Named Entity recognizers typically require proper capitalization in a document corpus for correct operation. However, some corpora, such as closed caption transcripts, are written in monocase.
Proper capitalization in text is often taken for granted. Most documents, such as newspaper articles, technical papers, and most web pages, are properly capitalized. Capitalization makes text easier to read and provides useful clues about the semantics of the text. Many text analysis systems exploit these semantic clues to perform various text processing tasks, such as indexing, parsing, sentence boundary disambiguation, extraction of named entities (e.g., people, places, and organizations), and identification of relationships between named entities.
Several text sources that lack proper capitalization have come into increasingly widespread use. Two of these sources are closed caption text from television broadcasts and the output from automatic speech recognition (ASR) systems. Closed caption text is an extremely valuable source of information about a television broadcast, essentially enabling the application of text analysis and indexing techniques to the audio/video television program. Closed caption text, however, is typically all upper case, which seriously impedes the effectiveness of many text analysis procedures. Moreover, all upper case text is more difficult to read when displayed on a computer monitor or television screen, or when printed on paper.
Automatic speech recognition has matured to the point where researchers and developers are applying ASR technology in a wide variety of applications, including general video indexing and analysis, broadcast news analysis, topic detection and tracking, and meeting capture and analysis. Although dictation systems built with ASR provide limited capitalization based on dictated punctuation and a lexicon of proper names, the more interesting application of ASR is in the area of speaker-independent continuous dictation, which can be used to create a text transcript from any audio speech source. Systems that support this task typically provide a SNOR (Speech Normalized Orthographic Representation) output, which is in an all upper case format.
The ability to recover capitalization in case-deficient text, therefore, is quite valuable and worthy of investigation. Restoring proper capitalization to closed caption text and ASR output not only improves its readability, it also enables the text processing tasks mentioned previously. Even in those domains where capitalization is normally given, a system that recovers proper capitalization can be used to validate that the correct case has been used. Although capitalization rules exist, most are in fact merely conventions.
The recovery of capitalization from a source text has rarely been considered as a topic in its own right. It is briefly discussed by Shahraray and Gibbon, “Automated Authoring of Hypermedia Documents of Video Programs,” Proc. of the Third ACM International Conf. on Multimedia, San Francisco, 1995, who describe a system that automatically summarizes video programs into hypermedia documents. Their approach relies on the closed caption text from the video, which must be properly capitalized. They describe a series of text processing steps based on J. Bachenko, J. Daugherty, and E. Fitzpatrick, “A Parser for Real-Time Speech Synthesis of Conversational Texts,” Proc. of the Third ACL Conf. on Applied Natural Language Processing, pp. 25-32, Trento, Italy, 1992. These steps include rules for capitalizing the start of sentences and abbreviations, a list of words that are always capitalized, and a statistical analysis based on training data for deciding how the rest of the words should be capitalized.
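By way of illustration only, the kind of processing just described (a sentence-start rule, a list of always-capitalized words, and a statistical decision based on training data) might be sketched as follows. This is not the cited authors' implementation: the function names, the word list, and the simple most-frequent-form statistic are all assumptions made for the example, and the abbreviation rule is omitted for brevity.

```python
from collections import Counter, defaultdict

# Hypothetical list of words that are always capitalized (illustrative only).
ALWAYS_CAPITALIZED = {"monday": "Monday", "america": "America", "i": "I"}


def train_case_counts(cased_sentences):
    """Count the cased forms seen for each word in properly cased text.

    Sentence-initial words are skipped because their capitalization
    says nothing about how the word is normally written.
    """
    counts = defaultdict(Counter)
    for sentence in cased_sentences:
        for word in sentence.split()[1:]:
            token = word.strip(".,!?")
            if token:
                counts[token.lower()][token] += 1
    return counts


def recapitalize(text, counts):
    """Restore case in monocase text with three simple kinds of rules."""
    out = []
    start_of_sentence = True
    for word in text.lower().split():
        core = word.rstrip(".,!?")
        trailing = word[len(core):]
        if core in ALWAYS_CAPITALIZED:
            # Rule: words on the fixed list are always capitalized.
            core_out = ALWAYS_CAPITALIZED[core]
        elif core in counts:
            # Statistic: use the cased form seen most often in training.
            core_out = counts[core].most_common(1)[0][0]
        else:
            core_out = core
        if start_of_sentence and core_out:
            # Rule: capitalize the first word of each sentence.
            core_out = core_out[0].upper() + core_out[1:]
        out.append(core_out + trailing)
        start_of_sentence = trailing.endswith((".", "!", "?"))
    return " ".join(out)


training = [
    "The president visited Boston on Monday.",
    "Reporters in Boston asked about IBM.",
]
counts = train_case_counts(training)
print(recapitalize("THE MAYOR OF BOSTON SPOKE. IBM DECLINED TO COMMENT ON MONDAY.", counts))
```

A practical system would need a far larger training corpus and a more careful treatment of punctuation, abbreviations, and ambiguous words than this sketch provides.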
In those applications where the proper case is normally expected but not available, a typical approach is to modify the program that relies on the existence of the proper case so that proper case is no longer required to complete the task. An example of such a task is Named Entity extraction on ASR output, a task that appears in DARPA-sponsored Broadcast News workshops. One system that has performed especially well under these circumstances is Nymble (also known as IdentiFinder). Reference in this regard can be made to D. Bikel, S. Miller, R. Schwartz, and R. Weischedel, “Nymble: a High-Performance Learning Name-finder,” Proc. of the Fifth ACL Conf. on Applied Natural Language Processing, Washington, D.C., 1997, and to F. Kubala, R. Schwartz, R. Stone, and R. Weischedel, “Named Entity Extraction from Speech,” Proc. of the 1998 DARPA Broadcast News Transcription and Understanding Workshop, 1998.
Nymble is based on a Hidden Markov Model, which must be trained with labeled text. When the training data is converted to monocase, Nymble performs nearly as well on monocase test data as in a mixed case scenario.
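For illustration, converting labeled training data to monocase while leaving the labels untouched might look like the following sketch. The (token, label) pair format, the label names, and the function name are assumptions made for this example and do not reflect Nymble's actual data format or interface.

```python
def to_monocase(labeled_tokens, upper=True):
    """Return a copy of the labeled data with every token in a single case.

    Only the token text changes; the labels are carried over unchanged,
    so the monocase copy can be used to retrain a model.
    """
    convert = str.upper if upper else str.lower
    return [(convert(token), label) for token, label in labeled_tokens]


# Hypothetical labeled sentence; the label set is illustrative only.
mixed_case = [
    ("Reporters", "O"),
    ("in", "O"),
    ("Boston", "LOCATION"),
    ("asked", "O"),
    ("about", "O"),
    ("IBM", "ORGANIZATION"),
]
print(to_monocase(mixed_case))
```

The cost of this approach is that a separate monocase copy of the training set, and a separately trained model, must be maintained alongside the mixed case ones.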
Problems with these conventional approaches to dealing with case-deficient text include the need either to modify applications to support case-deficient text or to provide alternate training sets for every capitalization situation. Both of these approaches are less than desirable.