A substantial volume of video and multimedia content is available both online, such as via the Internet, and offline, such as in libraries. In such content, it is common for a text caption box to be embedded in the video to provide further information about the video content. For example, as illustrated in FIG. 10, a video recording of a baseball game typically includes a caption box 1010 which displays game statistics such as the score, inning, ball/strike count, number of outs, etc. The detection and recognition of text captions embedded in the video frames can be an important component of video summarization, retrieval, storage and indexing. For example, by extracting a short video segment preceding certain changes in the text of the baseball caption box, such as a change in the score or number of outs, a “highlight” summary can be automatically generated.
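By way of illustration only, the highlight idea described above can be sketched as follows. The sketch assumes that caption text has already been recognized and is supplied as a list of (timestamp in seconds, caption string) pairs; the function name `highlight_segments` and the parameter `lead_seconds` are hypothetical and are not taken from the disclosure.

```python
from typing import List, Sequence, Tuple


def highlight_segments(
    captions: Sequence[Tuple[float, str]],
    lead_seconds: float = 10.0,
) -> List[Tuple[float, float]]:
    """Return (start, end) times of clips that precede each caption change.

    `captions` is assumed to be sorted by timestamp. `lead_seconds` is an
    illustrative clip length, not a value given in the source text.
    """
    segments: List[Tuple[float, float]] = []
    # Compare each caption reading with the one immediately before it.
    for (_, prev_text), (t, text) in zip(captions, captions[1:]):
        if text != prev_text:
            # Keep the video leading up to the change, clamped at time zero.
            start = max(0.0, t - lead_seconds)
            segments.append((start, t))
    return segments


if __name__ == "__main__":
    readings = [(0.0, "0-0"), (5.0, "0-0"), (12.0, "1-0"), (31.0, "1-1")]
    for start, end in highlight_segments(readings, lead_seconds=10.0):
        print(f"highlight from {start:.1f}s to {end:.1f}s")
```

In practice, one might compare only selected caption fields (for example, score or number of outs) rather than the whole string, so that routine changes such as the ball/strike count do not trigger a highlight.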
Text recognition in video has been the subject of recent research. For example, the article “Video OCR: Indexing Digital News Libraries by Recognition of Superimposed Captions,” by T. Sato, et al., Multimedia Systems, 7:385-394, 1999, discloses a system for detecting and recognizing text in news video. This system is described as using a line filter to enhance the text characters and a projection histogram to segment the characters. A dynamic programming algorithm is used to combine the segmentation and recognition processes in order to reduce false alarms in character segmentation.
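For illustration, a generic projection-histogram segmentation of the kind mentioned above might be sketched as follows. This is not the implementation of the cited article: it assumes an already binarized text line (1 for text pixels, 0 for background), omits the line filter and the dynamic programming step, and uses the hypothetical function name `segment_by_projection`.

```python
from typing import List, Optional, Sequence, Tuple


def segment_by_projection(
    binary: Sequence[Sequence[int]],
) -> List[Tuple[int, int]]:
    """Split a binarized text line into column ranges [start, end).

    Each returned range covers a run of columns containing at least one
    text pixel; empty columns are treated as gaps between characters.
    """
    if not binary:
        return []
    width = len(binary[0])
    # Vertical projection histogram: number of text pixels in each column.
    profile = [sum(row[x] for row in binary) for x in range(width)]

    segments: List[Tuple[int, int]] = []
    start: Optional[int] = None
    for x, count in enumerate(profile):
        if count > 0 and start is None:
            start = x  # a run of text columns begins
        elif count == 0 and start is not None:
            segments.append((start, x))  # the run ended at the previous column
            start = None
    if start is not None:
        segments.append((start, width))  # the run reaches the right edge
    return segments


if __name__ == "__main__":
    line = [
        [1, 1, 0, 0, 1, 0],
        [1, 0, 0, 1, 1, 0],
    ]
    print(segment_by_projection(line))
```

Because any non-empty column counts as text, a simple sketch such as this splits wherever a single blank column appears and merges characters that touch, which is one reason segmentation is commonly combined with recognition.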
Past approaches to text detection in video do not adequately account for disturbances in the background areas. As a result, these approaches are often sensitive to cluttered backgrounds, which diminish text recognition accuracy. Therefore, there remains a need for improved methods of extracting text embedded in video content. There also remains a need for improved methods of automatically generating video summaries using text extracted from the video content.