The present invention relates to the detection and interpretation of textual captions in video signals and, more particularly, to the identification and interpretation of text captions in digital video signals.
Text captions, or more simply, captions, are commonly used in domains such as television sports and news broadcasts to convey information that complements or explains the audio and video content being presented. For example, captions might present game scores and player names in sporting events, or places, situations, and dates in newscasts. FIG. 1 illustrates typical captions found in television broadcasts.
The "text captions" herein referred to should be distinguished from the "closed-captions" used, for example, in broadcast programs to assist the hearing impaired and other audiences interested in live textual transcripts of the words spoken in the sound channel of a broadcast. These captions are transmitted in a separate channel of the signal and are "overlaid" into the screen display of a television receiver by means of a special decoder. The captions herein addressed are textual descriptions embedded in the video signal itself.
It is herein recognized that an automatic caption detector can be used to address the following two principal problems in digital video management. The first problem relates to indexing--captions can be interpreted with an optical character recognition (OCR) system and used to generate indexes into the contents of the video, e.g., names of persons, places, dates, situations, etc. These indexes serve as references to specific videos or video segments from a large collection. For example, a person could request clips of a particular sports star, of appearances by a certain politician, or of a certain place. The second problem relates to segmentation--captions can be used to partition the video into meaningful segments based on the visual cues they offer to the viewer. In FIG. 2a, for example, captions were used to segment a video of an interview including a number of individuals. Using the browser, one can quickly identify the person of interest and begin playing the corresponding segment (FIG. 2b).
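By way of a hypothetical illustration only (the caption texts, times, and function names below are invented for exposition and form no part of the disclosed method), OCR output from detected captions might be turned into an inverted index from words to the video times at which they appear:

```python
from collections import defaultdict

# Invented sample OCR output: (time in seconds, recognized caption text).
ocr_captions = [
    (12.0, "John Smith, Mayor"),
    (75.5, "Downtown City Hall"),
    (140.0, "John Smith"),
]

def build_index(captions):
    """Map each word appearing in a caption to the times it was seen."""
    index = defaultdict(list)
    for time_sec, text in captions:
        for word in text.lower().replace(",", "").split():
            index[word].append(time_sec)
    return index

index = build_index(ocr_captions)
print(index["smith"])  # -> [12.0, 140.0]
```

A query for a person or place then reduces to a lookup in this index, returning entry points into the corresponding video segments.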
In both cases the availability of knowledge about specific domains increases the usefulness of automatic caption detection. For example, in basketball games captions are regularly used to report the score of the game after every field-goal or free-throw. The ability to identify and interpret these captions can then be used to generate triplets of the form (time, TeamAScore, TeamBScore), which could later be used to answer queries such as "show me all the segments of the game where TeamB was ahead by more than 10". Similarly, they can be used to create a score browser which would enable a person to move directly to specific portions of the video; see FIG. 2c.
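As a hedged sketch of such a query (the triplet values and function below are invented for illustration and are not part of the claimed method), the score triplets can be filtered directly:

```python
# Invented sample triplets: (time in seconds, TeamAScore, TeamBScore),
# assumed to have been produced by caption detection and OCR.
score_triplets = [
    (30, 2, 0),
    (95, 4, 7),
    (210, 10, 21),
    (340, 25, 30),
    (500, 40, 38),
]

def times_team_b_ahead_by(triplets, margin):
    """Return times at which TeamB leads TeamA by more than `margin` points."""
    return [t for (t, a, b) in triplets if b - a > margin]

print(times_team_b_ahead_by(score_triplets, 10))  # -> [210]
```

The returned times identify the game segments satisfying the query, which a browser could then present for direct playback.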
The present invention is intended to be implemented by programmable computing apparatus, preferably by a digital computer. Thus, operational steps herein referred to are generally intended to signify machine operations.
In the detailed description of the invention that follows, features of the invention will be disclosed for enabling the detection, interpretation and classification of textual captions. Definitions are introduced and an outline of the methodology provided, followed by a disclosure of a caption detection algorithm and its application in the context of the invention.
As hereinbefore discussed, captions in the present context are those textual descriptors overlaid on a video by its producer. More specifically, captions are considered to exhibit the following characteristics. Captions do not move from frame to frame, i.e., they remain in the exact same location in each frame regardless of what is happening in the rest of the scene. Captions remain on the screen for at least a minimum period of time, i.e., they will appear in a plurality of consecutive frames. This is important because it enables sampling of the video to detect captions, and because the redundancy can be used to improve the accuracy of the method.
Captions are intended to be read from a distance. Thus, there is a minimum character size, which can be utilized in making a determination as to whether a video segment contains text.
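The static-location and minimum-duration characteristics described above can be sketched as a simple stability test over sampled frames. This is an illustrative sketch only, not the disclosed detection algorithm; the threshold, region coordinates, and synthetic frames below are invented for exposition:

```python
import numpy as np

def region_is_static(frames, box, max_mean_diff=2.0):
    """Accept a candidate caption region only if its pixels stay nearly
    identical across all sampled frames.
    frames: list of 2-D grayscale arrays; box: (top, left, bottom, right)."""
    t, l, b, r = box
    ref = frames[0][t:b, l:r].astype(float)
    for f in frames[1:]:
        if np.abs(f[t:b, l:r].astype(float) - ref).mean() > max_mean_diff:
            return False
    return True

# Synthetic example: a fixed "caption" band over a changing background.
rng = np.random.default_rng(0)
frames = []
for _ in range(5):
    frame = rng.integers(0, 256, size=(120, 160)).astype(np.uint8)
    frame[100:112, 10:90] = 255  # static white caption band
    frames.append(frame)

print(region_is_static(frames, (100, 10, 112, 90)))  # caption band -> True
print(region_is_static(frames, (0, 10, 12, 90)))     # background   -> False
```

A moving street sign or a text region that appears in too few sampled frames would fail such a test, consistent with the characteristics set out above.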
It is herein recognized that non-caption text may appear in video. For example, street signs and advertisements will typically appear in outdoor scenes. Text is also often found in commercial broadcast material. In both cases, if one or more of the aforementioned characteristics is violated (e.g., the text in a street sign may move in a fast action shot), the text will not be detected in accordance with the present invention.