The present invention generally relates to an apparatus and methods for processing video, and specifically to the problem of caption detection in videos. FIG. 1 depicts captions, such as text or logos, which are superimposed on videos during the postproduction process and which generally provide information related to the broadcaster or the video content being provided. Examples of captions include scores of sporting events, text related to the audio of the video program, logos of the broadcaster, or the like.
Detecting captions is useful for a variety of applications, for example, enhancing the perceived quality of small-sized videos for mobile devices by highlighting caption areas, or extracting metadata from text areas for video indexing and search. Caption detection is a key step in systems for the above-mentioned applications.
For applications such as caption highlighting to enhance video quality and metadata extraction, the stability and consistency of caption detection are very important. If the detected caption boxes are not stable over time, the subsequent video enhancement component could generate temporal artifacts, such as flickering, due to inconsistent caption boxes for a caption area that stays on the screen for some time.
Previous methods performed caption detection in two steps, implementing a smoothing approach as shown in FIG. 2. The first step extracts visual features, such as color, motion, or texture, from images/videos and creates a binary map that identifies the pixels likely belonging to a caption area. The second step groups the identified pixels and generates the bounding boxes specifying the location and size of text areas. For the second step, these systems first generate 2D bounding boxes and then use a filtering process to smooth the detected 2D bounding boxes. However, this smoothing approach cannot completely eliminate the inconsistency of the caption detection results.
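The grouping part of the second step can be illustrated with a minimal sketch. It assumes the binary map is a list of rows of 0/1 values and that pixels are grouped by 4-connectivity; the function name `bounding_boxes` and the `(top, left, bottom, right)` box layout are hypothetical choices made for illustration and are not taken from the methods described above.

```python
from collections import deque


def bounding_boxes(binary_map):
    """Group 4-connected foreground pixels; return (top, left, bottom, right) boxes.

    Assumption: binary_map is a list of equal-length rows of 0/1 values.
    """
    rows = len(binary_map)
    cols = len(binary_map[0]) if rows else 0
    seen = [[False] * cols for _ in range(rows)]
    boxes = []
    for r in range(rows):
        for c in range(cols):
            if not binary_map[r][c] or seen[r][c]:
                continue
            # Breadth-first search over one connected group of pixels,
            # tracking the extreme coordinates reached.
            top, left, bottom, right = r, c, r, c
            queue = deque([(r, c)])
            seen[r][c] = True
            while queue:
                y, x = queue.popleft()
                top, bottom = min(top, y), max(bottom, y)
                left, right = min(left, x), max(right, x)
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if (0 <= ny < rows and 0 <= nx < cols
                            and binary_map[ny][nx] and not seen[ny][nx]):
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            boxes.append((top, left, bottom, right))
    return boxes


example_map = [
    [0, 1, 1, 0, 0],
    [0, 1, 1, 0, 0],
    [0, 0, 0, 0, 1],
    [0, 0, 0, 1, 1],
]
print(bounding_boxes(example_map))
```

Because each frame is processed independently in this sketch, a small change in the binary map from one frame to the next directly changes the resulting boxes, which is the source of the inconsistency discussed here.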
Another approach, as depicted in FIG. 2, teaches a first step of extracting visual features, such as color, motion, or texture, from images/videos and creating a binary map that identifies the pixels likely belonging to a caption area. A second step groups the identified pixels and generates the bounding boxes specifying the location and size of text areas. The detected bounding boxes are smoothed and stabilized over time under the assumption that captions usually stay on the screen for some time. To implement this second step, a temporal consistency check and smoothing are carried out to make the bounding boxes more temporally consistent. Although this approach alleviates the instability problem, it does not necessarily eliminate the inconsistency of caption detection. As a result, temporal jittering of the detected bounding boxes remains an undesirable result.
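The limitation of temporal smoothing can be seen in a small sketch. Exponential smoothing is used here only as an assumed stand-in for the unspecified filtering process; the function name `smooth_boxes`, the parameter `alpha`, and the sample coordinates are hypothetical.

```python
def smooth_boxes(boxes, alpha=0.5):
    """Exponentially smooth a per-frame sequence of (top, left, bottom, right) boxes.

    Assumption: one box per frame, all describing the same caption area.
    """
    smoothed = []
    prev = None
    for box in boxes:
        if prev is None:
            # The first frame has no history, so it is passed through unchanged.
            prev = tuple(float(v) for v in box)
        else:
            # Blend the new detection with the previous smoothed box.
            prev = tuple(alpha * v + (1 - alpha) * p for v, p in zip(box, prev))
        smoothed.append(prev)
    return smoothed


# A static caption whose detected top edge jitters by two pixels in frame 2.
detections = [(10, 20, 30, 100), (12, 20, 30, 100), (10, 20, 30, 100)]
tops = [box[0] for box in smooth_boxes(detections)]
print(tops)  # [10.0, 11.0, 10.5]
```

In this example the two-pixel jump is reduced, but the smoothed top edge still moves from frame to frame, so the jitter is attenuated rather than removed.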
It would be desirable to overcome the above-listed problems and make the results of caption detection stable and consistent over time. The stability and consistency of caption detection over time are important for several related applications, such as video quality improvement, because unstable detection results could produce visible temporal artifacts, such as flickering and/or jittering.
This section is intended to introduce the reader to various aspects of art, which may be related to various aspects of the present invention that are described below. This discussion is believed to be helpful in providing the reader with background information to facilitate a better understanding of the various aspects of the present invention. Accordingly, it should be understood that these statements are to be read in this light, and not as admissions of prior art.