The subject matter discussed in the background section should not be assumed to be prior art merely as a result of its mention in the background section. Similarly, a problem mentioned in the background section or associated with the subject matter of the background section should not be assumed to have been previously recognized in the prior art. The subject matter in the background section merely represents different approaches, which in and of themselves may also correspond to implementations of the claimed technology.
A video is a time-based media storage format for moving picture information. A video may be described as a series of pictures, or frames, that are displayed at a rapid rate known as the frame rate, which is the number of frames displayed per second. Each frame is composed of elements called pixels that can be illuminated or darkened on a display screen. The resolution of a frame depends on the number of pixels present in the frame. Further, the overall quality of a video varies depending on a number of factors such as the number of frames per second, colour space, resolution, and the like. Apart from a number of sequenced frames, a video may also comprise an audio stream that adds to the content of the video and can be played by means of audio output devices such as speakers. However, the audio stream of a video may be in a specific language that may not be understood by some viewers, for example, viewers who are not native speakers of the language used in the video. Moreover, in case a video is being viewed by a hearing-impaired person, the video may not be fully understood.
For ease of use, a video may include written text, called captions, that accompanies the video. Captions display text that transcribes the narration and provides descriptions of the dialogues and sounds that are present in a video. Captions are generally synchronized with the video frames so that viewers can understand the content of the video that is presented, regardless of whether or not they are able to understand the audio. There are two ways of embedding captions in a video, namely closed captions and open captions. Closed captions can be toggled on/off and are embedded using a timed-text file, which is created by adding time codes to a transcript of the video. Delivering video products with closed captions places responsibility on the viewer to understand how to turn on the captions, either on their television sets or in their media viewer software. To simplify use, open captions are preferred, where the text is burnt into the frames such that it is visible whenever the video is viewed, i.e. textual information (like subtitles, credits, titles, slates, etc.) is burnt into a video such that it becomes a part of the frame data. Open captions are always present over the frames of the video and cannot be toggled on/off, and no additional player functionality is required for presenting them. Moreover, open captions are added during the video editing process. Unlike closed captions, where the textual information is provided in a separate channel using text files or encoded files, the burnt-in text is provided in the same channel as the video.
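By way of illustration only, the creation of a timed-text file for closed captions, by adding time codes to a transcript as described above, may be sketched as follows. The sketch assumes the widely used SubRip (.srt) layout as the timed-text format; the background does not name a particular format, and the names `Cue`, `format_timecode` and `to_srt` are hypothetical helpers introduced only for this example.

```python
from dataclasses import dataclass


@dataclass
class Cue:
    """One transcript line with its display interval (hypothetical helper type)."""
    start_s: float  # time at which the caption appears, in seconds
    end_s: float    # time at which the caption disappears, in seconds
    text: str       # transcribed narration, dialogue, or sound description


def format_timecode(seconds: float) -> str:
    """Render a time in seconds as an SRT time code, HH:MM:SS,mmm."""
    total_ms = round(seconds * 1000)
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def to_srt(cues: list[Cue]) -> str:
    """Attach time codes to transcript lines, producing SRT-style timed text."""
    blocks = []
    for index, cue in enumerate(cues, start=1):
        time_line = f"{format_timecode(cue.start_s)} --> {format_timecode(cue.end_s)}"
        # Each cue block is: sequence number, time-code line, caption text.
        blocks.append(f"{index}\n{time_line}\n{cue.text}\n")
    # A blank line separates consecutive cue blocks.
    return "\n".join(blocks)


if __name__ == "__main__":
    transcript = [
        Cue(0.0, 2.5, "Hello."),
        Cue(2.5, 4.0, "[door closes]"),
    ]
    print(to_srt(transcript))
```

Because such a file travels separately from the frame data, a player can show or hide its contents on request; burnt-in text offers no comparable side file to inspect.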
In a media workflow, it is very important to have a method, and a system implementing the method, for detecting whether or not captions have been inserted properly within the video before broadcast. However, it is not possible to validate the textual data semantically without extracting the data from the video. Hence, the main requirement for such a system is to detect the presence of burnt-in text within the video. Unlike closed captions, where separate text files containing textual information or a separate channel carrying encoded text information is present, it is difficult to validate burnt-in information in the case of open captions without detecting text in a frame. Only after detection can the textual information be validated for its positioning, paint style, language, and the like.
There are many methods that claim to meet the above-mentioned requirements; however, each method has its drawbacks. The existing methods suffer from a very high miss rate as they depend on specific characteristics of the text burnt into the video, such as statistical characteristics, angular point characteristics, caption boxes, and many more, which are not universal and may vary from one video to another. Moreover, the current state of the art for detection of textual information present in open captions is not able to handle text with different font sizes, different formatting, and different languages. So, there exists a need for detection of burnt-in text in any video format that does not rely only on certain text characteristics. In the current disclosure, a new method and system for detection of burnt-in text within a video is described that works properly for a wider range of text characteristics.