1. Technical Field
The disclosed embodiments relate in general to systems and methods for video content indexing and, more specifically, to computerized systems and computer-implemented methods for enhanced multi-modal lecture video indexing.
2. Description of the Related Art
Recording video is now an everyday activity, as devices such as video cameras, smartphones and digital cameras have become commonplace. In addition, thanks to a wide range of services and infrastructures available for handling video, videos can easily be found and watched online. However, videos without any metadata, such as date, creator, keywords, or description, cannot be found using typical search engines. Metadata is usually added manually, and the process is very time-consuming. Furthermore, even if a video can be found by its metadata, search engines are normally not capable of finding a specific scene of interest within the video.
U.S. Pat. No. 8,280,158 describes a lecture video search engine. This system distinguishes itself from other video search engines by focusing on the lecture video genre and exploiting the way such videos are structured according to slides. The described system automatically detects slides appearing in the videos and processes slide frames to extract their text using optical character recognition (OCR). A powerful slide keyframe-based interface allows users to efficiently navigate inside the videos using text-based search.
Another source of text that describes a broader class of video is text extracted from the audio track, typically using automatic speech recognition (ASR). In previously published work (Matthew Cooper, Presentation video retrieval using automatically recovered slide and spoken text, In Proc. SPIE, volume 8667, 2013), the characteristics of the text derived from slides and from speech were established and contrasted in controlled experiments with manual ground truth. Spoken text from ASR or closed captions (CC) is typically dense, improvised, and contains generic terms. Errors in ASR occur at the word level due to acoustic mismatch and degrade retrieval performance. Slide text, whether taken from the slide file or extracted from slides using OCR, is relatively sparse but contains discriminative terms as the product of an authoring process. Errors in automatically extracted slide terms occur at the character level and have a negligible impact on retrieval.
Thus, new and improved systems and methods are needed that would integrate text derived from the speech in the audio track and the slides in the video frames for enhanced multi-modal video indexing.