1. Field of the Invention
The present invention relates to the field of video abstraction and archiving, and more specifically to a video organization and indexing system that uses closed-caption information of the video and natural language processing tools to enable content-based abstraction and archival of videos.
2. Description of the Prior Art
For a multimedia information system to better meet the users' needs, it must capture the semantics and terminology of specific user domains and allow users to retrieve information according to such semantics. This requires the development of a content-based indexing mechanism, which is rich in its semantic capabilities for abstraction of multimedia information, and also provides canonical representation of complex scenes in terms of objects and their spatio-temporal behavior. A key initial stage in this content-based indexing process is video organization. The objective of video organization is to capture the semantic structure of a video in a form that is meaningful to the user, i.e. providing a video table of contents analogous to the table of contents in a book.
There have been two different approaches to video organization. The research efforts in database systems have mostly focused on attribute-based indexing of multimedia information, which entails a level of abstraction that reduces the scope for posing ad hoc queries to the database. This is described by P. England, R. B. Allen, M. Sullivan, A. Heybey, M. Bianchi, and A. Dailianas in "I/Browse: The Bellcore Video Library Toolkit", Storage and Retrieval for Still Image and Video Databases, SPIE, pp. 254-264, February 1996. On the other hand, with the automatic approach, the research in computer vision relies on integrated feature extraction/object recognition subsystems to segment video into meaningful semantic units. This is described by M. M. Yeung and B. L. Yeo in "Time-constrained Clustering For Segmentation Of Video Into Story Units", International Conference on Pattern Recognition, C, pp. 375-380, 1996; H. J. Zhang, Y. H. Gong, S. W. Smoliar and S. Y. Liu in "Automatic Parsing Of News Video", International Conference on Multimedia Computing and Systems, pp. 45-54, 1994; and D. Swanberg, C. F. Shu and R. Jain in "Knowledge Guided Parsing In Video Databases", Storage and Retrieval for Image and Video Databases, SPIE vol. 1908, pp. 13-25, 1993.
Both approaches to video organization have their own limitations. The attribute-based approach needs a human operator to manually index the multimedia information, while the automatic approach is computationally very expensive and difficult, and tends to be very domain specific. It is nearly impossible to obtain useful video organization in practice based solely on automatic processing.
In addition, automatic approaches do not include closed-caption information analysis to enhance their results. Nowadays, many videos are made available with closed-captioned text or transcripts (in Europe). These include all major news broadcasts, documentaries and motion pictures. Live action video feed is also being closed-captioned online in some cases. While closed-captioned text is intended to aid the hearing-impaired, it can be used to great advantage in the organization and indexing of video for archiving and browsing. With the availability of attached text, words could be used as features for comparing video segments instead of, or in addition to, visual features extracted from the video frame images. Natural language keywords have much more descriptive power and are much easier to use than abstract image features, which often do not correspond to the perceived features of the image. In addition, natural language keywords provide higher-level semantics, thus enabling real content-based video archiving and retrieval. Retrieval based on text has been a focus of research for a long time, and powerful tools are available for indexing databases by natural language keywords. Advanced natural language processing tools are also becoming increasingly available. Therefore, it is important to try to use the textual information added to the video to enhance the results obtained from processing the audio and video components of the video alone. However, closed-caption text comes with its own costs. It is usually not aligned with the audio-visual information. Often the closed-caption sentences are incomplete and contain misspelled words. Hence, it is believed that a human operator has to be in the loop to correct the automatically produced results and provide feedback on them.
An improvement would be a hybrid approach that uses the closed-caption and audio information in addition to the visual information. Thus, the system should automatically segment the video and create the video table of contents in a preprocessing step, while providing an easy-to-use interface for verification and correction of the automatically extracted video structure. It is an object of the present invention to provide such a hybrid system for generating organized video, where the video is divided into distinct stories that are further segmented into separate speaker blocks if there are multiple speakers within them. Besides Video Table Of Contents (VTOC) generation, it is an object of the present invention that the system be supported by many other automatic video organization methods, including scene cut-detection, shot grouping based on visual similarity, audio segmentation into music, speech and silence, proper noun extraction from closed-caption text, and division of video into different story units by closed-caption analysis.
The present invention is directed to a system for organizing digital videos to archive and access them at different levels of abstraction. The present invention includes a computer readable storage medium having a computer program stored thereon performing the step of using the data available from the closed-caption text, along with off-the-shelf natural language processing tools, to segment the video into self-contained story sections and speaker blocks. In further detail, if the subject changes are marked, the system uses these points to divide the video into distinct stories, which are represented as nodes attached to the root node in a tree structure, and groups the speaker segments belonging to a story under the story node as its children. If the subject changes are not marked, the system relies on the observation that some common elements, such as keywords like names of people, places and organizations, will be present when the same subject is being discussed. It therefore uses proper nouns to group similar segments into stories, considering their temporal proximity before grouping them into the same story. The system also checks and, if necessary, modifies the results obtained at the previous steps of the video organization using the interactive user interfaces, which also provide for proceeding seamlessly from one processing step to the next.
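By way of illustration only, the grouping of speaker segments into stories when subject changes are not marked might be sketched as follows. This is a minimal sketch and not the claimed implementation: the `Segment` and `Node` structures, the greedy grouping rule, and the `max_gap` and `min_shared` thresholds are all assumptions introduced for the example, and the proper nouns are assumed to have been extracted beforehand by an external natural language processing tool.

```python
from dataclasses import dataclass, field


@dataclass
class Segment:
    """One speaker block taken from the closed-caption text (hypothetical structure)."""
    start: float             # start time in seconds
    speaker: str
    proper_nouns: frozenset  # assumed output of an external NLP tool


@dataclass
class Node:
    """A node of the video table of contents tree."""
    label: str
    children: list = field(default_factory=list)


def group_into_stories(segments, max_gap=60.0, min_shared=1):
    """Greedy grouping (an assumed rule): a segment joins the current story only
    if it starts within max_gap seconds of the previous segment AND shares at
    least min_shared proper nouns with the story so far; otherwise it opens a
    new story node under the root."""
    root = Node("video")
    story_nouns = set()
    prev_start = None
    for seg in sorted(segments, key=lambda s: s.start):
        close = prev_start is not None and seg.start - prev_start <= max_gap
        similar = len(story_nouns & seg.proper_nouns) >= min_shared
        if not (close and similar):
            root.children.append(Node(f"story {len(root.children) + 1}"))
            story_nouns = set()
        # Speaker segments become children of their story node.
        root.children[-1].children.append(Node(f"{seg.speaker} @ {seg.start:.0f}s"))
        story_nouns |= seg.proper_nouns
        prev_start = seg.start
    return root


# Invented sample data, for illustration only.
example = [
    Segment(0.0, "anchor", frozenset({"Clinton", "Washington"})),
    Segment(20.0, "reporter", frozenset({"Clinton", "Congress"})),
    Segment(45.0, "anchor", frozenset({"Tokyo", "Nikkei"})),
    Segment(70.0, "analyst", frozenset({"Nikkei"})),
    Segment(400.0, "anchor", frozenset({"Nikkei"})),
]
toc = group_into_stories(example)
for story in toc.children:
    print(story.label, [child.label for child in story.children])
```

In this sketch the last segment shares a proper noun with the preceding story but starts far later, so it opens a new story; this is one simple way to let temporal proximity gate the proper-noun comparison. A human operator would still be expected to verify and correct the resulting tree through the user interface.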