1. Field of the Invention
The present invention relates generally to the field of multimedia content analysis, and more particularly, to a system and method for segmenting a video into semantic units using joint audio, visual and text information.
2. Description of the Related Art
Advances in modern multimedia technologies have led to huge and ever-growing archives of videos in various application areas including entertainment, education, training, and online information services. On the one hand, this has made digital videos available and accessible to the general public; on the other hand, it poses great challenges to the task of efficient content access, browsing and retrieval.
Consider, as an example, a video currently available at the website of the CDC (Centers for Disease Control and Prevention). The video is approximately 26 minutes long, and describes the history of bioterrorism. Specifically, the content of the video consists of the following seven parts (in temporal order): overview, anthrax, plague, smallpox, botulism, viral hemorrhagic fevers and tularemia. The website also contains seven other short video clips, with each clip focusing on one of the above seven content parts.
This availability of individual video segments allows them to be assembled according to a given course objective, and is further useful in that a viewer who is interested in only one particular type of disease can directly watch the relevant video clip instead of searching for it in the original long video using the fast-forward or rewind controls on a video player. Nevertheless, this convenience does not come free. With the current state of technology, it can only be achieved by either manual video segmentation or costly video reproduction.
Automatic video segmentation has been a popular research topic for a decade, and many approaches have been proposed. Among the proposed approaches, a common solution is to segment a video into shots, where a shot contains a set of contiguously recorded frames. However, while a shot forms the building block of a video sequence in many domains, this low-level structure in itself often does not directly correspond to the meaning of the video. Consequently, most recent work proposes to segment a video into scenes, where a scene depicts a higher-level concept. Various approaches have been reported as having achieved acceptable results. Nevertheless, a scene is still vaguely defined, and the concept applies only to certain domains of video such as movies. In general, semantic understanding of scene content by jointly exploiting the audio, visual and text cues available in the video has not been adequately addressed by previous efforts in the video analysis domain.
It would, accordingly, be advantageous to provide a system and method for segmenting a video sequence into a series of semantic units, with each semantic unit containing a generally complete and definite thematic topic.