1. Technical Field
The invention is related to detecting particular video sequences in a multimedia broadcast stream, and in particular, to a system and method for automatically detecting and segmenting music videos in an audio-video media stream.
2. Related Art
Multimedia data streams, such as audio-video streams that include music or songs, are found in a number of environments, including television broadcasts and data streamed across a network such as the Internet. However, when such streams are captured or otherwise stored for later viewing or playback, it is often desirable to index, parse, or otherwise provide a capability to browse particular portions of the media stream. In order to efficiently access particular portions of a stored media stream, the media must be parsed or otherwise indexed or segmented into uniquely identifiable segments of content.
For example, a number of conventional schemes attempt to parse video content into “shots.” A shot is defined as a number of sequential image frames comprising an uninterrupted segment of a video sequence. In parsing the video into shots, conventional media processing systems attempt to identify shot boundaries by analyzing consecutive frames for deviations in content from one frame to another.
One scheme for determining a transition point between shots in a video sequence involves the use of color histogram based segmentation. This scheme generates a color histogram for each of a number of consecutive frames. These histograms are then analyzed to detect significant deviation between frames. A deviation that exceeds a particular deviation threshold is determined to indicate a shot boundary. Unfortunately, while such methods are useful for identifying particular shot boundaries, they fail to identify related shots that, when taken together, form a continuous segment of related video, such as, for example, a complete music video, which typically comprises a large number of shots.
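The histogram-based scheme described above can be sketched as follows. This is only a minimal illustration under stated assumptions, not the implementation used by any particular conventional system: frames are assumed to be sequences of 8-bit (r, g, b) tuples, the histogram quantizes each channel into a small number of bins, the deviation measure is assumed to be half the L1 distance between normalized histograms, and the function names, bin count, and threshold value are all hypothetical.

```python
from typing import List, Optional, Sequence, Tuple

Pixel = Tuple[int, int, int]   # assumed 8-bit (r, g, b) values
Frame = Sequence[Pixel]        # assumed: a frame is a flat sequence of pixels


def color_histogram(frame: Frame, bins_per_channel: int = 4) -> List[float]:
    """Return a normalized joint RGB histogram for one frame."""
    hist = [0.0] * (bins_per_channel ** 3)
    for r, g, b in frame:
        # Quantize each 0..255 channel value into bins_per_channel bins.
        rb = r * bins_per_channel // 256
        gb = g * bins_per_channel // 256
        bb = b * bins_per_channel // 256
        hist[(rb * bins_per_channel + gb) * bins_per_channel + bb] += 1.0
    total = float(len(frame)) or 1.0  # avoid division by zero on empty frames
    return [count / total for count in hist]


def histogram_distance(h1: Sequence[float], h2: Sequence[float]) -> float:
    """Half the L1 distance: 0.0 for identical, 1.0 for disjoint histograms."""
    return 0.5 * sum(abs(a - b) for a, b in zip(h1, h2))


def detect_shot_boundaries(
    frames: Sequence[Frame],
    threshold: float = 0.5,        # hypothetical deviation threshold
    bins_per_channel: int = 4,
) -> List[int]:
    """Return indices of frames that begin a new shot."""
    boundaries: List[int] = []
    prev: Optional[List[float]] = None
    for i, frame in enumerate(frames):
        hist = color_histogram(frame, bins_per_channel)
        # A deviation above the threshold between consecutive frames
        # is treated as a shot boundary.
        if prev is not None and histogram_distance(prev, hist) > threshold:
            boundaries.append(i)
        prev = hist
    return boundaries


if __name__ == "__main__":
    red = [(255, 0, 0)] * 16
    blue = [(0, 0, 255)] * 16
    # Three red frames followed by two blue frames.
    print(detect_shot_boundaries([red, red, red, blue, blue]))
```

Note that the sketch reports only where one shot ends and the next begins. It does not indicate which shots belong to the same music video, which is the limitation noted above.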
Another related scheme automatically indexes broadcast television news video by indexing particular shots or scenes within the video according to a correspondence between image contents and semantic attributes of “keywords.” This scheme operates by first classifying shots or scenes based on graphical features of the shots, and then analyzing semantic attributes of accompanying text-type captions. Next, keywords derived from the accompanying text are selectively indexed to shots according to appropriate correspondence of typical shot classes and semantic attributes of keywords. However, while useful, this scheme is narrowly tailored to index news-type video broadcasts that include accompanying text captions. Consequently, such a scheme would likely perform poorly in other audio-video multimedia environments, such as music video broadcasts.
Therefore, what is needed is a system and method for efficiently extracting or segmenting complete video objects from a media stream, such as a broadcast television signal or streaming network broadcast, by identifying the actual endpoints of each video object rather than merely identifying unique shots within the video stream. Further, such a system and method should be capable of extracting text information, when available, for use in identifying, indexing, or cataloging each video object.