The described embodiments relate generally to video processing, and more particularly to entity-based temporal segmentation of video streams.
The sharing of videos with a variety of different content and encoded in different formats through hosting services such as YOUTUBE creates a growing need for effective organization, indexing and management. Most of the existing solutions for video browsing and retrieval are shot-based, where a video stream is temporally segmented into shots. A shot of a video stream is an unbroken sequence of video frames of the video stream taken from one camera; two temporally adjacent segments produced by shot-based temporal segmentation are visually different.
Many multimedia applications are directed to the semantics of video scenes rather than to temporal visual differences between adjacent shots. One challenge in shot-based temporal segmentation is to link the raw low-level video data with the high-level semantic fields of a video stream, e.g., finding appropriate representations of the visual content that reflect the semantics of the video. Take as an example a contiguous shot of an aircraft flying towards a runway and landing: on the semantic level, the contiguous shot includes two scenes, one describing the aircraft flying and the other describing the aircraft landing. A shot-based segmentation may not differentiate between the two scenes if the transition between them is smooth.
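The limitation noted above can be illustrated with a minimal sketch. It assumes, purely for illustration, that each frame is summarized by a small normalized histogram and that a shot boundary is declared when the L1 distance between consecutive histograms exceeds a fixed threshold. The function names, the two-bin histograms and the threshold value are hypothetical and are not taken from any particular shot-detection system.

```python
from typing import List, Sequence


def l1_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Sum of absolute bin-wise differences between two histograms."""
    return sum(abs(x - y) for x, y in zip(a, b))


def shot_boundaries(histograms: List[Sequence[float]], threshold: float) -> List[int]:
    """Return indices i where frame i differs from frame i-1 by more than threshold.

    Hypothetical illustration of threshold-based shot detection; real systems
    use richer features and adaptive thresholds.
    """
    return [
        i
        for i in range(1, len(histograms))
        if l1_distance(histograms[i - 1], histograms[i]) > threshold
    ]


# Abrupt change: the visual content switches between frame 2 and frame 3.
hard_cut = [[1.0, 0.0]] * 3 + [[0.0, 1.0]] * 3

# Smooth change: the same overall shift spread evenly across 11 frames
# (an assumed stand-in for a gradual flying-to-landing transition).
smooth = [[1.0 - i / 10, i / 10] for i in range(11)]

print(shot_boundaries(hard_cut, threshold=0.5))
print(shot_boundaries(smooth, threshold=0.5))
```

In this sketch the abrupt change yields a boundary at frame 3, whereas the gradual change yields no boundary, because each consecutive difference (about 0.2) stays below the threshold even though the first and last frames are entirely different.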