In videos (and movies), shot and scene boundaries provide a structure that can be useful for understanding, organizing, and browsing the videos.
A shot boundary occurs when the shutter opens, and another shot boundary occurs when the shutter closes. Thus, a shot is a continuous, uninterrupted sequence of frames. Generally, shots for drama, action, and situation comedies are in the order of a few seconds.
As defined herein, a scene is a semantically meaningful or cohesive sequence of frames. Scenes generally last several minutes. For example, a common scene includes actors talking to each other. The camera(s) usually present the scene as several close-up shots, where each actor is shown in turn, either listening or talking, and occasionally a shot will show all actors in the scene at a middle or far distance.
Detecting scene boundaries is challenging because scene boundaries for different genres, and even scene boundaries within one genre, do not necessarily have any obvious similarities.
Scene boundaries in scripted and unscripted videos can be detected by low-level visual features, such as image differences and motion vectors, as well as differences in distributions of audio features. Usually, after a feature extraction step, a comparison with a set threshold is required, see Jiang et al., “Video segmentation with the support of audio segmentation and classification,” Proc. IEEE ICME, 2000, Lu et al., “Video summarization by video structure analysis and graph optimization,” Proc. IEEE ICME, 2004, Sundaram et al., “Video scene segmentation using video and audio features,” Proc. IEEE ICME, 2000, and Sundaram et al., “Audio scene segmentation using multiple models, features and time scales,” IEEE ICASSP, 2000. All of the above techniques are genre specific. This means the detector is trained for a particular genre of video, and will not work for other genres. It is desired to provide a scene detector that will work for any genre of video.
Detecting semantic scene boundaries is challenging due to several factors including: lack of training data; difficulty in defining scene boundaries across diverse genres; absence of a systematic method to characterize and compare performance of different features; and difficulty in determining thresholds in hand-tuned systems.