In some forms of video content, the video may be composed of individual video frames that may be grouped into a number of shots. In some examples, a shot may be characterized as a sequence of frames that are captured with a certain visual angle of a camera. A scene may be characterized as a collection of shots that may be related in action, place, context, and/or time, with such relationship perhaps corresponding to the nature of the content or program. For example, in some examples of situation comedies, soap operas, and/or dramatic programs, a scene may be characterized as a continuous set of shots that capture a certain action taking place in a particular location.
While watching or browsing video content, a user may desire to access a particular scene or portion of the content related to a scene. One approach to locating scenes within video content may involve grouping individual frames into shots by detecting shot boundaries at shot transitions. Hard cut shot transitions, in which the first frame of an appearing shot immediately follows the last frame of a disappearing shot, may be located by detecting differences in consecutive frames. On the other hand, gradual shot transitions typically span multiple frames over which the disappearing shot gradually transitions to the appearing shot. Within a gradual shot transition, temporally adjacent frames may be a combination of the disappearing shot and the appearing shot. As such, a gradual shot transition may include smaller and nonlinear differences between consecutive frames, making it more challenging to accurately identify a shot boundary.
Once shots are identified, the shots may be clustered into scenes. Algorithms that use K-mean clustering to cluster shots into scenes are known. These algorithms, however, typically depend upon an estimation of the number of expected clusters. As such, these approaches are highly sensitive to a correct estimation of the number of expected clusters. The corresponding algorithms are also relatively complicated and computationally expensive. Furthermore, while the correlation among individual frames that constitute a shot may be fairly reliable, the correlation among shots that comprise a scene may be more unpredictable, and may depend on the angle of the camera, the nature of the scene, and/or other factors. Accordingly, it can prove challenging to reliably and repeatedly identify scenes.