The invention relates to an automated method for the temporal segmentation of a video into scenes by uniting frame sequences, wherein frame sequence boundaries are detected as scene boundaries, different types of transitions occur at the frame sequence boundaries, and the frame sequence boundaries and transitions are known with respect to position, type and length.
The automated detection of scene boundaries, and thus of scenes, serves to support the user in browsing a video, to make its content accessible faster and more conveniently, to allow the compilation of a kind of table of contents, and to enable further (automated or manual) analyses and annotations. Scenes comprise a series of frame sequences, as a rule shots (see below), and are characterized by an event taking place in one setting during a continuous period of time. In this context, the term “scene” (or “logical story unit”) is used for fictional, narrative videos (e.g. feature films, sitcoms, animated cartoons). By analogy, for non-fictional videos (e.g. news, documentaries, educational films, music and sports programs), the term “topic unit” or “topic” is used. “Logical unit” (or “story unit”) is the umbrella term for “scenes” and “topic units”. However, this terminology is not consistently established among experts, so that in the following the term “scene” will be used for all videos independently of their content. Furthermore, the terms “video” and “film” as well as “video frame” and “frame” will be used interchangeably.
The segmentation of videos into scenes by uniting particular frame sequences is the subject of the present invention. Thus, it has to be detected which frame sequence boundaries are at the same time scene boundaries. As a rule, scenes are composed of shots as particular frame sequences. However, a scene boundary may also lie within a shot. Such a boundary can only be detected during segmentation if the shot is divided into individual sub-shots (as frame sequences subordinate to the shot) and it is checked whether a sub-shot boundary is at the same time a scene boundary. Thus, a scene boundary may also lie on a sub-shot boundary in the middle of a shot. However, no gradual transitions based on film grammar occur at sub-shot boundaries; in the following, sub-shot transitions will therefore be treated as transitions of the CUT type. Knowledge of the individual shots or sub-shots and of their transitions with respect to position, length and type is a prerequisite for applying the method according to the invention. In the following, a detection of shot boundaries and associated shot transitions will be presumed. The method according to the invention may also be applied to sub-shot boundaries and sub-shot transitions if these are known with respect to temporal position, type and length from a corresponding detection method. In the following, the term “frame sequence” will be used for both “shot” and “sub-shot”.
A frame sequence boundary is established with video-frame accuracy, that is, it lies exactly between two successive video frames. “Transitions” are transitions between frame sequences; they either coincide with the frame sequence boundary or include it. Four different types of transitions are to be distinguished. The CUT type is a hard, abrupt cut; it is established with video-frame accuracy and coincides with the frame sequence boundary. In contrast to this abrupt transition, three gradual, continuous transitions exist between shots as frame sequences: the DISSOLVE type as a dissolve (one frame is slowly and completely faded out, while another frame is faded in at the same speed); the FADE type as a fade-out to or fade-in from a black frame; and the WIPE type as a wipe, wherein a frame is displaced from the screen by a new frame shifting into the screen to the same degree. Parts of the old frame may also be replaced one after the other by parts of the new frame. In this process, one or several boundaries traverse the screen (e.g. according to the principle of a “windscreen wiper”). The gradual transitions mentioned represent the most frequently occurring transitions used in film grammar. However, other gradual transitions also exist, which are incorporated into the present invention by analogy.
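Purely by way of illustration, the boundary and transition attributes described above (position, type, length) may be represented in a simple data structure. The following Python sketch uses hypothetical names that are not part of any cited method:

```python
from dataclasses import dataclass
from enum import Enum

class TransitionType(Enum):
    CUT = "cut"            # hard, abrupt cut; video-frame accurate
    DISSOLVE = "dissolve"  # gradual cross-fade between two frames
    FADE = "fade"          # fade-out to / fade-in from a black frame
    WIPE = "wipe"          # new frame displaces the old one across the screen

@dataclass
class Transition:
    boundary_frame: int    # the boundary lies between this frame and the next
    type: TransitionType
    length: int            # 0 for CUT; number of frames for gradual transitions

# Example: a dissolve of 25 frames around the boundary after frame 120
t = Transition(boundary_frame=120, type=TransitionType.DISSOLVE, length=25)
```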
In general, different methods for the temporal segmentation of a video into scenes are known. One has to distinguish between methods for scene segmentation based on visual properties, in particular the similarity between individual shots (similarity-based methods, e.g. overlapping links, clustering, time-adaptive grouping, graph-based methods); coherence-based methods (video coherence/decision curve), in which coherence values are computed at particular points (often shot boundaries) as a measure for the consistency (coherence) of the video content before and after this point, with minimum coherence values indicating scene boundaries; and rule- and model-based methods, which, however, play only an insignificant role in scene segmentation. In the following, methods for scene segmentation will be considered which are either similarity- or coherence-based.
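As an illustration of the coherence-based principle described above, where minimum coherence values indicate scene boundaries, the following sketch detects local minima of a coherence curve. The function name and the representation of the curve as a plain list are assumptions of this illustration, not part of any cited method:

```python
def coherence_minima(coherence):
    """Return the indices k at which the coherence curve has a local minimum.

    coherence: list of coherence values, one per candidate point
    (often one per shot boundary). Low coherence before/after a point
    suggests a scene boundary at that point.
    """
    return [k for k in range(1, len(coherence) - 1)
            if coherence[k] < coherence[k - 1] and coherence[k] < coherence[k + 1]]

# Example: two dips in the curve yield two candidate scene boundaries.
print(coherence_minima([5, 2, 4, 3, 6]))  # -> [1, 3]
```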
For example, methods for finding shot boundaries, taking the shot transitions into account with respect to position, type and length, are known from:    Publication I: C. Petersohn. “Fraunhofer HHI at TRECVID 2004: Shot Boundary Detection System”, TREC Video Retrieval Evaluation Online Proceedings, TRECVID, 2004,    Publication II: C. Petersohn. “Dissolve Shot Boundary Determination”, Proc. IEE European Workshop on the Integration of Knowledge, Semantics and Digital Media Technology, pp. 87-94, London, UK, 2004, and    Publication III: C. Petersohn. “Wipe Shot Boundary Determination”, Proc. IS&T/SPIE Electronic Imaging 2005, Storage and Retrieval Methods and Applications for Multimedia, pp. 337-346, San Jose, Calif., 2005.
A method for finding sub-shot boundaries and sub-shot transitions is known from publication IV: C. Petersohn. “Sub-Shots—Basic Units of Video”, Proc. IEEE International Conference on Systems, Signals and Image Processing, and EURASIP Conference Focused on Speech and Image Processing, Multimedia Communications and Services, Maribor, Slovenia, Jun. 27th to 29th, 2007.
Further, US 2003/0123541 A1, US 2001/0021276 A1, EP 1 021 041 B1, DE 60119012 T2 and DE 102 527 31 A1 specify methods for finding shot transitions. Such methods are thus sufficiently known in the art and do not need to be explained further. It may be assumed that the person skilled in the art can obtain the information about the temporal position of frame sequence boundaries and about the type and length of the transitions at these boundaries that is needed for performing the method according to the invention.
Furthermore, some methods for scene detection exist in the art, which, however, are as a rule based on the visual similarity of shots within scenes. In part, audio or text is analyzed as well. However, there are three publications in which different types of shot transitions are utilized for scene detection. From publication V by Truong, B. T., Venkatesh, S., Dorai, C. (2002): “Film grammar based refinements to extracting scenes in motion pictures” (IEEE International Conference on Multimedia and Expo, ICME'02, Vol. 1, pp. 281-284), it is known to use all FADES and to add their positions as scene boundaries to the list of scene boundaries otherwise detected. FADES are thus treated as having an exclusively separating effect. Additionally, other types of shot transitions are ignored, although in publication VI by Lu, X., Ma, Y.-F., Zhang, H.-J., and Wu, L. (2002): “An integrated correlation measure for semantic video segmentation” (Proc. IEEE International Conference on Multimedia and Expo, ICME'02, Lausanne, Switzerland) it is mentioned that WIPES might be taken into account in the segmentation with a separating effect (“disjunction”). However, here too, not all types of shot transitions are considered, and an exclusively separating effect is taken into account. Nor is the reliability of the different types of gradual transitions as indicators of scene boundaries examined in detail. An integration occurs exclusively for the coherence-based method described in publication VI. In publication VII by Aigrain, P., Joly, P., Longueville, V. (1997): “Medium knowledge-based macro-segmentation of video into sequences” (in M. T. Maybury, ed., Intelligent Multimedia Information Retrieval, AAAI/MIT Press), two rules (transition effect rules 1 and 3) are defined with respect to scene boundaries. Rule 1: For each stand-alone gradual shot transition, a scene boundary is inserted.
Rule 2: For each series of gradual shot transitions, a scene boundary is inserted before and after it, if needed. However, a problem arises in that each gradual transition is used as an equally reliable indicator for scene boundaries, and gradual transitions that were wrongly detected beforehand by the automated shot detection are also marked as scene boundaries. Because of these rigid rules, the method is not sufficiently robust.
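The two transition effect rules of publication VII, as paraphrased above, could be sketched as follows. The representation of transitions as (boundary index, type) pairs, and the reading of “before and after” a series as the first and last boundary of that series, are assumptions of this illustration, not details confirmed by the publication:

```python
GRADUAL = {"DISSOLVE", "FADE", "WIPE"}

def rule_based_scene_boundaries(transitions):
    """Apply the two rules to an ordered list of (boundary_index, type) pairs.

    Rule 1: a stand-alone gradual transition yields a scene boundary.
    Rule 2: a series of gradual transitions yields scene boundaries at the
            first and last boundary of the series (assumed interpretation).
    """
    boundaries = set()
    i = 0
    while i < len(transitions):
        if transitions[i][1] in GRADUAL:
            j = i
            while j + 1 < len(transitions) and transitions[j + 1][1] in GRADUAL:
                j += 1
            if i == j:                       # Rule 1: stand-alone gradual transition
                boundaries.add(transitions[i][0])
            else:                            # Rule 2: series of gradual transitions
                boundaries.add(transitions[i][0])
                boundaries.add(transitions[j][0])
            i = j + 1
        else:
            i += 1
    return sorted(boundaries)

# A single FADE among CUTs triggers Rule 1:
print(rule_based_scene_boundaries([(0, "CUT"), (1, "FADE"), (2, "CUT")]))  # -> [1]
```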
Furthermore, there are different scene detection methods utilizing one or several video frames per shot to ascertain the similarity between two adjacent shots. If there is great similarity (in relation to a predetermined threshold value), this implies that the two shots belong to a common scene; the shot boundary between these two shots is then not detected as a scene boundary. If there is no great similarity, the shot boundary is detected as a scene boundary. Both in publication VIII by Hanjalic, A., Lagendijk, R. L., Biemond, J. (1999): “Automated high-level movie segmentation for advanced video-retrieval systems” (IEEE Trans. on Circuits and Systems for Video Technology, 9(4):580-588) and in publication IX by Rui, Y., Huang, T. S., Mehrotra, S. (1999): “Constructing table-of-content for videos” (Multimedia Systems, 7(5):359-368), exactly two video frames per shot are utilized for the similarity analysis, basically the first and the last frame of a shot. Gradual shot transitions are not taken into account here.
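The similarity-based decision described above, where a shot boundary becomes a scene boundary only if the adjacent shots are not sufficiently similar, can be sketched as follows. The similarity function and the threshold are placeholders of this illustration and not those of publications VIII or IX:

```python
def similarity_scene_boundaries(shots, similarity, threshold):
    """Mark shot boundary k as a scene boundary when the similarity of
    shots k and k+1 falls below the threshold.

    shots: one representative description per shot (e.g. derived from the
           first and last frame); similarity: higher value = more similar.
    """
    boundaries = []
    for k in range(len(shots) - 1):
        if similarity(shots[k], shots[k + 1]) < threshold:
            boundaries.append(k)
    return boundaries

# Toy example: shots are scalar "feature values"; similarity drops with distance.
sim = lambda a, b: 1.0 - abs(a - b)
print(similarity_scene_boundaries([0.9, 0.85, 0.2], sim, 0.5))  # -> [1]
```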
A further coherence-based method for scene detection is described in publication X by Truong, B. T., Venkatesh, S., and Dorai, C. (2002): “Neighborhood coherence and edge based approach for scene extraction in films” (IEEE International Conference on Pattern Recognition, ICPR'2002, Vol. 2, pp. 350-353, Quebec). Here, scene boundaries are inserted for all coherence values below a fixed threshold value.
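The fixed-threshold decision described above amounts to the following simple sketch; the function name and data layout are assumptions of this illustration, not those of publication X:

```python
def threshold_scene_boundaries(coherence, threshold):
    """Insert a scene boundary at every point whose coherence value
    lies below the fixed threshold."""
    return [k for k, c in enumerate(coherence) if c < threshold]

# Example: two coherence values fall below the fixed threshold of 0.5.
print(threshold_scene_boundaries([0.8, 0.3, 0.9, 0.1], 0.5))  # -> [1, 3]
```

A fixed threshold of this kind is precisely the rigidity criticized below: it cannot adapt to videos whose coherence values are globally higher or lower.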
However, the mentioned approaches for ascertaining shot boundaries as scene boundaries, on which the present invention relies as the closest known art, are based on rigidly fixed rules and, in part, on threshold values, and they produce errors in the detection of scene boundaries as soon as the examined frame sequences do not conform to these rules or threshold values. Additionally, these approaches utilize the information available in a video for scene segmentation only insufficiently.