Video is a comprehensive source of information. The individual video pictures are, however, not subject to any intrinsic order such as the one the alphabet predefines for letters. Therefore, different forms of description, for example color histograms, edge histograms or face descriptors, have been used for identifying video pictures. At a low level, such descriptors take into consideration only one single aspect of the video picture, whose content is very complex. Even higher-order descriptors take into consideration only those aspects of a video picture for which they were designed.
For video analysis and indexing, three main questions are basically to be clarified:
Which temporal units should be indexed?
How should the units be indexed?
Which index should be designated by which label?
For representation, abstraction, navigation, search and retrieval of video contents, the automatic recognition of the temporal video structure is an important prerequisite. The units can be indexed only after a temporal video segmentation. If the hierarchical structure of a video is used for this purpose, the search can proceed in a way analogous to the search in a book or magazine with chapters, sections, paragraphs, sentences and words. Video can contain different modalities. Besides the visual content of the pictures, auditory content (music, voice, noises, etc.) and accompanying text can also be present. The lowest hierarchical level of the temporal structure of a video is formed by the single video picture for the visual content, a single sample for the auditory content and a letter or a word for the text content. In the present application, however, only the segmentation and retrieval of the visual content are addressed.
At a higher level, there are the so-called “shots”, which are defined as a sequence of successive video pictures taken in a single camera operation without interruption. Different algorithms exist for detecting both abrupt and gradual shot boundaries. Shots adjacent in time form a “scene” (in the present application, this term is used for a higher-order unit in a finished film and not as a synonym for shots). A scene is defined by one single event which takes place in a continuous setting during a continuous time period. Algorithms for the segmentation of a video into scenes often operate on the visual similarity between individual shots, because shots in the same setting often exhibit the same visual properties. At the highest level are the programs, for example feature films, newscasts, documentary films, series or home videos. Search algorithms for retrieving individual news stories or commercial insertions also already exist. A further “sequence level” is sometimes defined between the program level and the scene level. This term should not be confused with the term “sequence of video pictures”, used for example for the content of a shot.
Scenes are defined as events which take place within a continuous time period in a setting (scene: an event in a setting). Shots are taken continuously by a camera in a continuous time period and consist of a corresponding sequence of video pictures. Through continuous camera panning, however, the setting, and thus the scene, can change within a shot. Since the event and the setting can change within a shot, not every scene boundary is also a shot boundary. In other words, a scene can also contain only parts of a shot. Hence, the picture content within a shot can also vary widely. Therefore, several key-frames (key pictures) are at present generally associated with a shot in order to capture its content. Several algorithms for the selection of the key-frames (key-frame selection) are also well known from the state of the art.
Until now, automatic temporal segmentation of a video or a video sequence has normally occurred in two steps, with a key-frame selection step in between. First, the shot boundaries are detected in order to obtain a segmentation at shot level. For the shots thus determined, several key-frames are then usually selected for a characteristic display of the picture content. Finally, groups of adjacent shots are combined into scenes in order to obtain a segmentation at scene level. The grouping of adjacent shots into scenes is, however, only an approximation, since scene boundaries within shots cannot be taken into consideration in the segmentation.
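The conventional procedure described above can be sketched as follows. This is a minimal illustration under stated assumptions and is not taken from any of the cited documents: each video picture is represented by a small normalized histogram, the L1 histogram distance, the middle-picture key-frame rule and both thresholds are hypothetical choices, and only abrupt shot boundaries are handled.

```python
from typing import List, Tuple

Frame = List[float]      # assumption: one normalized histogram per video picture
Shot = Tuple[int, int]   # half-open picture range [start, end)


def hist_distance(a: Frame, b: Frame) -> float:
    """L1 distance between two histograms (hypothetical similarity measure)."""
    return sum(abs(x - y) for x, y in zip(a, b))


def detect_shots(frames: List[Frame], cut_threshold: float = 0.5) -> List[Shot]:
    """Shot level: start a new shot wherever adjacent pictures differ strongly."""
    if not frames:
        return []
    starts = [0]
    for i in range(1, len(frames)):
        if hist_distance(frames[i - 1], frames[i]) > cut_threshold:
            starts.append(i)
    ends = starts[1:] + [len(frames)]
    return list(zip(starts, ends))


def select_key_frame(shot: Shot) -> int:
    """Key-frame selection: here simply the middle picture of the shot."""
    start, end = shot
    return (start + end) // 2


def group_into_scenes(frames: List[Frame], shots: List[Shot],
                      scene_threshold: float = 1.0) -> List[List[Shot]]:
    """Scene level: merge adjacent shots whose key-frames look similar."""
    if not shots:
        return []
    scenes = [[shots[0]]]
    for prev, cur in zip(shots, shots[1:]):
        d = hist_distance(frames[select_key_frame(prev)],
                          frames[select_key_frame(cur)])
        if d <= scene_threshold:
            scenes[-1].append(cur)   # same setting: extend the current scene
        else:
            scenes.append([cur])     # different setting: open a new scene
    return scenes


# Toy data: two similar shots in one setting, followed by a different setting.
frames = [[1.0, 0.0, 0.0]] * 3 + [[0.7, 0.3, 0.0]] * 3 + [[0.0, 0.0, 1.0]] * 3
shots = detect_shots(frames)
scenes = group_into_scenes(frames, shots)
print(shots)
print(scenes)
```

In this sketch every scene is a list of complete shots, so a scene boundary can only ever fall on a shot boundary. This is the approximation that the paragraph above points out.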
Algorithms for identifying and defining scenes are known from the state of the art; they are generally based on the detection of shot boundaries. Some documents, however, also deal with the use of sub-shots in detection.
U.S. Pat. No. 7,127,120 B2 describes a system which reads video and audio material and automatically generates a summary in the form of a new video. To this end, besides shots, sub-shots are also extracted as temporal video segments from the original video, so that complete shots do not have to be used in the summary. The aim of the sub-shot extraction here is neither the discovery of reasonable key-frames for shots nor the discovery of potential scene boundaries within a shot, but the definition of segments which can later be used in the video summary. The method described by way of example for sub-shot extraction is based on the analysis of picture differences between two adjacent pictures (Frame Difference Curve, FDC). Long-term changes are, however, ignored, which can result in both strong oversegmentation and undersegmentation. Therefore, the sub-shots used in U.S. Pat. No. 7,127,120 B2 are not suitable for discovering key-frames for shots that are as few as possible yet representative (key-frame selection problem). In addition, it is not guaranteed that scene boundaries within shots are also sub-shot boundaries within the meaning of U.S. Pat. No. 7,127,120 B2. A definition of scenes based on sub-shots is not provided.
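Why a purely adjacent-picture comparison can both undersegment and oversegment can be illustrated with a small sketch. It is not the method of U.S. Pat. No. 7,127,120 B2; the picture representation (a flat list of gray values), the mean absolute difference, the fixed threshold and all function names are assumptions made for illustration only.

```python
from typing import List

Picture = List[float]  # assumption: flattened gray values in [0, 1]


def mean_abs_diff(a: Picture, b: Picture) -> float:
    """Mean absolute difference between two pictures of equal size."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def frame_difference_curve(pictures: List[Picture]) -> List[float]:
    """Difference between each pair of adjacent pictures."""
    return [mean_abs_diff(pictures[i - 1], pictures[i])
            for i in range(1, len(pictures))]


def boundaries_from_fdc(fdc: List[float], threshold: float) -> List[int]:
    """Indices of pictures that start a new segment.

    fdc[i] compares pictures i and i + 1, so a large value at position i
    places a boundary in front of picture i + 1.
    """
    return [i + 1 for i, d in enumerate(fdc) if d > threshold]


# Slow, steady change: each step is small, but the content drifts far overall.
slow_pan = [[0.05 * k] * 4 for k in range(11)]
# Static content with flicker: the content is the same, but each step is large.
flicker = [[0.3 * (k % 2)] * 4 for k in range(6)]

threshold = 0.2
print(boundaries_from_fdc(frame_difference_curve(slow_pan), threshold))
print(mean_abs_diff(slow_pan[0], slow_pan[-1]))
print(boundaries_from_fdc(frame_difference_curve(flicker), threshold))
```

With these toy inputs the slowly changing sequence yields no boundary although its first and last pictures differ strongly, while the flickering sequence yields a boundary at every picture.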
From publication I of H. Sundaram et al.: “Determining Computable Scenes in Film and their Structures using Audio-Visual Memory Models” (Proc. 8th ACM Multimedia Conf., Los Angeles, Calif., 2000) it is known to identify scenes by means of so-called “shotlets”. The shots are, irrespective of their visual content, simply divided into sections (“shotlets”) one second in length. This is a very simple approach that adds little complexity to the algorithm for discovering the sub-shot boundaries. Scene boundaries within shots can thus be found automatically. Because of the rigid rule for forming the shotlets, however, a strong oversegmentation is obtained in visually static shots with only slightly changing video material. Consequently, publication I provides no solution for the problem of associating relevant key-frames with shotlets.
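The rigid one-second rule can be sketched in a few lines. The function name, the half-open picture ranges and the frame-rate parameter are assumptions for illustration; publication I is not the source of this code.

```python
from typing import List, Tuple


def split_into_shotlets(start: int, end: int, fps: float,
                        shotlet_seconds: float = 1.0) -> List[Tuple[int, int]]:
    """Cut the shot [start, end) into fixed-length pieces, ignoring content."""
    step = max(1, round(fps * shotlet_seconds))  # pictures per shotlet
    return [(s, min(s + step, end)) for s in range(start, end, step)]


# A 60-picture shot at 25 pictures per second: two full shotlets and a remainder.
print(split_into_shotlets(0, 60, fps=25))
# A completely static 10-second shot is still cut into 10 shotlets.
print(len(split_into_shotlets(0, 250, fps=25)))
```

Because the picture content never enters the computation, the number of shotlets depends only on the shot duration, which is the source of the oversegmentation in static shots.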
Furthermore, from publication II of K. Hoashi et al. (KDDI Laboratories): “Shot Boundary Determination on MPEG Compressed Domain and Story Segmentation Experiments for TRECVID 2004” (TREC Video Retrieval Evaluation Online Proceedings, TRECVID, 2004) it is known to set, during the presentation of a newscast, additional points for potential scene boundaries in shots wherever the speech of the newscaster is interrupted. This kind of further subdivision works, however, only for newscasts or similarly structured broadcasts. Publication III of S. Treetasanatavorn et al.: “Temporal Video Segmentation Using Global Motion Estimation and Discrete Curve Evolution” (Proc. IEEE ICIP, International Conference on Image Processing, pages 385-388, Singapore, October 2004) defines segments of shots which are based on the consistency of motion between adjacent video pictures. These motion segments are used to describe the camera motion within a shot. In the earlier publication IV of J. Maeda: “Method for extracting camera operations in order to describe sub-scenes in video sequences” (Proc. SPIE, volume 2187, Digital Video Compression on Personal Computers: Algorithms and Technologies, pages 56-67, May 1994), similar segments are defined and referred to as “sub-scenes”. Such a segmentation serves for describing motion. It is, however, not suited for describing segments with different visual content.
In the state of the art, the problems of a suitable scene recognition and definition which is not based on shot boundaries, and of a suitable key-frame selection for the segments found, have so far been addressed separately. In most cases, one or several key-frames per shot are selected based on the segmentation results of the shot detection.
In publication V of Zhuang et al.: “Adaptive key frame extraction using unsupervised clustering” (Proc. IEEE Int. Conf. Image Processing, pp. 866-870, Chicago, Ill., 1998) it is proposed to cluster all the pictures of a shot and then to use the cluster centers as key-frames. Temporal segments are, however, not defined. On the contrary, the clusters do not have to correspond to temporal segments. An example would be a shot in which the camera pans from person A to person B and then back to person A. For this shot, which consists of three segments, two clusters would be formed, one for person A and one for person B, and two key-frames would be selected accordingly. Temporal connections between the key-frames of a shot are thus lost here. In publication VI of J. Kang et al.: “An Effective Method For Video Segmentation And Sub-Shot Characterization” (ICASSP 2002, Proc. Vol. 4, pages IV-3652-3655), sub-shots are likewise defined as parts of a shot with homogeneous motion properties (e.g. a coherent zoom or pan). The motion in a shot can thus be described per segment.
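The A, B, A example can be reproduced with a small sketch. It does not implement the algorithm of publication V; a simple threshold-based ("leader") clustering stands in for the unsupervised clustering, and the one-dimensional picture features, the radius and all function names are assumptions for illustration.

```python
from typing import List, Tuple

Feature = List[float]  # assumption: one small feature vector per video picture


def distance(a: Feature, b: Feature) -> float:
    """L1 distance between two feature vectors."""
    return sum(abs(x - y) for x, y in zip(a, b))


def cluster_pictures(features: List[Feature],
                     radius: float) -> Tuple[List[List[int]], List[Feature]]:
    """Assign each picture to the first cluster whose centroid is within
    `radius`; otherwise open a new cluster. Temporal order plays no role
    in the assignment."""
    clusters: List[List[int]] = []
    centroids: List[Feature] = []
    for idx, f in enumerate(features):
        for c, centroid in enumerate(centroids):
            if distance(f, centroid) <= radius:
                clusters[c].append(idx)
                n = len(clusters[c])
                # Update the centroid as the running mean of its members.
                centroids[c] = [(m * (n - 1) + x) / n
                                for m, x in zip(centroid, f)]
                break
        else:
            clusters.append([idx])
            centroids.append(list(f))
    return clusters, centroids


def key_frames(features: List[Feature], clusters: List[List[int]],
               centroids: List[Feature]) -> List[int]:
    """One key-frame per cluster: the member closest to the centroid."""
    return sorted(
        min(members, key=lambda i: distance(features[i], centroids[c]))
        for c, members in enumerate(clusters)
    )


# Shot with three temporal segments: person A, person B, person A again.
shot = [[0.0]] * 3 + [[1.0]] * 3 + [[0.0]] * 3
clusters, centroids = cluster_pictures(shot, radius=0.3)
print(clusters)
print(key_frames(shot, clusters, centroids))
```

In this toy run the first and third segments fall into the same cluster, so only two key-frames are produced for three temporal segments and the information that person A appears twice is lost.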
Publication VII of Zhang et al.: “An integrated system for content-based video retrieval and browsing” (Pattern Recognition, vol. 30, no. 4, pages 643-658, 1997) presents an approach for selecting key-frames for a shot. It is observed that a shot cannot be reasonably represented by only one key-frame. Therefore, besides the first picture of a shot, further pictures may be selected as key-frames. This is done by means of two methods (a simple comparison of the visual similarity between individual pictures, and global motion analysis for determining camera motions). The key-frame selection is thus indeed performed depending on the video content, but no segments (sub-shots) are defined which could subsequently be used for scene definition. In addition, at least the first picture of a shot is always used as a key-frame. This picture is, however, unfavorable in particular with gradual shot transitions (fade-out, fade-in, cross-fade, wipe), since it poorly represents the shot.
A method for key-frame selection that also covers sub-shots is known from US 2004/0012623 and is regarded as the closest state of the art for the invention. Sensors within the video camera provide data on the rotation of the camera in the x- and/or y-direction. In addition, the user's operation of the zoom switch on the video camera is logged. Based on these data regarding the camera motion, the starting points of pan and zoom motions of the video camera are defined as sub-shot boundaries within shots. Key-frames are then determined for the sub-shots depending on the camera motion data. A prerequisite for the method is thus that the camera motion data are logged by the camera for the video sequence and are available. Furthermore, only camera motions result in additional sub-shots and thus in additional key-frames. Local motions of objects or other effects in already edited videos are ignored. Thus, not all relevant changes are taken into consideration for the key-frame selection.
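The principle of deriving sub-shot boundaries from logged camera motion can be sketched as follows. The log format, the field names and the pan threshold are hypothetical and are not taken from US 2004/0012623; the sketch only marks the pictures at which a pan or a zoom starts.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class MotionSample:
    """Assumed per-picture camera log entry (hypothetical format)."""
    pan_rate: float  # rotation rate reported by the camera sensors
    zooming: bool    # True while the zoom switch is operated


def sub_shot_boundaries(log: List[MotionSample],
                        pan_threshold: float = 1.0) -> List[int]:
    """Return the picture indices at which a pan or a zoom motion starts."""
    boundaries: List[int] = []
    prev_pan = prev_zoom = False
    for i, sample in enumerate(log):
        pan = abs(sample.pan_rate) > pan_threshold
        pan_starts = pan and not prev_pan
        zoom_starts = sample.zooming and not prev_zoom
        # Picture 0 is the shot start anyway, so no extra boundary is set there.
        if i > 0 and (pan_starts or zoom_starts):
            boundaries.append(i)
        prev_pan, prev_zoom = pan, sample.zooming
    return boundaries


# Static camera, then a pan, then static again, then a zoom.
log = ([MotionSample(0.0, False)] * 3 + [MotionSample(5.0, False)] * 3
       + [MotionSample(0.0, False)] * 2 + [MotionSample(0.0, True)] * 2)
print(sub_shot_boundaries(log))
```

Since the function sees only the camera log, an object moving in front of a static camera produces no boundary, which corresponds to the limitation noted above.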