The field is that of index construction for interactive intra-video navigation, or that of video structuring, that is to say the ordering of shots that makes it possible to define a table of contents.
A video sequence is composed of shots, each shot being delimited by picture breaks; the shots may themselves be grouped into scenes. Structuring involves a step of classifying the shots. This step is of course possible only if the content of the video can be structured, for example in fields such as sport, televised news, interviews, etc.
Usually, the classes are defined beforehand by supervised or unsupervised learning procedures, then the candidate shots of a video are attached to one of the classes, on the basis of a similarity measure.
The classification of the shots is a fundamental step of the video structuring process. Numerous procedures for classifying and representing shots are proposed, but few concern themselves with the identification of these classes.
For example, in a video summary creation context, the article by S. Uchihashi, J. Foote, A. Girgensohn, J. Boreczky entitled “Video Manga: Generating Semantically Meaningful Video Summaries”, Proc. ACM Multimedia, Orlando, Fla., pp 383-392, November 1999, describes a hierarchical grouping procedure within the step of structuring the sequence. The result is represented in the form of a tree. On initialization, each image of the video is assigned to its own class or cluster. Similar images are then grouped together by iteratively merging the two closest clusters at each step. At the root, one finds the maximum cluster containing the set of images. The desired number of clusters is then selected by specifying the distance of the merged clusters from their parent. By this procedure, similar shots are grouped together, but no information regarding the nature of the shots is obtained.
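The bottom-up grouping described above can be sketched as follows. This is a minimal illustration and not the procedure of the cited article: the Euclidean distance, the centroid-based comparison of clusters, and the names `agglomerate` and `max_merge_distance` are assumptions introduced here. Instead of building the full tree and cutting it afterwards, the sketch simply stops merging once the two closest clusters are further apart than a chosen distance.

```python
from itertools import combinations


def euclidean(a, b):
    """Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


def centroid(cluster, features):
    """Mean feature vector of the images whose indices are in `cluster`."""
    dim = len(features[0])
    return [sum(features[i][d] for i in cluster) / len(cluster) for d in range(dim)]


def agglomerate(features, max_merge_distance):
    """Merge the two closest clusters repeatedly (hypothetical helper).

    `features` holds one vector per image. Merging stops when the closest
    pair of clusters is further apart than `max_merge_distance`.
    """
    clusters = [[i] for i in range(len(features))]  # one cluster per image
    while len(clusters) > 1:
        best = None
        for a, b in combinations(range(len(clusters)), 2):
            d = euclidean(centroid(clusters[a], features),
                          centroid(clusters[b], features))
            if best is None or d < best[0]:
                best = (d, a, b)
        d, a, b = best
        if d > max_merge_distance:
            break
        merged = clusters[a] + clusters[b]
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)]
        clusters.append(merged)
    return clusters
```

The result is a list of groups of image indices. As the text notes, nothing in it says what kind of shot each group contains.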
On the other hand, for the structuring of televised news, as described in the article by H. J. Zhang, S. Y. Tan, S. W. Smoliar, G. Yihong entitled “Automatic parsing and indexing of news video”, Multimedia Systems, 2(6):256-265, 1995, one seeks to distinguish two types of shots: those concerning the presenter and those relating to reporter footage. The shots of the presenter are identified with the aid of spatial characteristics: typically, a person in the foreground and an inlay in the top right or left. The first step consists in defining a model A of the image representative of a shot of the presenter. In the second step, the shots are labelled as belonging to A or not, with the aid of a measure of similarity using local descriptors, the key image having previously been segmented into regions. In this procedure, a shot of interest is modelled first, then all the shots which come close to this model are selected.
Another application of the selection of shots of interest is to identify the shots concerning the interviewee and those of the interviewer in a video of an interview. In this approach, for example described in the article by O. Javed, S. Khan, Z. Rasheed, M. Shah entitled “A Framework for Segmentation of Interview Videos”, IASTED Intl. Conf. Internet and Multimedia Systems and Applications, Las Vegas, November, 2000, one is more interested in the information carried by the transitions between shots, coupled with the knowledge of the structure of an interview video, alternate shots of the interviewer and of the interviewee, than in the analysis of the content of the scene. However, a skin detection algorithm is used to determine the number of people in the image. Since the questions are typically shorter than the answers, the assumption used is that the shots of the interviewer are among the shortest. The key images of the N shortest shots containing just one person are correlated to find the most repetitive shot. One thus obtains an N×N correlation matrix whose rows are summed. The key image corresponding to the maximum sum is then identified as the key image of the interviewer. It is again correlated with all the other images to find all the shots concerned therewith.
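The selection of the interviewer key image can be illustrated with the following sketch. It is not the implementation of the cited article: the dictionary layout of a shot (`id`, `duration`, `people`, `key_image`), the use of a Pearson correlation on flat feature vectors, and the function names are assumptions made for illustration. The number of people per shot is taken as given, standing in for the skin detection step.

```python
def pearson(a, b):
    """Pearson correlation of two equal-length vectors (0.0 if one is constant)."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0
    return cov / (var_a * var_b) ** 0.5


def find_interviewer_shot(shots, n):
    """Return the id of the most repetitive shot among the n shortest
    single-person shots (hypothetical helper)."""
    single = [s for s in shots if s["people"] == 1]
    candidates = sorted(single, key=lambda s: s["duration"])[:n]
    # N x N correlation matrix between the candidate key images.
    matrix = [[pearson(a["key_image"], b["key_image"]) for b in candidates]
              for a in candidates]
    row_sums = [sum(row) for row in matrix]
    best = max(range(len(candidates)), key=row_sums.__getitem__)
    return candidates[best]["id"]
```

The row with the largest sum belongs to the key image that correlates best with the other candidates, which is how the text characterises the most repetitive shot.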
FIG. 1 represents, in a known manner, a general scheme of the construction of video summaries. In a first step referenced 1, the sequence of images is split into shots, the shots corresponding to picture breaks. For each shot, one or more characteristic images are selected, these being key images. This is the object of step 2. For each key image, a signature is calculated in step 3, using local descriptors or attributes, for example colour, contours, texture, etc. Step 4 performs a selection of shots of interest as a function of these signatures or attributes and a summary is made in step 5 on the basis of these shots of interest.
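Steps 1 to 3 of this general scheme can be sketched as below. The sketch rests on several assumptions that do not come from the text: frames are flat lists of grey levels between 0 and 255, the signature is a normalised grey-level histogram, a picture break is declared when the L1 distance between consecutive signatures exceeds `cut_threshold`, and the key image is taken as the middle frame of the shot.

```python
def histogram(frame, bins=4, max_value=256):
    """Normalised grey-level histogram used here as a simple signature."""
    counts = [0] * bins
    for pixel in frame:
        counts[pixel * bins // max_value] += 1
    return [c / len(frame) for c in counts]


def l1_distance(a, b):
    """Sum of absolute differences between two signatures."""
    return sum(abs(x - y) for x, y in zip(a, b))


def split_into_shots(frames, cut_threshold=0.5):
    """Step 1: return (first, last) frame indices of each shot."""
    if not frames:
        return []
    signatures = [histogram(f) for f in frames]
    shots, start = [], 0
    for i in range(1, len(frames)):
        # A large jump between consecutive signatures is treated as a picture break.
        if l1_distance(signatures[i - 1], signatures[i]) > cut_threshold:
            shots.append((start, i - 1))
            start = i
    shots.append((start, len(frames) - 1))
    return shots


def key_image_index(shot):
    """Step 2: pick the middle frame of the shot as its key image."""
    first, last = shot
    return (first + last) // 2


def shot_signatures(frames, cut_threshold=0.5):
    """Step 3: one signature per shot, computed on its key image."""
    return [histogram(frames[key_image_index(s)])
            for s in split_into_shots(frames, cut_threshold)]
```

The list returned by `shot_signatures` is the kind of input on which a selection of shots of interest (step 4) would then operate.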
The nature of the shots of interest varies as a function of the intended application. For example, for televised news, it may involve the presenter. These shots of interest often correspond to the prevalent shots, that is to say to a dominant picture. Specifically, in certain sequences, in particular sports sequences, the most interesting moments are characterized by a common and repetitive picture in the course of the sequence, for example during a football, tennis, baseball match, etc.
The invention is more particularly related to the step of selecting the shots of interest. The procedure proposed is based on the signature of each key image, associated with a metric, so as to determine in a binary manner whether or not the shots belong to the class of shots of interest.
Numerous algorithms exist for partitioning, or “clustering”. Among the most widely used are K-means, based on calculating the barycentre of the attributes, and its variant K-medoid, which uses a physical point, that is to say the image closest to the barycentre. Both are iterative algorithms. Starting from an initial partition, K-means or K-medoid groups the data into a fixed number of classes. This grouping is very sensitive to the initial partition. Moreover, it requires the number of classes to be fixed a priori, which presupposes a priori knowledge of the content of the video. Failing this, it does not guarantee an optimal partitioning of the video sequence processed.
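The sensitivity to the initial partition can be seen on a very small example. The sketch below is a plain K-means written for illustration only; the function names, the two-dimensional toy points and the use of squared Euclidean distance are assumptions. The `medoid` helper shows the K-medoid idea of replacing the barycentre by the data point closest to it.

```python
def dist2(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def barycentre(points):
    """Component-wise mean of a non-empty list of points."""
    return tuple(sum(p[d] for p in points) / len(points)
                 for d in range(len(points[0])))


def kmeans(points, centres, iterations=20):
    """Plain K-means starting from the given initial centres."""
    centres = list(centres)
    groups = [[] for _ in centres]
    for _ in range(iterations):
        groups = [[] for _ in centres]
        for p in points:
            nearest = min(range(len(centres)), key=lambda k: dist2(p, centres[k]))
            groups[nearest].append(p)
        # An empty class keeps its previous centre.
        new_centres = [barycentre(g) if g else c for g, c in zip(groups, centres)]
        if new_centres == centres:
            break
        centres = new_centres
    return centres, groups


def medoid(points):
    """K-medoid idea: the actual data point closest to the barycentre."""
    centre = barycentre(points)
    return min(points, key=lambda p: dist2(p, centre))


def total_error(centres, groups):
    """Sum of squared distances of the points to the centre of their class."""
    return sum(dist2(p, c) for c, g in zip(centres, groups) for p in g)


# Four points at the corners of a 4 x 1 rectangle, two classes requested.
corners = [(0, 0), (0, 1), (4, 0), (4, 1)]
good = kmeans(corners, [(0, 0), (4, 0)])  # converges to left / right
poor = kmeans(corners, [(2, 0), (2, 1)])  # converges to bottom / top
```

Both runs converge, but to different partitions: the first initialisation separates the left pair from the right pair, while the second stays with a bottom/top split whose total error is much larger. The number of classes, two here, also had to be chosen before running the algorithm.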