The invention relates to a method and a device for clustering key images by using spatial-temporal attributes.
Clustering aims to group data by using measurements of distance or similarity between them, data that are not far apart being placed within one and the same class. One application to digital video data is the automatic construction of video summaries.
FIG. 1 represents a general scheme of a method of constructing a video summary of an image sequence.
In a first step 1, the video sequence is segmented into video shots. A second step 2 extracts characteristic images, or key images, from the various shots of the sequence. In step 3, a signature is calculated for each key image, for example on the basis of attributes of the image such as colour, structure, etc. The next step 4 clusters the shots which resemble one another into clusters of shots. To this end, a measurement of similarity is performed on the basis of a calculation of distances between the signatures of the key images associated with each shot. The next step 5 constructs the summary by extracting one key image per cluster of shots.
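Steps 3 to 5 above can be illustrated by the following minimal sketch. It is given under stated assumptions and is not the implementation of the method: the key images are assumed to have already been extracted (steps 1 and 2), the signature is assumed to be a normalised grey-level histogram, the distance is assumed to be an L1 distance, and the clustering is a simple greedy assignment chosen only for brevity. All function names and the sample images are hypothetical.

```python
from collections import Counter

def signature(image, bins=4, levels=256):
    """Step 3 (assumed form): normalised grey-level histogram of a key image."""
    counts = Counter(pixel * bins // levels for row in image for pixel in row)
    total = sum(counts.values())
    return [counts.get(b, 0) / total for b in range(bins)]

def distance(sig_a, sig_b):
    """L1 distance between two signatures (0 = identical, 2 = disjoint)."""
    return sum(abs(a - b) for a, b in zip(sig_a, sig_b))

def cluster_shots(signatures, threshold):
    """Step 4 (greedy sketch): join the first cluster whose first shot is close enough."""
    clusters = []
    for shot, sig in enumerate(signatures):
        for members in clusters:
            if distance(signatures[members[0]], sig) <= threshold:
                members.append(shot)
                break
        else:
            # No existing cluster is close enough: start a new one.
            clusters.append([shot])
    return clusters

def summarise(clusters):
    """Step 5: keep one key image per cluster (here, that of its first shot)."""
    return [members[0] for members in clusters]

# Hypothetical key images, one per shot, for a sequence with alternating shots.
dark = [[10, 20], [30, 40]]
bright = [[200, 210], [220, 230]]
key_images = [dark, bright, dark, bright]

signatures = [signature(img) for img in key_images]
clusters = cluster_shots(signatures, threshold=0.5)
summary = summarise(clusters)
print(clusters, summary)
```

In this example the four alternating shots reduce to two clusters, so the summary retains two key images instead of four.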
This method makes it possible to reduce the number of characteristic shots, for example in video sequences with alternating shots, so as to create video summaries, indices, etc.
A known approach to the problem of clustering shots for the construction of video summaries is that proposed by Yeung and Yeo in the document “Segmentation of video by clustering and graph analysis”, Computer Vision and Image Understanding, vol. 71, no. 1, July 1998, pp. 94-109. With each pair of shots is associated a distance which measures the difference between their signatures. In addition to the distance between signatures, the proposed procedure refrains from clustering shots whose temporal distance is greater than a temporal threshold T. The underlying idea rests upon the assumption that shots belonging to one and the same semantic unit cannot be very distant in time. This assumption also has the advantage of limiting the number of potential clusterings and thus of limiting the calculation cost. The clustering algorithm used proceeds by successive clusterings of shots, commencing with the most similar, until all the remaining distances are greater than a threshold.
In this procedure, if two shots are separated by more than T images, they cannot be clustered together. In the article cited, T is fixed at a value of the order of a few thousand images. The main problem of this procedure resides in the fact that this threshold is fixed, that it strongly influences the final result and that it is therefore difficult to set a priori. For example, if a dialogue scene lasts more than 3000 images and this duration exceeds the temporal threshold, the scene is over-segmented. All the shots thus clustered must be pairwise close both visually and temporally. The size of the clusters thus generated is therefore relatively limited. The final result of this clustering algorithm is characterized by clusters that are relatively uniform in terms of number of shots, this number generally being small and in any event limited by the value of the threshold T.
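The time-constrained procedure and its sensitivity to T can be illustrated by the following sketch. It is consistent with the description above but is not a reproduction of the cited article: each shot is assumed to be a pair (starting image number, scalar signature), the visual distance is assumed to be the absolute difference of signatures, and clusters are merged with a complete-link rule so that all shots of a cluster are pairwise close. The names `shot_distance`, `cluster_distance`, `time_constrained_clustering` and the threshold `delta` are hypothetical.

```python
import math

def shot_distance(a, b, T):
    """Visual distance, or infinity when the shots are more than T images apart."""
    (frame_a, sig_a), (frame_b, sig_b) = a, b
    if abs(frame_a - frame_b) > T:
        return math.inf
    return abs(sig_a - sig_b)

def cluster_distance(c1, c2, shots, T):
    """Complete link: the largest pairwise shot distance between two clusters."""
    return max(shot_distance(shots[i], shots[j], T) for i in c1 for j in c2)

def time_constrained_clustering(shots, T, delta):
    """Merge the two closest clusters until every distance exceeds delta."""
    clusters = [[i] for i in range(len(shots))]
    while len(clusters) > 1:
        d, p, q = min(
            (cluster_distance(clusters[p], clusters[q], shots, T), p, q)
            for p in range(len(clusters))
            for q in range(p + 1, len(clusters))
        )
        if d > delta:
            break
        clusters[p] = sorted(clusters[p] + clusters[q])
        del clusters[q]
    return clusters

# Hypothetical dialogue scene of 4000 images: eight shots of 500 images,
# alternating between speaker A (signature 0.0) and speaker B (signature 1.0).
shots = [(500 * k, float(k % 2)) for k in range(8)]

small_T = time_constrained_clustering(shots, T=1500, delta=0.1)
large_T = time_constrained_clustering(shots, T=4000, delta=0.1)
print(small_T)
print(large_T)
```

With T = 1500 the scene, which is longer than the threshold, is split into four clusters of two shots each, whereas with T = 4000 the same shots form the two expected clusters, one per speaker. The result therefore depends directly on a threshold that has to be chosen before the duration of the scenes is known.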