To obtain a characteristic representative of the visual rhythm of a video, known methods use local semantic properties. These properties include, for example, the identification of shots, scenes or movements of objects, the detection of embedded text, and the recognition of shapes or faces.
The main tasks of video indexing rely either on a description of the document as a whole or on detecting breaks or discontinuities in the film. These discontinuities can relate to movement, color, etc. That approach requires scanning the entire film in search of these discontinuities, which is costly and very time-consuming.
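By way of illustration only, the following minimal sketch shows why such discontinuity detection requires visiting every image of the film. It assumes that each image has already been reduced to a color histogram. The function names `histogram_distance` and `find_discontinuities` and the threshold value are hypothetical and are not taken from any of the known methods discussed here.

```python
from typing import List, Sequence


def histogram_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """L1 distance between two histograms, each normalized to sum to 1."""
    total_a = sum(a) or 1.0  # guard against an all-zero histogram
    total_b = sum(b) or 1.0
    return sum(abs(x / total_a - y / total_b) for x, y in zip(a, b))


def find_discontinuities(
    histograms: List[Sequence[float]], threshold: float = 0.5
) -> List[int]:
    """Return the indices of images whose histogram differs strongly
    from that of the preceding image.

    Every consecutive pair of images is compared, so the cost grows
    with the full length of the film.
    """
    cuts = []
    for i in range(1, len(histograms)):
        if histogram_distance(histograms[i - 1], histograms[i]) > threshold:
            cuts.append(i)
    return cuts


# Toy example: two mostly "red" images followed by two mostly "blue" images.
frames = [[10, 0, 0], [9, 1, 0], [0, 0, 10], [0, 1, 9]]
print(find_discontinuities(frames))
```

In this toy example the only large histogram change occurs between the second and third images, so a single discontinuity is reported at index 2.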
Another aspect of the usual techniques that is very costly in terms of resources is identifying and choosing key images in each shot. The key images are defined as the most significant images.
Accordingly, those techniques seek to identify object trajectories, text, the roles of characters from their costumes, faces, movements of the human body, etc. For example, a video is identified as relating to sport if movements of balls or players are captured. Those techniques thus rely on local descriptors that carry a meaning (ball, players, etc.).
In contrast to those techniques, there are also fast, statistically based methods, such as macrosegmentation, that take into account the statistical characteristics of the signal, for example very low-level audio or video data.
For example, the document by B. Ionescu et al. concerning analysis of the rhythm of a film, entitled “Analyse et caractérisation de séquences de films d'animation” [Analysis and characterization of animated film sequences] (Orasis 2005), describes a technique that calculates mean values and shifts between different shots of a film. However, that method uses shots spaced several tens of seconds apart, and detection in that situation requires a very complex algorithm.
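Purely as an illustration of what shot-level statistics of this kind might look like, the sketch below computes the duration of each shot, the mean duration, and the shift of each shot from that mean. It assumes that the shot boundaries are already known as timestamps in seconds. It is one possible reading of "mean values and shifts" and does not reproduce the algorithm of the cited document. The function name `shot_rhythm` is hypothetical.

```python
from statistics import mean
from typing import List, Tuple


def shot_rhythm(boundaries: List[float]) -> Tuple[List[float], float, List[float]]:
    """Compute simple rhythm statistics from shot boundary timestamps.

    boundaries: increasing timestamps (in seconds) marking the start of
    the first shot, each transition, and the end of the last shot.
    Returns (durations, mean duration, shift of each duration from the mean).
    """
    if len(boundaries) < 2:
        raise ValueError("need at least two boundaries to define a shot")
    # Duration of each shot is the gap between consecutive boundaries.
    durations = [end - start for start, end in zip(boundaries, boundaries[1:])]
    average = mean(durations)
    shifts = [d - average for d in durations]
    return durations, average, shifts


# Toy example: three shots lasting 20 s, 30 s and 10 s.
durations, average, shifts = shot_rhythm([0.0, 20.0, 50.0, 60.0])
print(durations, average, shifts)
```

Note that this sketch presupposes that the shot boundaries have already been detected, which is precisely the costly step in the known methods.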