Methods are known for the extraction of compact descriptors from still images; such methods include filtering local interest point descriptors, aggregating them into global descriptors, and compressing the descriptors by means such as dimensionality reduction and binarisation. Examples of such methods are:

- Fisher Vectors, as described by F. Perronnin and C. Dance: Fisher kernels on visual vocabularies for image categorization. In IEEE Conference on Computer Vision and Pattern Recognition, pages 1-8, June 2007;
- Scalable Compressed Fisher Vectors (SCFV), as described by J. Lin, L.-Y. Duan, T. Huang, and W. Gao: Robust Fisher codes for large scale image retrieval. In IEEE International Conference on Acoustics, Speech and Signal Processing, pages 1513-1517, May 2013;
- VLAD and its improvements, as described by H. Jegou, M. Douze, C. Schmid, and P. Perez: Aggregating local descriptors into a compact image representation. In IEEE Conference on Computer Vision and Pattern Recognition, pages 3304-3311, June 2010; and by R. Arandjelovic and A. Zisserman: All about VLAD. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1578-1585, June 2013;
- VLAT, as described by D. Picard and P.-H. Gosselin: Improving image similarity with vectors of locally aggregated tensors. In IEEE International Conference on Image Processing, Brussels, BE, September 2011;
- CDVS, as defined in ISO/IEC 15938-13, Information technology—Multimedia content description interface—Part 13: Compact descriptors for visual search, 2014.
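To illustrate the aggregation step shared by methods of the VLAD family listed above, the following is a minimal sketch (toy data, hypothetical codebook): each local descriptor is assigned to its nearest codebook centroid, the residuals are accumulated per centroid, and the concatenation is L2-normalised into a single compact global descriptor. This is a simplified illustration of the published technique, not of the invention described later in this document.

```python
import math

def vlad(descriptors, centroids):
    """VLAD-style aggregation: sum of residuals per nearest centroid,
    concatenated and L2-normalised (toy pure-Python sketch)."""
    d = len(centroids[0])
    agg = [[0.0] * d for _ in centroids]
    for x in descriptors:
        # hard assignment to the nearest centroid (squared Euclidean distance)
        k = min(range(len(centroids)),
                key=lambda i: sum((a - b) ** 2 for a, b in zip(x, centroids[i])))
        # accumulate the residual (descriptor minus centroid)
        for j in range(d):
            agg[k][j] += x[j] - centroids[k][j]
    flat = [v for row in agg for v in row]
    norm = math.sqrt(sum(v * v for v in flat)) or 1.0
    return [v / norm for v in flat]

# hypothetical 2-D local descriptors and a 2-word codebook
centroids = [[0.0, 0.0], [1.0, 1.0]]
descriptors = [[0.1, 0.0], [0.9, 1.1], [1.2, 0.8]]
print(vlad(descriptors, centroids))
```

In practice the codebook is learned (e.g. by k-means) and the descriptors are high-dimensional (e.g. SIFT), after which the aggregated vector may be further compressed by dimensionality reduction and binarisation, as noted above.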
- WO 2015/011185 A1, describing ALP ("A Low-degree Polynomial"), a method for detecting interest points in an image;
- WO 2013/102574 A1, describing a method for extracting, representing and matching local and global descriptors of still images;
- WO 2010/055399 A1, describing a method and apparatus for representing and identifying feature descriptors utilizing a compressed histogram of gradients;
- WO 2013/076365 A1, describing a method for detecting interest points as minima and/or maxima of filtered images and extracting descriptors for these interest points; and
- U.S. Pat. No. 9,131,163, describing a method for coding and compressing 3D surface descriptors.
All of the mentioned methods address the compact representation of descriptors of still images, but they do not exploit the temporal redundancy of descriptors extracted from an image sequence in order to achieve better compression and to reduce the computational complexity of comparing the descriptors of two image sequences.
For video data, EP 1147655 B1 describes a system and method for the description of videos based on contained objects and their relations. While able to describe video content in a semantic form, the method cannot be used for efficient visual matching, since the extraction of actual objects is infeasible due to its complexity and computational cost.
WO 2009/129243 A1 describes methods and systems for representation and matching of video content. The method aims at spatially and temporally aligning segments of video rather than determining a numeric value of their similarity. Although it performs feature selection, the method of WO 2009/129243 A1 does not encode the features in a compressed form. In addition, time and space coordinates are discarded, thus not allowing for spatiotemporal localisation.
A common problem in applications that process and manage image sequences (e.g., video databases) is to determine the similarity of image sequences based on the visual similarity of foreground or background objects visible in all, or in a temporal segment, of the image sequence. The analysis of image sequences differs significantly from video copy detection, for which a number of approaches exist (e.g., U.S. Pat. No. 7,532,804), and therefore requires a different approach. Moreover, additional intricacies may arise in cases where the objects used for determining similarity are visible only in a spatial, temporal or spatiotemporal segment of the image sequence, where objects are depicted from different views and under different conditions, and/or where image formation and processing may have differed. It is therefore one objective of the invention to provide a way of analyzing and describing an image sequence, in particular a video sequence, by a descriptor type which is compact and allows matching of two descriptors with little computational complexity, while being applicable to image sequences regardless of encoding type and bitrate.
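As a generic illustration of why compact binarised descriptors (such as those produced by the methods listed above) allow matching with little computational complexity, two binary descriptors can be compared by a Hamming distance, i.e. a popcount of the XOR of their bit strings. The descriptors below are hypothetical 64-bit values chosen for the example; this sketch illustrates the general technique, not the invention's matching method.

```python
def hamming(a: int, b: int) -> int:
    """Number of differing bits between two binary descriptors."""
    return bin(a ^ b).count("1")

# two hypothetical 64-bit binarised global descriptors
d1 = 0b1011001011110000101100101111000010110010111100001011001011110000
d2 = 0b1011001011010000101100101111100010110010111100001011001011110001

print(hamming(d1, d2))  # a small distance indicates similar descriptors
```

On modern hardware such a comparison compiles down to a handful of XOR and popcount instructions, which is what makes exhaustive matching over large descriptor databases tractable.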