With movies, television, music and other audio and video being almost ubiquitous in today's society, there is a growing desire to be able to identify such content automatically. Automatically identifying content opens various possibilities, for example acquiring metadata for such content (title, artist, genre, lyrics, reviews, ratings and so on), or providing additional content or activities to accompany the content. Another attractive application is broadcast monitoring: identifying broadcasts and compiling lists for use in e.g. determining royalty payouts to copyright holders.
One technique for obtaining an identifier for content is called fingerprinting, sometimes also referred to as signature creation, robust fingerprinting, robust hashing or feature extraction. A (robust) fingerprint of a content item is a representation of the most relevant perceptual features of the item.
Generally speaking, fingerprinting algorithms have two performance criteria: discrimination and robustness. A discriminative fingerprinting algorithm allows differentiating two information signals from each other. That is, it should be statistically unlikely to get two similar fingerprints from two dissimilar signals. A robust fingerprinting algorithm allows for identifying the same information signal with various distortions. That is, the fingerprints computed from two distorted versions of the same signal should be the same or at least very similar to each other. Distortions may be accidental or intentional: from low-quality radio broadcasts of music to cropping or resizing of movies or the adding of subtitles, overlays or watermarks.
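By way of illustration only, the two criteria can be made concrete by measuring how many bits differ between two fingerprints. The sketch below assumes that a fingerprint is a list of fixed-width integer words and that similarity is judged by the fraction of differing bits; the function name `bit_error_rate` and the word width are hypothetical choices, not taken from any of the cited disclosures.

```python
def bit_error_rate(fp_a, fp_b, bits_per_word=32):
    """Fraction of differing bits between two equal-length fingerprints.

    Assumption: each fingerprint is a list of non-negative integers,
    each holding `bits_per_word` bits.
    """
    if len(fp_a) != len(fp_b):
        raise ValueError("fingerprints must have the same number of words")
    if not fp_a:
        raise ValueError("fingerprints must not be empty")
    # XOR leaves a 1 wherever the two words disagree; count those bits.
    differing = sum(bin(a ^ b).count("1") for a, b in zip(fp_a, fp_b))
    return differing / (len(fp_a) * bits_per_word)
```

Under this measure, a robust algorithm would yield a rate near zero for two distorted versions of the same signal, while a discriminative algorithm would yield a rate far from zero for dissimilar signals. Any decision threshold between the two is application-specific and is not specified here.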
Many schemes for identification and classification of information signals using fingerprinting have been proposed. Some examples are disclosed in U.S. Pat. Nos. 8,140,331B2, 8,380,518B2, 7,516,074B2, 8,440,900B2 and 8,492,633B2.
U.S. Pat. No. 8,204,314 discloses a method of generating a spatial signature or fingerprint for a frame of a video object. The frame is divided into plural blocks. For each block the mean luminance is calculated, and the relative ordering of blocks by luminance is transformed into a vector that is one of multiple inputs for the spatial signature or fingerprint. This process is performed at multiple levels by creating a more fine-grained division of blocks: first 2×2 blocks, then 4×4 blocks, and so on. A disadvantage of this method is that comparing all mean luminances against each other to create a relative ordering is slow. In addition, there is significant correlation between blocks, which reduces the robustness of the algorithm.
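The block-ordering idea summarized above can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented method: the frame is assumed to be a list of rows of luminance values, the ordering is assumed to be encoded as a plain rank vector, and the names `block_means`, `rank_vector` and `ordinal_signature` are hypothetical.

```python
def block_means(frame, blocks_per_side):
    """Mean luminance of each block in a blocks_per_side x blocks_per_side grid."""
    height, width = len(frame), len(frame[0])
    means = []
    for by in range(blocks_per_side):
        y0 = by * height // blocks_per_side
        y1 = (by + 1) * height // blocks_per_side
        for bx in range(blocks_per_side):
            x0 = bx * width // blocks_per_side
            x1 = (bx + 1) * width // blocks_per_side
            pixels = [frame[y][x] for y in range(y0, y1) for x in range(x0, x1)]
            means.append(sum(pixels) / len(pixels))
    return means


def rank_vector(values):
    """Rank of each value (0 = darkest block); ties keep raster order."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0] * len(values)
    for rank, index in enumerate(order):
        ranks[index] = rank
    return ranks


def ordinal_signature(frame, levels=(2, 4)):
    """Concatenate the rank vectors of successively finer block grids."""
    signature = []
    for blocks_per_side in levels:
        signature.extend(rank_vector(block_means(frame, blocks_per_side)))
    return signature
```

Because only the ordering of block means is kept, a uniform brightness offset leaves the signature unchanged. The sketch assumes the frame is at least as large as the finest grid in each dimension.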
U.S. Pat. No. 8,340,449 discloses a method of calculating fingerprints for videos based on their spatial and sequential characteristics. Pairs of adjacent pixels form the lowest-level values. Sums or differences of pairs are taken as a higher-level value. This process is repeated for each row, column, and time column in the video segment. The result is a three-dimensional array of coefficients that represents the spatial and sequential characteristics of all frames in the segment, which array is subsequently quantized, for example by comparing the magnitude of each coefficient to a predetermined threshold value. This flattens the array to a one-dimensional bit vector. In an example each coefficient is quantized to +1, −1, or 0 and a two-bit encoding scheme uses the bits 10 for +1, 01 for −1, and 00 for zero. The bit vector forms the fingerprint.
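The quantization and two-bit encoding step described above can be sketched as follows. This is a hedged illustration: it assumes that a coefficient whose magnitude does not exceed the threshold maps to 0 and that the sign decides between +1 and −1 otherwise. The names `flatten`, `quantize_ternary` and `encode_bits` are hypothetical, and the threshold value is left to the caller.

```python
# Two-bit code from the example in the text: 10 for +1, 01 for -1, 00 for 0.
TWO_BIT_CODE = {1: (1, 0), -1: (0, 1), 0: (0, 0)}


def flatten(array3d):
    """Flatten a three-dimensional nested list of coefficients to one dimension."""
    return [c for plane in array3d for row in plane for c in row]


def quantize_ternary(coefficients, threshold):
    """Map each coefficient to +1, -1 or 0 by comparing its magnitude to threshold.

    Assumption: magnitudes at or below the threshold are treated as zero.
    """
    symbols = []
    for c in coefficients:
        if abs(c) <= threshold:
            symbols.append(0)
        else:
            symbols.append(1 if c > 0 else -1)
    return symbols


def encode_bits(symbols):
    """Encode ternary symbols as a flat bit vector, two bits per symbol."""
    bits = []
    for s in symbols:
        bits.extend(TWO_BIT_CODE[s])
    return bits
```

For instance, `encode_bits(quantize_ternary(flatten(coefficients), threshold))` would turn a nested coefficient array into the bit vector that serves as the fingerprint in this reading.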
A disadvantage of this method is that computed frequency differences are correlated, which means the resulting fingerprints are not fully discriminative. Further, the calculation process is slow because of the complex calculations involved.
International patent application WO 02/065782 by Haitsma et al. discloses a method of generating a robust hash identifying an information signal comprising audio or audiovisual content such as a movie, television program or song. The method divides the information signal into frames, computes a hash word for each frame, and concatenates successive hash words to constitute the hash signal. Computing the hash word comprises subdividing each frame of the information signal into plural frequency sub bands, calculating a spectral property of the signal in each of said frequency sub bands, comparing the properties in the frequency sub bands with respective thresholds and representing the results of said comparisons by respective bits of the hash word.
FIG. 1 illustrates an embodiment of the Haitsma algorithm employing a 33×N spectrogram image with 33 frequency sub bands F on the y-axis and N frames on the x-axis. A 32-bit fingerprint is extracted at each frame based on a filtering technique. The energy difference between subsequent frames in time and between subsequent frequency sub bands in frequency is computed and compared with a threshold. A “1” bit corresponds to a positive difference value, while a “0” bit corresponds to a non-positive value. If we denote the energy of frequency band m at frame n as E(n,m), and the m-th bit of the fingerprint of frame n by B(n,m), the bits of a fingerprint can be expressed with the following formula:
$$F(n,m) = E(n,m) + E(n-1,m+1) - E(n,m+1) - E(n-1,m)$$

$$B(n,m) = \begin{cases} 1, & F(n,m) > 0 \\ 0, & F(n,m) \le 0 \end{cases}$$
In this way, a 32-bit fingerprint can be generated from 33 frequency sub bands for each frame. The Haitsma disclosure combines 256 of these frame fingerprints into a block fingerprint, and performs a search based on this block fingerprint.
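The bit derivation can be sketched in a few lines. This sketch assumes the filter takes the difference between adjacent sub bands and then between consecutive frames, i.e. F(n,m) = (E(n,m) − E(n,m+1)) − (E(n−1,m) − E(n−1,m+1)), which is the form given in the Haitsma publication. It further assumes that `energy[n][m]` holds E(n,m) and that bit m = 0 is stored as the most significant bit; the function names and the bit order are hypothetical choices.

```python
def frame_fingerprint(energy, n):
    """Fingerprint word for frame n (n >= 1); 33 sub bands yield 32 bits."""
    current, previous = energy[n], energy[n - 1]
    word = 0
    for m in range(len(current) - 1):
        # Difference along frequency, then along time.
        f = (current[m] - current[m + 1]) - (previous[m] - previous[m + 1])
        # Positive difference gives a 1 bit, non-positive gives a 0 bit.
        word = (word << 1) | (1 if f > 0 else 0)
    return word


def block_fingerprint(energy, start=1, frames=256):
    """Fingerprints of `frames` consecutive frames, beginning at frame `start`."""
    return [frame_fingerprint(energy, n) for n in range(start, start + frames)]
```

With 33 energies per frame, each word fits in 32 bits, and a block of 256 such words corresponds to the block fingerprint on which the search is performed. The spectrogram must contain at least `start + frames` frames.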
A disadvantage of the Haitsma algorithm is that the computed frequency differences are correlated, thus the resulting fingerprints are not fully discriminative. Even when it is assumed that the input frequency sub bands are uncorrelated, the filtering (difference operation) introduces some correlation between filtered values.
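The correlation introduced by the difference operation can be made explicit with a short worked example. For simplicity it considers only the difference along frequency within a single frame, and it assumes that the sub band energies are mutually uncorrelated with equal variance; the symbols D, σ and ρ are introduced here for illustration only.

```latex
% Assumption: E(m), E(m+1), E(m+2) are mutually uncorrelated, each with variance \sigma^2.
\begin{align*}
D(m) &= E(m) - E(m+1), \qquad D(m+1) = E(m+1) - E(m+2) \\
\operatorname{Var}[D(m)] &= \operatorname{Var}[D(m+1)] = 2\sigma^2 \\
\operatorname{Cov}[D(m), D(m+1)] &= -\operatorname{Var}[E(m+1)] = -\sigma^2 \\
\rho &= \frac{-\sigma^2}{2\sigma^2} = -\tfrac{1}{2}
\end{align*}
```

Under these assumptions, neighbouring filtered values share the term E(m+1) with opposite signs, so they have a correlation coefficient of −1/2 even though the inputs are uncorrelated. Bits derived from them are therefore not independent.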
Further, in the Haitsma algorithm, the difference between individual frequency sub bands is susceptible to small changes that impact one or more frequency sub bands. In general, differences computed from larger frequency ranges (e.g. over multiple frequency sub bands) are more robust against noise introduced by audio or video processing.