A fingerprint (in the literature also referred to as a signature or hash) is a digital summary of an information signal. In cryptography, hashes have long been used to verify the correct reception of large files. More recently, the concept of hashing has been introduced to identify multimedia content. Unknown content such as an audio or video clip is recognized by comparing a fingerprint extracted from said clip with a collection of fingerprints stored in a database. In contrast with a cryptographic hash, which is extremely fragile (flipping a single bit in the large file will result in a completely different hash), a fingerprint extracted from audio-visual content must be robust. To a large extent, it must be invariant to processing such as compression and decompression, and A/D or D/A conversion.
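The fragility of a cryptographic hash can be illustrated with a minimal sketch. SHA-256 and the one-megabyte stand-in for a "large file" are assumptions chosen for illustration only; the text does not name a particular hash function.

```python
import hashlib


def bit_difference(a: bytes, b: bytes) -> int:
    """Count the bit positions in which two equal-length digests differ."""
    return sum(bin(x ^ y).count("1") for x, y in zip(a, b))


# Hypothetical stand-in for a large file (assumption: 1 MB of constant bytes).
data = bytearray(b"x" * 1_000_000)
original = hashlib.sha256(data).digest()

# Flip a single bit in the middle of the file and hash it again.
data[500_000] ^= 0x01
flipped = hashlib.sha256(data).digest()

# Number of digest bits that changed after the single-bit flip.
changed_bits = bit_difference(original, flipped)
print(f"{changed_bits} of {len(original) * 8} digest bits changed")
```

A robust fingerprint is expected to behave in the opposite way: small changes to the signal should change only a small fraction of the fingerprint bits.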
A prior-art fingerprinting system is disclosed in Haitsma et al.: Robust Hashing for Content Identification, published at the Content-Based Multimedia Indexing (CBMI) conference in Brescia (Italy), 2001. As described in this article, the fingerprint is derived from a perceptually essential property of the content, viz. the distribution of energy in bands of the audio frequency spectrum. For video signals, the distribution of luminance levels in video images has been proposed to constitute the basis for a robust fingerprint.
A fingerprint is created by dividing the signal into a series of (possibly overlapping) frames, and extracting a hash word representing the perceptually essential property of the signal within each frame to obtain a respective series of hash words. In order to identify an unknown clip, the database receives the series of hash words concerned and searches for the most similar stored series of hash words. Similarity is measured by determining how many bits of the received series match a series of hash words in the database. If the BER (Bit Error Rate, the percentage of non-matching bits) is below a certain threshold, the clip is identified as the song or movie from which the most similar series of hash words in the database originates.
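The matching step can be sketched as follows. This is a brute-force illustration under stated assumptions, not the search strategy of the cited article: the function names, the dictionary-based database, and the threshold of 0.25 are hypothetical, and the BER is expressed as a fraction between 0 and 1 rather than as a percentage.

```python
def bit_errors(a, b):
    """Number of differing bits between two equal-length series of hash words."""
    return sum(bin(x ^ y).count("1") for x, y in zip(a, b))


def bit_error_rate(query, stored, bits_per_word=32):
    """Fraction of non-matching bits between two series of hash words."""
    if not query or len(query) != len(stored):
        raise ValueError("series must be non-empty and of equal length")
    return bit_errors(query, stored) / (len(query) * bits_per_word)


def identify(query, database, threshold=0.25):
    """Return (title, BER) of the most similar stored series.

    The query is compared at every offset of every stored series
    (exhaustive search, assumed here for clarity only). The title is
    None when the lowest BER is not below the threshold.
    """
    best_title, best_ber = None, 1.0
    for title, words in database.items():
        for offset in range(len(words) - len(query) + 1):
            ber = bit_error_rate(query, words[offset:offset + len(query)])
            if ber < best_ber:
                best_title, best_ber = title, ber
    if best_title is not None and best_ber < threshold:
        return best_title, best_ber
    return None, best_ber


# Hypothetical database: two titles with four 32-bit hash words each.
database = {
    "song_a": [0x00000000, 0xFFFFFFFF, 0x12345678, 0x0F0F0F0F],
    "song_b": [0xAAAAAAAA, 0x55555555, 0xDEADBEEF, 0x00000000],
}

# Query taken from song_a (words 1 and 2) with one bit flipped.
query = [0xFFFFFFFF ^ 0x1, 0x12345678]
title, ber = identify(query, database)
print(title, ber)
```

In this sketch the query differs from the stored excerpt in one bit out of 64, so it falls well below the assumed threshold and is attributed to "song_a".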
A problem of the prior-art fingerprinting method is the size of the database. In the Haitsma et al. article, the audio signal is divided into frames of approximately 0.4 seconds with an overlap of 31/32. This results in a new frame approximately every 11.6 ms (about 1/32 of the frame length). For every frame, a 32-bit hash word is extracted. Accordingly, a 5-minute song needs approximately 100 kbytes, viz. 5 (minutes)×60 (seconds per minute)×4 (bytes per hash word)/0.0116 (seconds per hash word). Needless to say, the database must have a huge capacity to allow recognition of a large repertoire of songs. Similar considerations apply to video fingerprinting systems.
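The storage estimate can be reproduced with a short calculation. The function name and the repertoire of one million songs are hypothetical and serve only to give a sense of scale; the frame interval and the hash word size are the figures quoted above.

```python
def fingerprint_bytes(duration_s, hop_s=0.0116, bytes_per_word=4):
    """Approximate fingerprint size: one hash word per frame interval."""
    return duration_s / hop_s * bytes_per_word


# A 5-minute song, as in the estimate above.
song_bytes = fingerprint_bytes(5 * 60)
print(f"per song: {song_bytes / 1000:.1f} kbytes")

# Hypothetical repertoire size (assumption, not a figure from the text).
repertoire = 1_000_000
total_gbytes = song_bytes * repertoire / 1e9
print(f"{repertoire} songs: about {total_gbytes:.0f} Gbytes")
```

With these figures a single 5-minute song takes a little over 100 kbytes, so the total grows to roughly 100 Gbytes for the assumed repertoire.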