1. Field of the Invention
The present invention relates generally to audio signal processing, and more particularly to extracting characteristic fingerprints from audio signals and to searching a database of such fingerprints.
2. Background of the Invention
Because of the variations in file formats, compression technologies, and other methods of representing data, the problem of identifying a data signal or comparing it to others raises significant technical difficulties. For example, in the case of digital music files on a computer, there are many formats for encoding and compressing the songs. In addition, the songs are often sampled into digital form at different data rates and have different characteristics (e.g., different waveforms). Recorded analog audio also contains noise and distortions. These significant waveform differences make direct comparison of such files a poor choice for efficient file or signal recognition or comparison. Direct file comparison also does not allow comparison of media encoded in different formats (e.g., comparing the same song encoded in MP3 and WAV).
For these reasons, identifying and tracking media and other content, such as that distributed over the Internet, is often done by attaching metadata, watermarks, or some other code that contains identification information for the media. But this attached information is often incomplete, incorrect, or both. For example, metadata is rarely complete, and filenames are even more rarely uniform. In addition, approaches such as watermarking are invasive, altering the original file with the added data or code. Another drawback of these approaches is that they are vulnerable to tampering. Even if every media file were to include accurate identification data such as metadata or a watermark, the files could be “unlocked” (and thus pirated) if the information were successfully removed.
To avoid these problems, other methods have been developed based on the concept of analyzing the content of a data signal itself. In one class of methods, an audio fingerprint is generated for a segment of audio, where the fingerprint contains characteristic information about the audio that can be used to identify the original audio. In one example, an audio fingerprint comprises a digital sequence that identifies a fragment of audio. The process of generating an audio fingerprint is often based on acoustical and perceptual properties of the audio for which the fingerprint is being generated. Audio fingerprints typically have a much smaller size than the original audio content and thus may be used as a convenient tool to identify, compare, and search for audio content. Audio fingerprinting can be used in a wide variety of applications, including broadcast monitoring, audio content organization, filtering of content on P2P networks, and identification of songs or other audio content. As applied to these various areas, audio fingerprinting typically involves fingerprint extraction algorithms as well as fingerprint database searching algorithms.
Most existing fingerprinting techniques are based on extracting audio features from an audio sample in the frequency domain. The audio is first segmented into frames, and for every frame a set of features is computed. Among the audio features that can be used are Fast Fourier Transform (FFT) coefficients, Mel Frequency Cepstral Coefficients (MFCC), spectral flatness, sharpness, entropy, and modulation frequency. The computed features are assembled into a feature vector, which is usually transformed using derivatives, means, or variances. The feature vector is mapped into a more compact representation using algorithms such as Principal Component Analysis, followed by quantization, to produce the audio fingerprint. Usually, a fingerprint obtained by processing a single audio frame has a relatively small size and may not be sufficiently unique to identify the original audio sequence with the desired degree of reliability. To enhance fingerprint uniqueness and thus increase the probability of correct recognition (and decrease the false positive rate), small sub-fingerprints can be combined into larger blocks representing about three to five seconds of audio.
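By way of illustration only, the first two steps of such a pipeline (segmenting the audio into frames and computing one feature per frame) might be sketched as follows. The function names, the use of a naive discrete Fourier transform in place of an FFT, and the choice of spectral flatness (the ratio of the geometric mean to the arithmetic mean of the power spectrum) as the sole feature are assumptions made for this sketch. The subsequent transformation, dimensionality reduction, and quantization steps are omitted.

```python
import cmath
import math


def split_into_frames(samples, frame_len, hop):
    """Segment a sample sequence into overlapping frames of frame_len samples."""
    last_start = len(samples) - frame_len
    return [samples[s:s + frame_len] for s in range(0, last_start + 1, hop)]


def power_spectrum(frame):
    """Power of each non-negative frequency bin, via a naive DFT (illustrative only)."""
    n = len(frame)
    return [
        abs(sum(x * cmath.exp(-2j * cmath.pi * k * t / n)
                for t, x in enumerate(frame))) ** 2
        for k in range(n // 2 + 1)
    ]


def spectral_flatness(frame, floor=1e-12):
    """Geometric mean over arithmetic mean of the power spectrum.

    The small floor (an assumption of this sketch) avoids taking log(0).
    """
    power = [p + floor for p in power_spectrum(frame)]
    geometric_mean = math.exp(sum(math.log(p) for p in power) / len(power))
    arithmetic_mean = sum(power) / len(power)
    return geometric_mean / arithmetic_mean


def feature_sequence(samples, frame_len, hop):
    """One feature value per frame; a real system would compute several."""
    return [spectral_flatness(f) for f in split_into_frames(samples, frame_len, hop)]
```

In this sketch a pure tone yields a flatness near zero, while an impulse, whose spectrum is flat, yields a flatness near one.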
One fingerprinting technique, developed by Philips, uses a short-time Fourier Transform (STFT) to extract a 32-bit sub-fingerprint for every interval of 11.8 milliseconds of an audio signal. The audio signal is first segmented into overlapping frames 0.37 seconds long, and the frames are weighted by a Hamming window with an overlap factor of 31/32 and transformed into the frequency domain using an FFT. The frequency domain data obtained may be presented as a spectrogram (i.e., a time-frequency diagram), with time on the horizontal axis and frequency on the vertical axis. The spectrum of every frame (spectrogram column) is segmented into 33 non-overlapping frequency bands in the range of 300 Hz to 2000 Hz, with logarithmic spacing. The spectral energy in every band is calculated, and a 32-bit sub-fingerprint is generated using the sign of the energy difference in consecutive bands along the time and frequency axes. If the energy difference between two adjacent bands in one frame is larger than the energy difference between the same bands in the previous frame, the algorithm outputs “1” for the corresponding bit in the sub-fingerprint; otherwise, it outputs “0” for the corresponding bit. A fingerprint is assembled by combining 256 consecutive 32-bit sub-fingerprints into a single fingerprint block, which corresponds to about three seconds of audio.
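The bit-derivation rule described above can be sketched as follows, assuming the per-band spectral energies of each frame have already been computed. The function names and the bit ordering (lowest band pair in the most significant bit) are assumptions of this sketch rather than details of the Philips technique.

```python
def sub_fingerprint(prev_energies, curr_energies):
    """Derive one sub-fingerprint from the band energies of two consecutive frames.

    With 33 bands per frame the result has 32 bits. A bit is 1 when the energy
    difference between two adjacent bands in the current frame is larger than
    the difference between the same bands in the previous frame, else 0.
    """
    if len(prev_energies) != len(curr_energies):
        raise ValueError("both frames must have the same number of bands")
    bits = 0
    for m in range(len(curr_energies) - 1):
        curr_diff = curr_energies[m] - curr_energies[m + 1]
        prev_diff = prev_energies[m] - prev_energies[m + 1]
        bits = (bits << 1) | (1 if curr_diff > prev_diff else 0)
    return bits


def fingerprint_block(band_energies):
    """One sub-fingerprint per pair of consecutive frames.

    band_energies[t][m] is the energy of band m in frame t.
    """
    return [
        sub_fingerprint(band_energies[t - 1], band_energies[t])
        for t in range(1, len(band_energies))
    ]
```

At one sub-fingerprint per 11.8 milliseconds, a block of 256 sub-fingerprints spans 256 × 11.8 ms, or roughly 3.02 seconds of audio.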
Although designed to be robust against common types of audio processing, noise, and distortions, this algorithm is not very robust against large speed changes because of the resulting spectrum scaling. Accordingly, a modified algorithm was proposed in which audio fingerprints are extracted in the scale-invariant Fourier-Mellin domain. The modified algorithm includes additional steps performed after transforming the audio frames into the frequency domain. These additional steps include spectrum log-mapping followed by a second Fourier transform. For every frame, therefore, a first FFT is applied, the result is log-mapped to obtain a power spectrum, and a second FFT is applied. This can be described as the Fourier transform of the logarithmically resampled Fourier transform, and it is similar to the well-known MFCC methods widely used in speech recognition. The main difference is that the Fourier-Mellin transform uses log-mapping of the whole spectrum, while MFCC is based on the mel-frequency scale (which is linear up to 1 kHz and logarithmically spaced at higher frequencies, mimicking the properties of the human auditory system).
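The three steps of the modified algorithm (a first transform, log-mapping of the frequency axis, and a second transform) might be sketched as follows for a single frame. The function names, the use of magnitude spectra, the linear interpolation used for resampling, and the naive discrete Fourier transform standing in for an FFT are all assumptions of this sketch. The intuition is that a speed change scales the frequency axis, which becomes a shift on a logarithmic axis, and the magnitude of a Fourier transform is insensitive to shifts.

```python
import cmath
import math


def dft_magnitudes(samples):
    """Magnitudes of the non-negative frequency bins of a naive DFT."""
    n = len(samples)
    return [
        abs(sum(x * cmath.exp(-2j * cmath.pi * k * t / n)
                for t, x in enumerate(samples)))
        for k in range(n // 2 + 1)
    ]


def log_resample(spectrum, num_points):
    """Resample bins 1..len-1 onto a logarithmically spaced frequency axis."""
    top = len(spectrum) - 1
    resampled = []
    for i in range(num_points):
        pos = top ** (i / (num_points - 1))  # runs from bin 1 up to bin `top`
        lo = min(int(pos), top - 1)
        frac = pos - lo
        resampled.append((1 - frac) * spectrum[lo] + frac * spectrum[lo + 1])
    return resampled


def fourier_mellin_features(frame, num_points=32):
    """First transform, log-mapping of the frequency axis, second transform."""
    spectrum = dft_magnitudes(frame)
    log_spectrum = log_resample(spectrum, num_points)
    return dft_magnitudes(log_spectrum)


# Usage: features for a 64-sample frame containing a single tone.
frame = [math.sin(2 * math.pi * 5 * t / 64) for t in range(64)]
features = fourier_mellin_features(frame)
```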
The Philips algorithm falls into a category of so-called short-term analysis algorithms because the sub-fingerprints are calculated using spectral coefficients of just two consecutive frames. There are other algorithms that extract spectral features using multiple overlapped FFT frames in the spectrogram. Some of the methods based on evaluation of multiple frames in time are known as long-term spectrogram analysis algorithms.
One long-term analysis algorithm, described for example in Sukittanon, “Modulation-Scale Analysis for Content Identification,” IEEE Transactions on Signal Processing, vol. 52, no. 10 (October 2004), is based on the estimation of modulation frequencies. In this algorithm, the audio is segmented and a spectrogram is computed for it. A modulation spectrum is then calculated for each spectrogram band (i.e., a range of frequencies in the spectrogram) by applying a second transform along the temporal row (i.e., the horizontal axis) of the spectrogram. This is different from the modified Philips approach, in which the second FFT is applied along the frequency column of the spectrogram (i.e., the vertical axis). In the modulation-frequency approach, the spectrogram is segmented into N frequency bands, and the same number N of continuous wavelet transforms (CWT) are calculated, one for each band.
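The direction of the second transform can be illustrated with the following sketch, which applies a transform along the temporal row of each spectrogram band. Note that a naive discrete Fourier transform is substituted here for the continuous wavelet transform used in the cited algorithm, purely to keep the example short. The function names and the layout of the spectrogram as a list of frames are likewise assumptions.

```python
import cmath


def dft_magnitudes(row):
    """Magnitudes of the non-negative frequency bins of a naive DFT."""
    n = len(row)
    return [
        abs(sum(x * cmath.exp(-2j * cmath.pi * k * t / n)
                for t, x in enumerate(row)))
        for k in range(n // 2 + 1)
    ]


def modulation_spectrum(spectrogram):
    """Apply a second transform along time, one per frequency band.

    spectrogram[t][b] is the energy of band b in frame t. The result is
    indexed as result[b][k], where k is the modulation-frequency bin.
    """
    band_rows = zip(*spectrogram)  # transpose: one temporal row per band
    return [dft_magnitudes(list(row)) for row in band_rows]


# Usage: band 0 is steady over time, band 1 alternates between frames.
spectrogram = [[1.0, 0.0], [1.0, 2.0]] * 4  # 8 frames, 2 bands
result = modulation_spectrum(spectrogram)
```

In this usage example the steady band concentrates its energy in the zero modulation-frequency bin, whereas the alternating band also shows energy in the highest modulation-frequency bin.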
Although the developers of this algorithm claim superior performance compared to the Philips algorithm, existing algorithms still exhibit a number of deficiencies. For example, the algorithms may not be sufficiently robust to identify distorted speech and music reliably, especially when the audio is compressed using a CELP audio codec (e.g., associated with cell phone audio, such as GSM). Moreover, these algorithms are generally sensitive to noise and analog distortions, such as those associated with a microphone recording. And even if the algorithms can identify audio in the presence of a single type of distortion, they may not be able to handle a combination of multiple distortions, which is more common and closer to a real-world scenario (e.g., as with a cell phone, audio recorded from a microphone in a noisy room with light reverberation followed by GSM compression).
In practical applications, therefore, existing fingerprinting schemes have unacceptably high error rates (e.g., false positives and false negatives), produce fingerprints that are too large to be commercially viable, and/or are too slow. Accordingly, there exists a need to overcome the limitations that current audio recognition techniques have failed to solve.