Identification of multimedia contents, and audio contents in particular, is a field that attracts a lot of attention because it is an enabling technology for many applications, ranging from copyright enforcement or searching in multimedia databases to metadata linking, audio and video synchronization, and the provision of many other added value services. Many such applications rely on the comparison of audio content captured by a microphone with a database of reference audio contents. Some of these applications are exemplified below.
Peters et al disclose in U.S. patent application Ser. No. 10/749,979 a method and apparatus for identifying ambient audio captured from a microphone and presenting to the user content associated with such identified audio. Similar methods are described in International Patent App. No. PCT/US2006/045551 (assigned to Google) for identifying ambient audio corresponding to a media broadcast, presenting personalized information to the user in response to the identified audio, and a number of other interactive applications.
U.S. patent application Ser. No. 09/734,949 (assigned to Shazam) describes a method and system for interacting with users, upon a user-provided sample related to his/her environment that is delivered to an interactive service in order to trigger events, with such sample including (but not limited to) a microphone capture.
U.S. patent application Ser. No. 11/866,814 (assigned to Shazam) describes a method for identifying a content captured from a data stream, which can be audio broadcast from a broadcast source such as a radio or TV station. The described method could be used for identifying a song within a radio broadcast.
Wang et al describe in U.S. patent application Ser. No. 10/831,945 a method for performing transactions, such as music purchases, upon the identification of a captured sound using, among others, a robust audio hashing method.
The use of robust hashing is also considered by R. Reisman in U.S. patent application Ser. No. 10/434,032 for interactive TV applications. Lu et al. consider in U.S. patent application Ser. No. 11/595,117 the use of robust audio hashes for performing audience measurements of broadcast programs.
Many techniques for performing audio identification exist. When one has the certainty that the audio to be identified and the reference audio exist in bit-by-bit exact copies, traditional cryptographic hashing techniques can be used to efficiently perform searches. However, if the audio copies differ in even a single bit, this approach fails. Other techniques for audio identification rely on attached meta-data, but they are not robust against format conversion, manual removal of the meta-data, D/A/D conversion, etc. When the audio can be slightly or severely distorted, other techniques which are sufficiently robust to such distortions must be used. Those techniques include watermarking and robust audio hashing. Watermarking-based techniques assume that the content to be identified conveys a certain code (watermark) that has been embedded a priori. However, watermark embedding is not always feasible, either for scalability reasons or because of other technological shortcomings. Moreover, if an unwatermarked copy of a given audio content is found, the watermark detector cannot extract any identification information from it. In contrast, robust audio hashing techniques do not need any kind of information embedding in the audio contents, thus rendering them more universal. Robust audio hashing techniques analyze the audio content in order to extract a robust descriptor, usually known as a robust hash or fingerprint, that can be compared with other descriptors stored in databases.
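The fragility of cryptographic hashing described above can be illustrated with a minimal sketch. SHA-256 is chosen here only as a representative cryptographic hash, and the byte string is a stand-in for raw audio samples; neither is taken from the cited works.

```python
import hashlib


def exact_digest(audio_bytes: bytes) -> str:
    """Return the SHA-256 digest, which is identical only for bit-exact copies."""
    return hashlib.sha256(audio_bytes).hexdigest()


# Stand-in for raw audio samples (illustrative data, not real audio).
original = bytes(range(256)) * 4

# Flip a single bit in one byte of a copy.
altered_buffer = bytearray(original)
altered_buffer[10] ^= 0x01
altered = bytes(altered_buffer)

# A bit-exact copy produces the same digest.
print(exact_digest(original) == exact_digest(bytes(original)))
# A copy differing in one bit produces an unrelated digest.
print(exact_digest(original) == exact_digest(altered))
```

Because the two digests share no usable similarity, there is no notion of "closeness" to exploit, which is why distorted audio calls for a different kind of descriptor.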
Many robust audio hashing techniques exist. A review of the most popular existing algorithms can be found in the article by Cano et al. entitled “A review of audio fingerprinting”, Journal of VLSI Signal Processing 41, 271-284, 2005. Some of the existing techniques are intended to identify complete songs or audio sequences, or even CDs or playlists. Other techniques are aimed at identifying a song or an audio sequence using only a small fragment of it. Usually, the latter can be adapted to perform identification in streaming mode, i.e. capturing successive fragments from an audio stream and comparing them with databases where the reference contents are not necessarily synchronized with those that have been captured. This is, in general, the most common operating mode for performing identification of broadcast audio and microphone-captured audio.
Most methods for performing robust audio hashing divide the audio stream into contiguous blocks of short duration, usually with a significant degree of overlap. For each of these blocks, a number of different operations are applied in order to extract distinctive features in such a way that they are robust to a given set of distortions. These operations include, on one hand, the application of signal transforms such as the Fast Fourier Transform (FFT), Modulated Complex Lapped Transform (MCLT), Discrete Wavelet Transform, Discrete Cosine Transform (DCT), Haar Transform or Walsh-Hadamard Transform, and others. Another processing step common to most robust audio hashing methods is the separation of the transformed audio signals into sub-bands, emulating properties of the human auditory system, in order to extract perceptually meaningful parameters. A number of features can be extracted from the processed audio signals, namely Mel-Frequency Cepstrum Coefficients (MFCC), Spectral Flatness Measure (SFM), Spectral Correlation Function (SCF), the energy of the Fourier coefficients, the spectral centroids, the zero-crossing rate, etc. On the other hand, further common operations include frequency-time filtering to eliminate spurious channel effects and to increase decorrelation, and the use of dimensionality reduction techniques such as Principal Components Analysis (PCA), Independent Component Analysis (ICA), or the DCT.
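The first step of this general pipeline, dividing the stream into short overlapping blocks, can be sketched as follows. The function name and the parameters `frame_len` and `hop` are illustrative choices, not terms from the cited methods.

```python
def split_into_frames(samples, frame_len, hop):
    """Split a sample sequence into overlapping frames.

    Consecutive frames start `hop` samples apart, so they share
    `frame_len - hop` samples when hop < frame_len. Trailing samples
    that do not fill a whole frame are dropped.
    """
    if frame_len <= 0 or hop <= 0:
        raise ValueError("frame_len and hop must be positive")
    last_start = len(samples) - frame_len
    return [samples[i:i + frame_len] for i in range(0, last_start + 1, hop)]


# Ten samples, frames of four samples, 50% overlap.
frames = split_into_frames(list(range(10)), frame_len=4, hop=2)
print(frames)  # [[0, 1, 2, 3], [2, 3, 4, 5], [4, 5, 6, 7], [6, 7, 8, 9]]
```

Each frame would then be passed to a transform and feature extractor of the kind listed above.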
A well-known method for robust audio hashing that fits the general description given above is described in the European patent No. 1362485 (assigned to Philips). The steps of this method can be summarized as follows: partitioning the audio signal into fixed-length overlapping windowed segments, computing the spectrogram coefficients of the audio signal using a 32-band filterbank in logarithmic frequency scale, performing a 2D filtering of the spectrogram coefficients, and quantizing the resulting coefficients with a binary quantizer according to their sign. Thus, the robust hash is composed of a binary sequence of 0s and 1s. The comparison of two robust hashes takes place by computing their Hamming distance. If such distance is below a certain threshold, then the two robust hashes are assumed to represent the same audio signal. This method provides reasonably good performance under mild distortions, but in general it is severely degraded under real-world working conditions. A significant number of subsequent works have added further processing or modified certain parts of the method in order to improve its robustness against different types of distortions.
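The last two steps (sign quantization after 2D filtering, and comparison by Hamming distance) can be sketched as follows. The 2D filter used here, a difference across adjacent bands followed by a difference across adjacent frames, is an assumption made for illustration, and the threshold `max_bit_error_rate` is a hypothetical value rather than one taken from the patent.

```python
def sign_bits(energies):
    """Derive hash bits from a list of frames of band energies.

    Assumed 2D filter: difference between adjacent bands, then
    difference between adjacent frames. Each bit is the sign of
    the filtered coefficient.
    """
    bits = []
    for prev, cur in zip(energies, energies[1:]):
        row = []
        for m in range(len(cur) - 1):
            d = (cur[m] - cur[m + 1]) - (prev[m] - prev[m + 1])
            row.append(1 if d > 0 else 0)
        bits.append(row)
    return bits


def hamming_distance(hash_a, hash_b):
    """Count differing bits between two hashes of identical shape."""
    return sum(x != y
               for row_a, row_b in zip(hash_a, hash_b)
               for x, y in zip(row_a, row_b))


def same_audio(hash_a, hash_b, max_bit_error_rate=0.35):
    """Declare a match when the fraction of differing bits is below a threshold."""
    total = sum(len(row) for row in hash_a)
    if total == 0:
        raise ValueError("empty hash")
    return hamming_distance(hash_a, hash_b) / total < max_bit_error_rate


energies = [[1.0, 2.0, 3.0], [3.0, 1.0, 0.0], [0.0, 5.0, 1.0]]
print(sign_bits(energies))  # [[1, 1], [0, 1]]
```

Since only signs are kept, scaling all energies by a common positive factor leaves the hash unchanged, which is one source of the method's robustness to mild distortions.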
The method described in EP1362485 is modified in the international patent application PCT/IB03/03658 (assigned to Philips) in order to gain resilience against changes in the reproduction speed of audio signals. In order to deal with the misalignments in the temporal and frequency domain caused by speed changes, the method introduces an additional step in the method described in EP1362485. This step consists in computing the temporal autocorrelation of the output coefficients of the filterbank, whose number of bands is also increased from 32 to 512. The autocorrelation coefficients can be optionally low-pass filtered in order to increase the robustness.
The article by Son et al. entitled “Sub-fingerprint Masking for a Robust Audio Fingerprinting System in a Real-noise Environment for Portable Consumer Devices”, published in IEEE Transactions on Consumer Electronics, vol. 56, No. 1, February 2010, proposes an improvement over EP1362485 consisting of computing a mask for the robust hash, based on the estimation of the fundamental frequency components of the audio signal that generates the reference robust hash. This mask, which is intended to improve the robustness of the method disclosed in EP1362485 against noise, has the same length as the robust hash, and can take the values 0 or 1 in each position. To compare two robust hashes, they are first multiplied element by element by the mask, and then their Hamming distance is computed as in EP1362485. Park et al. also pursue improved robustness against noise in the article “Frequency-temporal filtering for a robust audio fingerprinting scheme in real-noise environments”, published in ETRI Journal, Vol. 28, No. 4, 2006. In that article the authors study the use of several linear filters to replace the 2D filter used in EP1362485, keeping the remaining components unaltered.
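The masked comparison can be sketched as follows. Only the comparison step is shown; how the mask is derived from the fundamental frequency estimate is not reproduced here, and the example mask is arbitrary.

```python
def masked_hamming(query, reference, mask):
    """Hamming distance computed after multiplying both hashes by a 0/1 mask.

    Positions where the mask is 0 are forced to 0 in both hashes,
    so they can never contribute to the distance.
    """
    if not (len(query) == len(reference) == len(mask)):
        raise ValueError("hashes and mask must have the same length")
    return sum((q * k) != (r * k) for q, r, k in zip(query, reference, mask))


query = [1, 0, 1, 1]
reference = [1, 1, 0, 1]
mask = [1, 1, 0, 1]  # arbitrary example: third bit considered unreliable

print(masked_hamming(query, reference, [1, 1, 1, 1]))  # 2
print(masked_hamming(query, reference, mask))          # 1
```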
Another well-known robust audio hashing method is described in the European patent No. 1307833 (assigned to Shazam). The disclosed method computes a series of “landmarks” or salient points (e.g. spectrogram peaks) of the audio recording, and it computes a robust hash for each landmark. In order to decrease the probability of false alarm, the landmarks are linked to other landmarks in their vicinity. Hence, each audio recording is characterized by a list of pairs [landmark, robust hash]. The method for comparison of audio signals consists of two steps. The first step compares the robust hashes of each landmark found in the query and reference audio, and for each match it stores a pair of corresponding time locations. The second step represents the pairs of time locations in a scatter plot, and a match between the two audio signals is declared if such scatter plot can be well approximated by a unit-slope line. U.S. Pat. No. 7,627,477 (assigned to Shazam) improves the method described in EP1307833, especially with regard to resistance against speed changes and efficiency in matching audio samples.
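The two-step comparison can be sketched as follows. Instead of fitting a line to the scatter plot, this sketch counts how many matched pairs share the same time offset, which is equivalent to looking for points on a unit-slope line when time locations are quantized to integers. That substitution, the integer time units, and the `min_votes` threshold are assumptions made for illustration.

```python
from collections import Counter, defaultdict


def matching_time_pairs(query, reference):
    """Step 1: for every hash present in both lists, pair the time locations.

    `query` and `reference` are lists of (time, hash) tuples.
    """
    index = defaultdict(list)
    for t_ref, h in reference:
        index[h].append(t_ref)
    return [(t_q, t_ref) for t_q, h in query for t_ref in index.get(h, [])]


def best_alignment(pairs):
    """Step 2: find the most common offset t_ref - t_query and its vote count."""
    if not pairs:
        return None, 0
    offsets = Counter(t_ref - t_q for t_q, t_ref in pairs)
    offset, votes = offsets.most_common(1)[0]
    return offset, votes


def is_match(query, reference, min_votes=3):
    """Declare a match when enough pairs agree on a single offset."""
    _, votes = best_alignment(matching_time_pairs(query, reference))
    return votes >= min_votes


reference = [(10, "a"), (11, "b"), (12, "c"), (13, "d"), (20, "a")]
query = [(0, "a"), (1, "b"), (2, "c"), (3, "x")]
print(best_alignment(matching_time_pairs(query, reference)))  # (10, 3)
```

The spurious pair produced by the repeated hash "a" at time 20 gets a single vote, so it does not affect the decision.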
In some recent research articles, such as the article by Cotton and Ellis “Audio fingerprinting to identify multiple videos of an event” in IEEE International Conference on Acoustics, Speech and Signal Processing, 2010, and Umapathy et al. “Audio Signal Processing Using Time-Frequency Approaches: Coding, Classification, Fingerprinting, and Watermarking”, in EURASIP Journal on Advances in Signal Processing, 2010, the proposed robust audio hashing methods decompose the audio signal in over-complete Gabor dictionaries in order to create a sparse representation of the audio signal.
The methods described in the patents and articles referenced above do not explicitly consider solutions to mitigate the distortions caused by multipath audio propagation and equalization, which are typical in microphone-captured audio identification, and which very seriously impair the identification performance if they are not taken into account. Distortions of this kind have been considered in the design of other methods, which are reviewed below.
The international patent application PCT/ES02/00312 (assigned to Universitat Pompeu-Fabra) discloses a robust audio hashing method for song identification in broadcast audio, which regards the channel from the loudspeakers to the microphone as a convolutive channel. The method described in PCT/ES02/00312 transforms the spectral coefficients extracted from the audio signal to the logarithmic domain, with the aim of transforming the effect of the channel into an additive one. It then applies a high-pass linear filter in the temporal axis to the transformed coefficients, with the aim of removing the slow variations which are assumed to be caused by the convolutive channel. The descriptors extracted for composing the robust hash also include the energy variations as well as first and second order derivatives of the spectral coefficients. An important difference between this method and the methods referenced above is that, instead of quantizing the descriptors, the method described in PCT/ES02/00312 represents the descriptors by means of Hidden Markov Models (HMM). The HMMs are obtained by means of a training phase performed over a song database. The comparison of robust hashes is done by means of the Viterbi algorithm. One of the drawbacks of this method is the fact that the log transform applied for removing the convolutive distortion transforms the additive noise in a non-linear fashion. This causes the identification performance to degrade rapidly as the noise level of the audio capture increases.
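The log-domain idea can be illustrated with a minimal sketch. A time-invariant channel is modeled here as a fixed gain per frequency band, and a first difference along time is used as the high-pass filter; both are simplifying assumptions, not the filter or channel model of PCT/ES02/00312.

```python
import math


def log_highpass(band_energies):
    """Take log-energies, then the first difference along the time axis.

    `band_energies` is a list of frames, each a list of positive band
    energies. A per-band gain becomes an additive constant after the
    log and is cancelled by the difference.
    """
    logs = [[math.log(e) for e in frame] for frame in band_energies]
    return [[c - p for c, p in zip(cur, prev)]
            for prev, cur in zip(logs, logs[1:])]


def apply_channel(band_energies, gains):
    """Model a time-invariant channel as one multiplicative gain per band."""
    return [[e * g for e, g in zip(frame, gains)] for frame in band_energies]


clean = [[1.0, 2.0], [4.0, 1.0], [2.0, 8.0]]
captured = apply_channel(clean, gains=[0.25, 3.0])

# The two feature sets agree up to floating-point rounding.
print(log_highpass(clean))
print(log_highpass(captured))
```

If noise is added to `captured` before the logarithm, the cancellation no longer holds, which reflects the drawback noted above.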
Other methods try to overcome the distortions caused by microphone capture by resorting to techniques originally developed by the computer vision community, such as machine learning. In the article “Computer vision for music identification”, published in Computer Vision and Pattern Recognition, IEEE Computer Society Conference on Computer Vision and Pattern Recognition, Vol. 1, July 2005, Ke et al. generalize the method disclosed in EP1362485. Ke et al. extract from the music files a sequence of spectral sub-band energies that are arranged in a spectrogram, which is regarded as a digital image. The pairwise Adaboost technique is applied on a set of Viola-Jones features (simple 2D filters that generalize the filter used in EP1362485) in order to learn the local descriptors and thresholds that best identify the musical fragments. The generated robust hash is a binary string, as in EP1362485, but the method for comparing robust hashes is much more complex, computing a likelihood measure according to an occlusion model estimated by means of the Expectation Maximization (EM) algorithm. Both the selected Viola-Jones features and the parameters of the EM model are computed in a training phase that requires pairs of clean and distorted audio signals. The resulting performance is highly dependent on the training phase, and also presumably on the mismatch between the training and capturing conditions. Furthermore, the complexity of the comparison method makes it inadvisable for real-time applications.
In the article “Boosted binary audio fingerprint based on spectral subband moments”, published in IEEE International Conference on Acoustics, Speech and Signal Processing, vol. 1, pp. 241-244, April 2007, Kim and Yoo follow the same principles of the method proposed by Ke et al. Kim and Yoo also resort to the Adaboost technique, but using normalized spectral sub-band moments instead of spectral sub-band energies.
U.S. patent App. No. 60/823,881 (assigned to Google) also discloses a method for robust audio hashing based on techniques commonly used in the field of computer vision, inspired by the insights provided by Ke et al. However, instead of applying Adaboost this method applies 2D wavelet analysis on the audio spectrogram, which is regarded as a digital image. The wavelet transform of the spectrogram is computed, and only a limited number of meaningful coefficients is kept. The coefficients of the computed wavelets are quantized according to their sign, and the Min-Hash technique is applied in order to reduce the dimensionality of the final robust hash. The comparison of robust hashes takes place by means of the Locality-Sensitive-Hashing technique in order for the comparison to be efficient in large databases, and dynamic-time warping in order to increase robustness against temporal misalignments.
Other methods try to increase the robustness against frequency distortions by applying some normalization to the spectral coefficients. The paper by Sukittanon and Atlas, “Modulation frequency features for audio fingerprinting”, presented in IEEE International Conference of Acoustics, Speech and Signal Processing, May 2002, is based on modulation frequency analysis in order to characterize the time-varying behavior of the audio signal. A given audio signal is first decomposed into a set of frequency sub-bands, and the modulation frequency of each sub-band is estimated by means of a wavelet analysis at different time scales. At this point, the robust hash of an audio signal consists of a set of modulation frequency features at different time scales in each sub-band. Finally, for each frequency sub-band, the modulation frequency features are normalized by scaling them uniformly by the sum of all the modulation frequency values computed for a given audio fragment. This approach has several drawbacks. On one hand, it assumes that the distortion is constant throughout the duration of the whole audio fragment. Thus, variations in the equalization or volume that occur in the middle of the analyzed fragment will negatively impact its performance. On the other hand, in order to perform the normalization it is necessary to wait until a whole audio fragment is received and its features extracted. These drawbacks make the method inadvisable for real-time or streaming applications.
U.S. Pat. No. 7,328,153 (assigned to Gracenote) describes a method for robust audio hashing that decomposes windowed segments of the audio signals in a set of spectral bands. A time-frequency matrix is constructed wherein each element is computed from a set of audio features in each of the spectral bands. The used audio features are either DCT coefficients or wavelet coefficients for a set of wavelet scales. The normalization approach is very similar to that in the method described by Sukittanon and Atlas: in order to improve the robustness against frequency equalization, the elements of the time-frequency matrix are normalized in each band by the mean power value in such band. The same normalization approach is described in U.S. patent application Ser. No. 10/931,635.
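Per-band normalization by the mean power can be sketched as follows. The matrix layout (`matrix[t][b]` holding the power at frame `t` and band `b`) and the function name are illustrative assumptions; the cited patent's exact features are not reproduced.

```python
def normalize_by_band_mean(matrix):
    """Divide every element by the mean power of its band over the fragment.

    `matrix[t][b]` is the power at frame t and band b. The whole fragment
    must be available before any normalized value can be produced.
    """
    n_frames = len(matrix)
    n_bands = len(matrix[0])
    means = [sum(row[b] for row in matrix) / n_frames for b in range(n_bands)]
    if any(m == 0 for m in means):
        raise ValueError("a band has zero mean power")
    return [[row[b] / means[b] for b in range(n_bands)] for row in matrix]


clean = [[1.0, 4.0], [3.0, 2.0], [2.0, 6.0]]

# Equalization held constant over the fragment: one gain per band.
equalized = [[row[0] * 10.0, row[1] * 0.1] for row in clean]

# Equalization that changes mid-fragment: gain applied to the last frame only.
changing = clean[:2] + [[clean[2][0] * 10.0, clean[2][1] * 0.1]]

print(normalize_by_band_mean(clean))  # [[0.5, 1.0], [1.5, 0.5], [1.0, 1.5]]
```

Normalizing `equalized` reproduces the same values up to rounding, whereas normalizing `changing` does not, which illustrates why this kind of normalization presumes a distortion that stays constant over the fragment.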
In order to further improve the robustness against distortions, many robust audio hashing methods apply in their final steps a quantizer to the extracted features. Quantized features are also beneficial for simplifying hardware implementations and reducing memory requirements. Usually, these quantizers are simple binary scalar quantizers although vector quantizers, Gaussian Mixture Models and Hidden Markov Models are also described in the previous art.
In general, and in particular when scalar quantizers are used, the quantizers are not optimally designed in order to maximize the identification performance of the robust hashing methods. Furthermore, for computational reasons, scalar quantizers are usually preferred since vector quantization is highly time-consuming, especially when the quantizer is non-structured. The use of multilevel quantizers (i.e. with more than two quantization cells) is desirable for increasing the discriminability of the robust hash. However, multilevel quantization is particularly sensitive to distortions such as frequency equalization, multipath propagation and volume changes, which occur in scenarios of microphone-captured audio identification. Hence, multilevel quantizers cannot be applied in such scenarios unless the hashing method is robust by construction to those distortions. A few works describe scalar quantization methods adapted to the input signal.
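The sensitivity of multilevel quantization to volume changes can be illustrated with a minimal sketch. The feature values and the fixed thresholds are arbitrary illustrative numbers, and the volume change is modeled as a plain scaling of the features.

```python
import bisect


def multilevel_quantize(values, thresholds):
    """Map each value to the index of the quantization cell it falls in.

    `thresholds` must be sorted in ascending order; N thresholds
    define N + 1 cells.
    """
    return [bisect.bisect_right(thresholds, v) for v in values]


def sign_quantize(values):
    """Binary quantizer that keeps only the sign of each value."""
    return [1 if v > 0 else 0 for v in values]


features = [-1.5, -0.4, 0.3, 1.2]
thresholds = [-1.0, 0.0, 1.0]             # fixed 4-level quantizer
quieter = [0.5 * v for v in features]     # same features at half the volume

print(multilevel_quantize(features, thresholds))  # [0, 1, 2, 3]
print(multilevel_quantize(quieter, thresholds))   # [1, 1, 2, 2]
print(sign_quantize(features), sign_quantize(quieter))  # [0, 0, 1, 1] [0, 0, 1, 1]
```

The four-level output changes in two of four positions under a simple gain, while the binary output is unaffected, at the price of carrying less discriminative information per feature.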
U.S. patent application Ser. No. 10/994,498 (assigned to Microsoft) describes a robust audio hashing method that performs computation of first order statistics of MCLT-transformed audio segments, performs an intermediate quantization step using an adaptive N-level quantizer that is obtained from the histogram of the signals, and finally quantizes the result using an error correcting decoder, which is a form of vector quantizer. In addition, it considers a randomization for the quantizer depending on a secret key.
Allamanche et al. describe in U.S. patent application Ser. No. 10/931,635 a method that also uses a scalar quantizer adapted to the input signal. In one embodiment, the quantization step is a function of the magnitude of the input values: it is larger for large values and smaller for small values. In another embodiment, the quantization steps are set in order to keep the quantization error within a predefined range of values. In yet another embodiment, the quantization step is larger for values of the input signal occurring with small relative frequency, and smaller for values of the input signal occurring with higher frequency.
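One way to obtain a quantization step that is smaller where input values are frequent and larger where they are rare is to place the thresholds at empirical quantiles, so that every cell holds roughly the same number of samples. The sketch below follows that idea; it is a hypothetical illustration and not necessarily the construction used in the cited application.

```python
import bisect


def equal_population_thresholds(values, levels):
    """Pick thresholds so that each of `levels` cells holds about the same
    number of the observed values (empirical quantiles)."""
    if levels < 2 or len(values) < levels:
        raise ValueError("need at least two levels and enough values")
    ordered = sorted(values)
    n = len(ordered)
    return [ordered[(k * n) // levels] for k in range(1, levels)]


def quantize(values, thresholds):
    """Map each value to the index of its cell."""
    return [bisect.bisect_right(thresholds, v) for v in values]


# Most values cluster near 0.1 to 0.4; two outliers are far away.
values = [0.1, 0.2, 0.25, 0.3, 0.35, 0.4, 5.0, 9.0]
thresholds = equal_population_thresholds(values, levels=4)

print(thresholds)                    # [0.25, 0.35, 5.0]
print(quantize(values, thresholds))  # [0, 0, 1, 1, 2, 2, 3, 3]
```

The cells are narrow in the dense region and wide in the sparse one. Since the thresholds are recomputed from whatever signal is observed, a distorted capture yields different thresholds than the reference.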
The main drawback of the methods described in U.S. patent application Ser. No. 10/931,635 and U.S. patent application Ser. No. 10/994,498 is that the optimized quantizer is always dependent on the input signal, making it suitable only for coping with mild distortions. Any moderate or severe distortion will likely cause the quantized features to be significantly different for the test audio and the reference audio, thus increasing the probability of missing correct audio matches.
As has been explained, the existing robust audio hashing methods still present numerous deficiencies that make them unsuitable for real-time identification of streaming audio captured with microphones. In this scenario, a robust audio hashing scheme must fulfill several requirements:
Computational efficiency in the robust hash generation. In many cases, the task of computing the robust audio hashes must be carried out in electronic devices performing a number of different simultaneous tasks and with small computational power (e.g. a user laptop, a mobile device or an embedded device). Hence, keeping a small computational complexity in the robust hash computation is of high interest.
Computational efficiency in the robust hash comparison. In some cases, the robust hash comparison must be run on big databases, thus demanding efficient search and match algorithms. A significant number of methods fulfilling this characteristic exist. However, there is another related scenario which is not well addressed in the prior art: a large number of users concurrently performing queries to a server, where the size of the reference database is not necessarily large. This is the case, for instance, in robust-hash-based audience measurement for broadcast transmissions, or in robust-hash-based interactive services, where both the number of users and the number of queries per second to the server can be very high. In this case, the emphasis on efficiency must be put on the comparison method rather than on the search method. Therefore, this latter scenario places the requirement that the robust hash comparison must be as simple as possible, in order to minimize the number of comparison operations.
High robustness to microphone-capture channels. When capturing streaming audio with microphones, the audio is subject to distortions like echo addition (due to multipath propagation of the audio), equalization and ambient noise. Moreover, the capturing device, for instance a microphone embedded in an electronic device such as a cell phone or a laptop, introduces more additive noise and possibly nonlinear distortions. Hence, the expected Signal to Noise Ratio (SNR) in this kind of application is very low (usually in the order of 0 dB or even lower). One of the main difficulties is to find a robust hashing method which is highly robust to multipath and equalization and whose performance does not dramatically degrade for low SNRs. As has been seen, none of the existing robust hashing methods is able to completely fulfill this requirement.
Reliability. Reliability is measured in terms of probability of false positive (PFP) and miss-detection (PMD). PFP measures the probability that a sample audio content is incorrectly identified, i.e. it is matched with another audio content which is not related to the sample audio. If PFP is high, then the robust audio hashing scheme is said to be not sufficiently discriminative. PMD measures the probability that the robust hash extracted from a sample audio content does not find any correspondence in the database of reference robust hashes, even when such correspondence exists. When PMD is high, the robust audio hashing scheme is said to be not sufficiently robust. While it is desirable to keep PMD as low as possible, the cost of false positives is in general much higher than that of miss-detections. Thus, for most applications it is preferable to keep the probability of false positive very low, even at the cost of a moderately high probability of miss-detection.
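The two reliability figures can be estimated empirically from a set of labeled identification trials, as in the minimal sketch below. The trial format, with `None` standing for "not in the database" or "no match returned", is an assumption made for illustration, as is the choice to compute PFP over all trials and PMD over the trials whose content is in the database.

```python
def error_rates(trials):
    """Estimate (PFP, PMD) from a list of (true_id, returned_id) trials.

    true_id is None when the sample is not in the reference database;
    returned_id is None when the system reports no match.
    PFP: fraction of all trials matched to unrelated content.
    PMD: fraction of in-database trials for which no match was found.
    """
    if not trials:
        raise ValueError("no trials")
    false_pos = sum(1 for truth, got in trials
                    if got is not None and got != truth)
    in_db = [(truth, got) for truth, got in trials if truth is not None]
    missed = sum(1 for _, got in in_db if got is None)
    pfp = false_pos / len(trials)
    pmd = missed / len(in_db) if in_db else 0.0
    return pfp, pmd


trials = [
    ("a", "a"),    # correct identification
    ("b", None),   # miss-detection
    ("c", "d"),    # false positive: matched to unrelated content
    (None, "a"),   # false positive: content not in the database
    (None, None),  # correct rejection
]
pfp, pmd = error_rates(trials)
print(pfp, pmd)
```

Raising the match threshold of a comparison method typically trades one figure against the other, which is why the preference for a very low PFP stated above shapes the choice of operating point.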