This disclosure generally relates to audio identification, and more specifically to detecting distorted audio signals based on audio fingerprinting.
An audio fingerprint is a compact summary of an audio signal that can be used to perform content-based identification. For example, existing audio signal identification systems use various audio signal identification schemes to identify the name, artist, and/or album of an unknown song. When presented with an unidentified audio signal, an audio signal identification system is configured to generate an audio fingerprint for the audio signal, where the audio fingerprint includes characteristic information about the audio signal usable for identifying the audio signal. The characteristic information about the audio signal may be based on acoustical and perceptual properties of the audio signal. Using fingerprints and matching algorithms, the audio fingerprint generated from the audio signal is compared to a database of reference audio fingerprints for identification of the audio signal.
Audio fingerprinting techniques should be robust to a variety of distortions caused by noisy transmission channels or specific sound processing. Pitch shifting and tempo shifting are two of the most common types of distortion, and two of the most problematic for most existing audio identification systems based on analysis of spectral content. Pitch shifting refers to raising or lowering the original pitch of an audio signal. When pitch shifting occurs, all the frequencies of the audio signal in the spectrum are multiplied by a factor. Tempo shifting, or tempo variation, refers to playing an audio signal slower or faster than its original speed. Since the spectral content of an audio signal is either stretched along the time axis (tempo shifting) or shifted along the frequency axis (pitch shifting), existing audio identification solutions based on the analysis of spectral content are often not robust enough to accurately identify distorted versions of an audio signal.
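The effect of the two distortions on spectral features can be illustrated with a minimal sketch. The function names, the 100 Hz linear band layout, and the sample peak frequencies below are assumptions chosen only for illustration; they are not taken from any particular audio identification system.

```python
def pitch_shift(frequencies_hz, factor):
    """Multiply every frequency by the same factor (pitch shifting)."""
    return [f * factor for f in frequencies_hz]


def tempo_shift(event_times_s, speed):
    """Play back at `speed` times the original rate: events move to t / speed."""
    return [t / speed for t in event_times_s]


def band_index(freq_hz, band_width_hz=100.0):
    """Map a frequency to a hypothetical linear band of fixed width."""
    return int(freq_hz // band_width_hz)


# Hypothetical spectral peaks of an undistorted signal.
peaks_hz = [440.0, 880.0, 1320.0]
shifted_hz = pitch_shift(peaks_hz, 1.1)  # pitch raised by 10%

original_bands = [band_index(f) for f in peaks_hz]
shifted_bands = [band_index(f) for f in shifted_hz]

# Hypothetical event times, played back at double speed.
faster_times = tempo_shift([0.0, 1.0, 2.0], 2.0)
```

In this sketch, two of the three peaks land in different bands after the pitch shift, so any feature derived from band membership changes even though the underlying recording is the same.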
Existing audio identification systems provide various solutions for detecting distorted versions of audio signals, such as computing the Hamming distance between two sub-fingerprints of audio signals. The lower the Hamming distance between two sub-fingerprints, the more closely they match, so two sub-fingerprints are typically treated as matching when their Hamming distance does not exceed a threshold. However, a pitch shift can lead to significant changes in the spectral content of an audio signal, resulting in a high Hamming distance and consequently a low matching rate. One possible solution is to extract several indexes, each corresponding to a given pitch shift, and then to match the sub-fingerprint being evaluated against all of the indexes. However, this approach adds computational load to the matching process and requires additional space to store multiple fingerprint versions.
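A Hamming-distance comparison of this kind might be sketched as follows, assuming sub-fingerprints are represented as 32-bit integers. The representation, the function names, the threshold value, and the sample bit patterns are all assumptions made for illustration.

```python
def hamming_distance(a: int, b: int) -> int:
    """Count the bit positions at which two sub-fingerprints differ."""
    return bin(a ^ b).count("1")


def is_match(a: int, b: int, max_distance: int) -> bool:
    """Treat two sub-fingerprints as matching if they differ in few enough bits."""
    return hamming_distance(a, b) <= max_distance


# Hypothetical 32-bit sub-fingerprint of a reference signal.
reference = 0b1011_0010_1110_0001_0101_1100_0011_1010

# A mildly distorted version: 2 bits flipped.
mild = reference ^ 0b101

# A heavily distorted version (e.g. after a pitch shift): 16 bits flipped.
heavy = reference ^ 0x0FF0FF00

mild_distance = hamming_distance(reference, mild)
heavy_distance = hamming_distance(reference, heavy)
```

With a hypothetical threshold of 4 bits, `is_match(reference, mild, 4)` succeeds while `is_match(reference, heavy, 4)` fails, which mirrors the low matching rate described above for pitch-shifted signals.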