A number of systems exist for the automatic identification of an audio signal. Some such systems rely on an acoustic “fingerprint” of an input audio signal to identify the audio signal. The fingerprint is a condensed digital summary of salient features of the input audio signal. Once generated, the fingerprint is compared to a number of fingerprints of known audio signals stored in a database. If a matching fingerprint is found, the input audio signal is determined to be a duplicate or very similar copy of the known audio signal having the matching fingerprint.
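The generate-and-compare flow described above can be illustrated with a minimal sketch. It is not the method of any particular fingerprinting system. The sample rate, frame length, probe frequencies, match threshold, and all function names (`fingerprint`, `similarity`, `identify`, `tone_sequence`) are assumptions chosen for illustration. The toy fingerprint simply records, for each frame, which of a few probe frequencies carries the most energy.

```python
import math
import random

SAMPLE_RATE = 8000                          # Hz (assumed for illustration)
FRAME = 400                                 # samples per frame, i.e. 50 ms (assumed)
BANDS = [300, 600, 900, 1200, 1500, 1800]   # probe frequencies in Hz (assumed)

def band_power(frame, freq):
    """Energy of one frame at one probe frequency (single-bin DFT)."""
    step = 2 * math.pi * freq / SAMPLE_RATE
    re = sum(s * math.cos(step * n) for n, s in enumerate(frame))
    im = sum(s * math.sin(step * n) for n, s in enumerate(frame))
    return re * re + im * im

def fingerprint(signal):
    """Condensed summary: index of the strongest probe band in each frame."""
    fp = []
    for start in range(0, len(signal) - FRAME + 1, FRAME):
        frame = signal[start:start + FRAME]
        powers = [band_power(frame, f) for f in BANDS]
        fp.append(powers.index(max(powers)))
    return fp

def similarity(a, b):
    """Fraction of frame positions at which two fingerprints agree."""
    n = min(len(a), len(b))
    if n == 0:
        return 0.0
    return sum(1 for x, y in zip(a, b) if x == y) / n

def identify(query_fp, database, threshold=0.8):
    """Return the name of the best-matching stored fingerprint, or None."""
    best_name, best_score = None, 0.0
    for name, stored_fp in database.items():
        score = similarity(query_fp, stored_fp)
        if score > best_score:
            best_name, best_score = name, score
    return best_name if best_score >= threshold else None

def tone_sequence(freqs, frames_per_note=2):
    """Synthetic test signal: one pure tone per note."""
    return [math.sin(2 * math.pi * f * n / SAMPLE_RATE)
            for f in freqs
            for n in range(FRAME * frames_per_note)]

# Database of fingerprints of known signals (hypothetical entries).
database = {
    "song_a": fingerprint(tone_sequence([300, 900, 600, 1500])),
    "song_b": fingerprint(tone_sequence([1800, 1200, 300, 600])),
}

# A captured copy of song_a with a small amount of added noise.
rng = random.Random(0)
noisy_query = [s + rng.uniform(-0.1, 0.1)
               for s in tone_sequence([300, 900, 600, 1500])]
```

Calling `identify(fingerprint(noisy_query), database)` returns the name of the stored signal whose fingerprint agrees with the query in at least the threshold fraction of frames. A query that resembles no stored signal yields `None`.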
A key factor in the effectiveness of existing acoustic fingerprinting systems is the quality of the input audio signal and its similarity to the known matching signal. The more closely the two signals match one another, the more accurately the input audio signal can be matched to a stored signal.
For example, some fingerprinting systems are arranged to identify input music signals using a database of stored original signals. The systems allow a user to capture a sample of a music or other audio signal, for example from a broadcast radio or television signal. That captured or sampled signal is then “fingerprinted,” and the fingerprint is compared to the fingerprints of thousands of known, original music or other audio signals stored in a database. If a matching fingerprint is found, the captured music signal can be identified based on data associated with the matching original.
Such systems are highly effective when the captured input music signal is a near-perfect copy of the original music signal stored in the database. Although the broadcast may include some noise, minor frequency alteration, compression, or equalization, and ambient noise may be present, the broadcast audio signal still matches the original copy almost perfectly. The fingerprints are therefore very similar, apart from small variations created by noise or global amplitude variations caused by compression or poor recording signal strength, which makes a match easier to detect. In contrast, if the input audio signal varies in speed (e.g., is faster or slower) relative to the original, the systems have severe difficulty identifying the input audio signal.
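The sensitivity to speed can be seen in a small sketch under simplifying assumptions. Here a fingerprint is modeled as a plain list of per-frame labels, and matching is a position-by-position comparison. The names `change_speed` and `similarity` and the 0.8 match threshold are hypothetical, and real systems use richer features and alignment. The sketch only shows how a uniform speed change shifts frames out of alignment.

```python
def similarity(a, b):
    """Fraction of frame positions at which two label sequences agree."""
    n = min(len(a), len(b))
    if n == 0:
        return 0.0
    return sum(1 for x, y in zip(a, b) if x == y) / n

def change_speed(frames, speed):
    """Model playback at a different speed by resampling the frame sequence.

    speed > 1 shortens the sequence (faster); speed < 1 lengthens it (slower).
    """
    length = int(len(frames) / speed)
    last = len(frames) - 1
    return [frames[min(int(i * speed), last)] for i in range(length)]

# Per-frame labels of a stored "original": six notes, four frames each.
original = [0] * 4 + [2] * 4 + [1] * 4 + [4] * 4 + [3] * 4 + [5] * 4

same_speed = change_speed(original, 1.0)    # exact copy
faster = change_speed(original, 1.25)       # 25% faster rendition
slower = change_speed(original, 0.75)       # 25% slower rendition

MATCH_THRESHOLD = 0.8                       # assumed, for illustration only
for label, version in [("same", same_speed), ("faster", faster), ("slower", slower)]:
    score = similarity(original, version)
    print(label, round(score, 2), "match" if score >= MATCH_THRESHOLD else "no match")
```

In this toy model the exact copy agrees at every position, while both altered renditions drift further out of alignment as the sequence progresses and fall below the assumed threshold, even though every note is still present and in the same order.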
This problem is exacerbated when comparing signals where no true “original” can be identified or where the sample is not an exact copy of an “original.” For example, such systems struggle and generally fail when attempting to identify a “live” version of a piece of music, even one performed in the same key and by the same artist, when the “live” version includes tempo changes or other artistic variations that cause its fingerprint to differ from the fingerprint of the “original.”
As another example, such systems fail in the analysis of bird calls or other naturally occurring sounds. A bird may sing the same song repetitively. However, in the case of bird calls, there is no “perfect” original. Every time a bird sings a song, there is some variation from one rendition to the next, and no perfect song can be captured. To the casual human listener, the perceptual qualities of the song (e.g., amplitude, pitch, and tempo) often sound unchanging and repetitive. But closer analysis shows that significant variation exists in avian vocalizations, so much so that existing technologies used to identify music or video cannot be applied to bird vocalizations.
Therefore, it would be desirable to have a system and method for analyzing and identifying highly-variable audio signals, such as avian vocalizations.