Audio matching provides for identification of a recorded audio sample by comparing the audio sample to a set of reference samples. One example of a recorded audio sample is an audio track of a video. To make the comparison, the audio sample can be transformed to a time-frequency representation (e.g., a spectrogram computed using a short-time Fourier transform (STFT)). Using the time-frequency representation, interest points that characterize time and frequency locations of peaks or other distinct patterns of the spectrogram can be extracted from the audio sample. Descriptors can be computed as functions of sets of interest points. Descriptors of the audio sample can then be compared to descriptors of reference samples to determine the identity of the audio sample.
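The pipeline described above (time-frequency transform, interest point extraction, descriptor computation, comparison) can be sketched as follows. This is a minimal illustration under stated assumptions, not the method of any particular system: interest points are taken to be per-frame spectral peaks, descriptors are taken to be (frequency bin, frequency bin, frame gap) tuples over nearby point pairs, and all function names, the frame size, the hop, the threshold, and the fan-out are hypothetical choices made for the example.

```python
import cmath
import math


def stft_magnitudes(signal, frame_size=64, hop=32):
    """Magnitude spectrogram: one list of bin magnitudes per frame."""
    # Periodic Hann window to reduce spectral leakage.
    window = [0.5 - 0.5 * math.cos(2 * math.pi * n / frame_size)
              for n in range(frame_size)]
    frames = []
    for start in range(0, len(signal) - frame_size + 1, hop):
        chunk = [signal[start + n] * window[n] for n in range(frame_size)]
        # Naive DFT over the non-negative frequency bins
        # (slow, but keeps the sketch dependency-free).
        frames.append([
            abs(sum(chunk[n] * cmath.exp(-2j * math.pi * k * n / frame_size)
                    for n in range(frame_size)))
            for k in range(frame_size // 2 + 1)
        ])
    return frames


def interest_points(frames, threshold=1.0):
    """(frame, bin) pairs where a bin exceeds both frequency neighbors.

    Assumption: a simple spectral-peak rule stands in for whatever
    "peaks or other distinct patterns" a real system would detect.
    """
    points = []
    for t, mags in enumerate(frames):
        for k in range(1, len(mags) - 1):
            if (mags[k] > threshold
                    and mags[k] > mags[k - 1]
                    and mags[k] > mags[k + 1]):
                points.append((t, k))
    return points


def descriptors(points, fan_out=3):
    """Set of (bin_a, bin_b, frame_gap) tuples built from nearby point pairs."""
    ordered = sorted(points)
    result = set()
    for i, (t_a, k_a) in enumerate(ordered):
        # Pair each point with the next few points in time order.
        for t_b, k_b in ordered[i + 1:i + 1 + fan_out]:
            result.add((k_a, k_b, t_b - t_a))
    return result


def match_score(sample, reference):
    """Number of descriptors the two signals have in common."""
    def describe(signal):
        return descriptors(interest_points(stft_magnitudes(signal)))
    return len(describe(sample) & describe(reference))


def tone(bin_index, length=256, frame_size=64):
    """Test signal: a sine centered exactly on one STFT bin."""
    return [math.sin(2 * math.pi * bin_index * n / frame_size)
            for n in range(length)]
```

With these assumptions, `match_score(tone(8), tone(8))` is positive because both signals yield the same descriptors, while `match_score(tone(8), tone(12))` is zero because the peaks fall in different frequency bins.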
In a typical descriptor-based audio matching system, interest points uniquely characterize an audio signal; thus, there is likely little overlap between interest points of two different segments of the audio sample. Pitch shifting can affect an audio sample by shifting the frequency of its interest points. For example, when trying to match audio played on radio, television, or in a remix of a song, the speed of the audio sample may be slightly changed from the original. Samples that have altered speed will also likely have an altered pitch. Even a small pitch shift that is hard for listeners to notice may make the pitch-shifted signal difficult to match, because the pitch shift alters the interest points. In addition to pitch shifts, other distortions can be present in an audio signal, e.g., distortions related to a noisy environment where the audio signal was captured, distortions related to a very quiet audio signal, distortions caused by a microphone that captured the audio signal, distortions related to equalization problems, etc. Thus, interest points that do not change in the presence of distortion are desirable.
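The effect of a small speed change on an interest point can be illustrated with simple arithmetic. The sketch below assumes the speed change is a plain resampling with no pitch correction, so that frequency scales by the speed factor and time scales by its inverse. The sample rate, FFT size, tone frequency, and function names are hypothetical values chosen for the example.

```python
def shifted_interest_point(time_s, freq_hz, speed):
    """Location of an interest point after playback at `speed` times the
    original rate, assuming plain resampling (no pitch correction):
    time shrinks by 1/speed and frequency grows by speed."""
    return time_s / speed, freq_hz * speed


def to_bin(freq_hz, sample_rate, fft_size):
    """Nearest STFT frequency bin for a frequency in Hz."""
    return round(freq_hz * fft_size / sample_rate)


# Hypothetical setup: 8 kHz audio analyzed with a 1024-point FFT.
SAMPLE_RATE = 8000
FFT_SIZE = 1024

original_bin = to_bin(1000.0, SAMPLE_RATE, FFT_SIZE)

# A 3% speed-up, which many listeners would not notice.
_, shifted_freq = shifted_interest_point(2.0, 1000.0, 1.03)
shifted_bin = to_bin(shifted_freq, SAMPLE_RATE, FFT_SIZE)
```

Under these assumptions a 1000 Hz peak moves from bin 128 to bin 132, so a descriptor that encodes the exact frequency bin no longer matches the reference even though the change is barely audible.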
In addition, if too many interest points are generated for a finite audio sample, the scalability of an audio matching system can be negatively impacted. For example, as the number of interest points generated increases, the number of descriptors generated typically increases as well. If additional descriptors are generated for each audio sample within a reference index, the reference index can become too large. However, if not enough interest points are generated for a finite audio sample, or the interest points generated are not of sufficient quality, the accuracy of audio matching can be negatively impacted. Thus, to increase scalability while maintaining accuracy, it is desirable to have a small but uniform number of high-quality interest points generated from both the audio sample and the reference sample.
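One way to obtain a small but uniform number of interest points is to cap the number kept per fixed-length time window. The sketch below is an illustration only: it assumes peak magnitude is an acceptable stand-in for interest point quality, which the description above does not specify, and the function name and parameters are hypothetical.

```python
import heapq
from collections import defaultdict


def select_uniform(points, window_frames, per_window):
    """Keep at most `per_window` strongest points in each time window.

    points: iterable of (frame, bin, magnitude) tuples.
    Assumption: magnitude is used as a proxy for point quality.
    """
    buckets = defaultdict(list)
    for frame, bin_index, magnitude in points:
        # Group points by the fixed-length time window they fall in.
        buckets[frame // window_frames].append((frame, bin_index, magnitude))

    kept = []
    for window in sorted(buckets):
        kept.extend(
            heapq.nlargest(per_window, buckets[window], key=lambda p: p[2]))
    return sorted(kept)


candidates = [
    (0, 5, 1.0), (1, 7, 9.0), (2, 3, 4.0), (3, 9, 0.5), (4, 2, 6.0),  # dense
    (12, 8, 2.0),                                                     # sparse
]
selected = select_uniform(candidates, window_frames=10, per_window=2)
```

Here the dense first window is reduced to its two strongest points while the sparse second window keeps its single point, which bounds the number of descriptors generated per unit of audio for both the audio sample and the reference sample.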