Audio matching identifies a recorded audio sample by comparing it to a set of reference samples. One example of a recorded audio sample is the audio track of a video. To make the comparison, the audio sample can be transformed to a time-frequency representation (a spectrogram) using, for example, a short-time Fourier transform (STFT). Interest points that characterize the time and frequency locations of peaks or other distinct patterns of the spectrogram can then be extracted from the time-frequency representation. Descriptors can be computed as functions of sets of interest points. The descriptors of the audio sample can then be compared to the descriptors of the reference samples to determine the identity of the audio sample.
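The steps described above (time-frequency transform, interest point extraction, descriptor computation, comparison) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not taken from any particular matching system: the function names, the naive Hann-windowed DFT standing in for an STFT, the choice of local spectral maxima as interest points, the descriptor form (a pair of frequency bins plus their frame offset), and the Jaccard overlap used as a match score.

```python
import cmath
import math


def stft_magnitudes(samples, frame_size=64, hop=32):
    """Magnitude spectrogram from a Hann-windowed naive DFT (O(N^2), illustration only)."""
    window = [0.5 - 0.5 * math.cos(2 * math.pi * n / frame_size)
              for n in range(frame_size)]
    spectrogram = []
    for start in range(0, len(samples) - frame_size + 1, hop):
        frame = [samples[start + n] * window[n] for n in range(frame_size)]
        spectrogram.append([
            abs(sum(frame[n] * cmath.exp(-2j * math.pi * k * n / frame_size)
                    for n in range(frame_size)))
            for k in range(frame_size // 2 + 1)
        ])
    return spectrogram


def interest_points(spectrogram, threshold=4.0):
    """Return (frame_index, freq_bin) for local maxima along frequency above threshold."""
    points = []
    for t, mags in enumerate(spectrogram):
        for k in range(1, len(mags) - 1):
            if mags[k] > threshold and mags[k] > mags[k - 1] and mags[k] >= mags[k + 1]:
                points.append((t, k))
    return points


def descriptors(points, max_dt=3):
    """Hypothetical descriptor: (bin_1, bin_2, frame_offset) for nearby point pairs."""
    result = set()
    for t1, f1 in points:
        for t2, f2 in points:
            if 0 < t2 - t1 <= max_dt:
                result.add((f1, f2, t2 - t1))
    return result


def fingerprint(samples):
    """Run the full chain: spectrogram -> interest points -> descriptor set."""
    return descriptors(interest_points(stft_magnitudes(samples)))


def match_score(a, b):
    """Jaccard overlap between two descriptor sets (assumed similarity measure)."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def tone(bin_index, length, frame_size=64):
    """Test signal: a sine whose frequency falls exactly on one DFT bin."""
    return [math.sin(2 * math.pi * bin_index * n / frame_size) for n in range(length)]


if __name__ == "__main__":
    reference = fingerprint(tone(8, 256))
    print(match_score(reference, fingerprint(tone(8, 256))))   # same content
    print(match_score(reference, fingerprint(tone(12, 256))))  # different content
```

The descriptor here depends only on frequency bins and relative frame offsets, so it tolerates a change in where the sample starts within the reference, but it still requires the bins and offsets to match exactly.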
In a typical descriptor-based audio matching system, interest points uniquely characterize an audio signal; thus, there is likely to be little overlap between the interest points of two different segments of an audio sample. Pitch shifting can affect an audio sample by shifting the frequency of its interest points. Time stretching can affect an audio sample by shifting the time of its interest points. For example, when trying to match audio played on the radio, on television, or in a remix of a song, the speed of the audio sample may differ slightly from that of the original. A change in speed changes the timing of interest points within the audio sample. In addition, samples that have an altered speed will likely also have an altered pitch. Even a small pitch shift that is hard for listeners to notice can make the pitch-shifted signal difficult to match, because the pitch shift alters the interest points. Therefore, it is desirable to identify and use supplementary features of interest points that can be incorporated within a descriptor, or added to a descriptor, in a manner that is robust to both pitch-shift distortion and time stretching.
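The effect of a small speed change on interest points can be made concrete with a short sketch. It assumes a simple resampling model, in which playing audio at `rate` times the original speed divides every time coordinate by `rate` and multiplies every frequency by `rate`. The function names, the 10 ms hop, and the 31.25 Hz bin width (for example, a 16 kHz sample rate with a 512-point transform) are illustrative assumptions.

```python
def apply_speed_change(points, rate):
    """Map (time_seconds, freq_hz) points under a resampling-style speed change.

    Assumed model: time is compressed by 1/rate and frequency is scaled by rate.
    """
    return [(t / rate, f * rate) for t, f in points]


def quantize(points, hop_seconds=0.01, bin_hz=31.25):
    """Snap points to the (frame_index, freq_bin) grid of a spectrogram."""
    return {(round(t / hop_seconds), round(f / bin_hz)) for t, f in points}


if __name__ == "__main__":
    # Hypothetical interest points of a reference sample.
    original = [(0.5, 500.0), (1.0, 2000.0), (10.0, 4000.0)]
    sped_up = apply_speed_change(original, rate=1.02)  # 2% faster

    before = quantize(original)
    after = quantize(sped_up)
    print(sorted(before))
    print(sorted(after))
    print("shared grid points:", len(before & after))
```

Under these assumptions a 2% speed change moves the 4000 Hz point by 80 Hz, which is more than two frequency bins, and moves the point at 10 seconds by about 20 frames, so exact comparison of quantized interest points finds nothing in common.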