Audio matching identifies a recorded audio sample by comparing the audio sample to a set of reference samples. To make the comparison, the audio sample can be transformed into a time-frequency representation, such as a spectrogram, by using, for example, a short-time Fourier transform (STFT). Using the time-frequency representation, interest points that uniquely characterize the time and frequency locations of peaks or other distinct patterns of the spectrogram can then be extracted from the audio sample. Descriptors can be computed as functions of sets of interest points. Descriptors of the audio sample can then be compared to descriptors of reference samples to determine the identity of the audio sample.
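The pipeline described above (time-frequency transform, interest point extraction, descriptor computation, descriptor comparison) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not taken from any particular matching system: the function names, the Hann window, the frame and hop sizes, the relative-magnitude peak test, the descriptor form (two frequency bins plus a time offset), and the Jaccard overlap used for comparison.

```python
import numpy as np

def stft_magnitude(signal, n_fft=1024, hop=512):
    """Magnitude spectrogram of shape (frames, n_fft // 2 + 1)."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(signal) - n_fft) // hop
    frames = np.stack([signal[i * hop:i * hop + n_fft] * window
                       for i in range(n_frames)])
    return np.abs(np.fft.rfft(frames, axis=1))

def interest_points(spec, rel_threshold=0.5):
    """(frame, bin) pairs that are local maxima along frequency and
    reach at least rel_threshold times the global maximum."""
    floor = rel_threshold * spec.max()
    points = []
    for t in range(spec.shape[0]):
        row = spec[t]
        for k in range(1, len(row) - 1):
            if row[k] >= floor and row[k] > row[k - 1] and row[k] >= row[k + 1]:
                points.append((t, k))
    return points

def descriptors(points, fan_out=3):
    """Pair each point with the next few points in time order.
    Each descriptor is (bin_1, bin_2, frame_offset)."""
    ordered = sorted(points)
    out = set()
    for i, (t1, f1) in enumerate(ordered):
        for t2, f2 in ordered[i + 1:i + 1 + fan_out]:
            out.add((f1, f2, t2 - t1))
    return out

def similarity(a, b):
    """Jaccard overlap between two descriptor sets (0.0 to 1.0)."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)
```

In this sketch a probe sample would be scored against each reference by calling `similarity` on their descriptor sets and choosing the reference with the highest overlap. Because the descriptors store absolute frequency bin indices, they match only when the peaks fall in exactly the same bins.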
In a typical descriptor-based audio matching system, interest points uniquely characterize an audio signal; thus, there is likely to be little overlap between the interest points of two different segments of an audio sample. Pitch shifting can affect an audio sample by shifting the frequencies of its interest points. For example, when trying to match audio played on the radio, on television, or in a remix of a song, the speed of the audio sample may be slightly changed from the original. Samples that have altered speed will also likely have altered pitch. Even a small pitch shift that is hard for listeners to notice may make the pitch-shifted signal difficult to match, because the pitch shift alters the interest points. Therefore, it is desirable to identify and use supplementary features of interest points that can be incorporated within a descriptor, or used to supplement a descriptor, in a manner that is robust to pitch-shift distortion.
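A short arithmetic sketch shows why a small speed change disturbs interest points. It assumes the speed change is produced by simple resampling, which multiplies every frequency by the same factor, and it uses hypothetical values for the sample rate, transform size, peak frequencies, and a 2% speed-up. The helper names are illustrative only.

```python
def peak_bin(freq_hz, sample_rate=8000, n_fft=1024):
    """Index of the STFT bin nearest to freq_hz."""
    return round(freq_hz * n_fft / sample_rate)

def shifted_bins(freqs_hz, speed_factor, sample_rate=8000, n_fft=1024):
    """Bins of the same peaks after playback at speed_factor times the
    original speed (assumes every frequency scales by speed_factor)."""
    return [peak_bin(f * speed_factor, sample_rate, n_fft) for f in freqs_hz]

peaks_hz = (500.0, 1000.0, 2000.0)           # hypothetical peak frequencies
original = [peak_bin(f) for f in peaks_hz]   # [64, 128, 256]
sped_up = shifted_bins(peaks_hz, 1.02)       # [65, 131, 261]
offsets = [s - o for s, o in zip(sped_up, original)]  # [1, 3, 5]
```

Under these assumptions the displacement grows with frequency, so the shifted peaks are not a uniform translation of the originals, and descriptors built from absolute bin indices no longer coincide with those of the reference.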