Audio matching identifies a recorded audio sample by comparing it to a set of reference samples. One example of a recorded audio sample is the audio track of a video. To make the comparison, the audio sample can be transformed into a time-frequency representation, for example a spectrogram computed with a short-time Fourier transform (STFT). From the time-frequency representation, interest points that characterize the time and frequency locations of peaks or other distinct patterns of the spectrogram can then be extracted. Descriptors can be computed as functions of sets of interest points. Descriptors of the audio sample can then be compared to descriptors of reference samples to determine the identity of the audio sample.
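The pipeline described above (time-frequency transform, interest point extraction, descriptor computation) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not taken from any particular matching system: the function names, the frame size and hop, the 3x3 local-maximum rule for interest points, and the choice of a descriptor as a pair of peak frequencies plus their time difference. A naive DFT is used only to keep the sketch free of dependencies; a practical system would use an FFT library.

```python
import cmath
import math

def stft_magnitudes(samples, frame_size=64, hop=32):
    """Return one magnitude spectrum per Hann-windowed frame (naive DFT)."""
    window = [0.5 - 0.5 * math.cos(2 * math.pi * n / frame_size)
              for n in range(frame_size)]
    spectrogram = []
    for start in range(0, len(samples) - frame_size + 1, hop):
        frame = [samples[start + n] * window[n] for n in range(frame_size)]
        spectrum = []
        for k in range(frame_size // 2 + 1):
            acc = sum(frame[n] * cmath.exp(-2j * math.pi * k * n / frame_size)
                      for n in range(frame_size))
            spectrum.append(abs(acc))
        spectrogram.append(spectrum)
    return spectrogram

def interest_points(spectrogram, threshold):
    """Return (frame, bin) cells that exceed threshold and are strict
    maxima of their 3x3 time-frequency neighborhood."""
    points = []
    for t in range(1, len(spectrogram) - 1):
        for f in range(1, len(spectrogram[t]) - 1):
            value = spectrogram[t][f]
            if value < threshold:
                continue
            neighbors = [spectrogram[t + dt][f + df]
                         for dt in (-1, 0, 1) for df in (-1, 0, 1)
                         if (dt, df) != (0, 0)]
            if all(value > other for other in neighbors):
                points.append((t, f))
    return points

def descriptors(points, max_dt=8):
    """Pair each interest point with later ones. Each descriptor is
    (bin_1, bin_2, frame_gap), stored with the frame of the first point."""
    result = []
    for i, (t1, f1) in enumerate(points):
        for t2, f2 in points[i + 1:]:
            if 0 < t2 - t1 <= max_dt:
                result.append(((f1, f2, t2 - t1), t1))
    return result

def tone_burst(total, start, length, bin_index, frame_size=64):
    """Synthetic test signal: silence with one short sinusoidal burst."""
    return [math.cos(2 * math.pi * bin_index * n / frame_size)
            if start <= n < start + length else 0.0
            for n in range(total)]

# Two bursts at different frequencies and times stand in for real audio.
burst_a = tone_burst(384, 64, 64, 8)
burst_b = tone_burst(384, 192, 64, 20)
signal = [a + b for a, b in zip(burst_a, burst_b)]

spec = stft_magnitudes(signal)
points = interest_points(spec, threshold=12.0)
descs = descriptors(points)
```

In this synthetic case each burst yields one interest point, and the single descriptor records the two peak frequencies together with the number of frames between them. Real descriptors may combine more than two interest points.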
Typically, a match between a probe sample and a specific reference sample is determined by examining hits, that is, descriptors that the probe sample and the reference sample have in common. Each hit can be associated with a time in the probe sample and a time in the reference sample. Because a single hit indicates a match only at a particular point in time, additional hits can be aggregated over time by looking along a projection of hits. However, if a probe sample is sped up or slowed down relative to a reference sample, the probe hit times and the reference hit times may not align in a manner that indicates a positive match. This presents a challenge for audio matching, because transformations that affect the speed of audio samples are common, for example in samples recorded over broadcast radio. Thus, there exists a need to accurately match audio samples that suffer from time stretch or time compression distortions.
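The alignment problem can be made concrete with a short sketch. It assumes one common way of aggregating hits, namely projecting them onto the offset axis (reference time minus probe time) and counting the fullest histogram bin; the function names, the bin width, and the toy data are hypothetical. For simplicity it also assumes that the descriptors themselves survive the speed change, which need not hold in practice.

```python
from collections import Counter, defaultdict

def find_hits(probe, reference):
    """Return (probe_time, ref_time) pairs for descriptors found in both.
    Each input is a list of (descriptor, time) tuples."""
    index = defaultdict(list)
    for desc, t_ref in reference:
        index[desc].append(t_ref)
    return [(t_probe, t_ref)
            for desc, t_probe in probe
            for t_ref in index.get(desc, [])]

def best_offset_count(hits, bin_width=1.0):
    """Histogram hits by time offset and return the size of the fullest bin."""
    if not hits:
        return 0
    histogram = Counter(round((t_ref - t_probe) / bin_width)
                        for t_probe, t_ref in hits)
    return max(histogram.values())

# Toy reference: ten distinct descriptors, one every 2 seconds from t = 10.
reference = [("d%d" % i, 10.0 + 2.0 * i) for i in range(10)]

# Probe at the original speed: every hit has the same offset.
probe_same_speed = [("d%d" % i, 2.0 * i) for i in range(10)]

# Probe sped up by 25 percent: probe times shrink, so offsets drift.
speed = 1.25
probe_sped_up = [("d%d" % i, 2.0 * i / speed) for i in range(10)]

count_same = best_offset_count(find_hits(probe_same_speed, reference))
count_fast = best_offset_count(find_hits(probe_sped_up, reference))
```

With the unmodified probe, all hits fall into a single offset bin. With the sped-up probe, the same hits are smeared across several bins, so a score based on the fullest bin drops even though every descriptor still matches.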