Audio matching identifies a recorded audio sample (e.g., an audio track of a video) by comparing the audio sample to a set of reference samples. To make the comparison, the audio sample can be transformed to a time-frequency representation (e.g., by using a short-time Fourier transform (STFT)). From the time-frequency representation, interest points that characterize the time and frequency locations of peaks or other distinct patterns of a spectrogram can be extracted. Descriptors can be computed as functions of sets of interest points. Descriptors of the audio sample can then be compared to descriptors of the reference samples to determine the identity of the audio sample.
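The pipeline described above (time-frequency transform, interest point extraction, descriptor computation) can be illustrated with a minimal sketch. It is not taken from any particular matching system. The function names, the Hann window, the 8-neighbour local-maximum rule, the magnitude threshold, and the pairwise (f1, f2, dt) descriptor are all assumptions chosen for illustration, and the naive DFT is written for readability rather than speed.

```python
import cmath
import math


def stft_magnitude(samples, frame_size=64, hop=32):
    """Magnitude spectrogram as a list of frames; each frame holds bins 0..frame_size//2.

    Illustrative only: a naive O(N^2) DFT per frame with a Hann window.
    """
    window = [0.5 - 0.5 * math.cos(2 * math.pi * n / (frame_size - 1))
              for n in range(frame_size)]
    frames = []
    for start in range(0, len(samples) - frame_size + 1, hop):
        frame = [samples[start + n] * window[n] for n in range(frame_size)]
        mags = []
        for k in range(frame_size // 2 + 1):
            acc = sum(frame[n] * cmath.exp(-2j * math.pi * k * n / frame_size)
                      for n in range(frame_size))
            mags.append(abs(acc))
        frames.append(mags)
    return frames


def interest_points(spec, threshold):
    """Return (t, f) cells that reach the threshold and have no larger neighbour.

    The threshold and the 8-neighbour rule are assumptions, not a prescribed method.
    """
    points = []
    for t in range(len(spec)):
        for f in range(len(spec[t])):
            value = spec[t][f]
            if value < threshold:
                continue
            neighbours = [
                spec[tt][ff]
                for tt in range(max(0, t - 1), min(len(spec), t + 2))
                for ff in range(max(0, f - 1), min(len(spec[t]), f + 2))
                if (tt, ff) != (t, f)
            ]
            if all(value >= other for other in neighbours):
                points.append((t, f))
    return points


def descriptors(points, fan_out=3, max_dt=10):
    """Pair each point with up to fan_out later points within max_dt frames.

    Each result is ((f1, f2, dt), t1): a hypothetical descriptor plus its anchor time.
    """
    ordered = sorted(points)
    result = []
    for i, (t1, f1) in enumerate(ordered):
        paired = 0
        for t2, f2 in ordered[i + 1:]:
            dt = t2 - t1
            if dt > max_dt:
                break
            if dt == 0:
                continue  # skip points in the same frame
            result.append(((f1, f2, dt), t1))
            paired += 1
            if paired == fan_out:
                break
    return result


# Usage: a pure tone centred on bin 8 of a 64-sample frame.
tone = [math.sin(2 * math.pi * 8 * n / 64) for n in range(256)]
tone_points = interest_points(stft_magnitude(tone), threshold=5.0)
tone_descriptors = descriptors(tone_points)
```

Because each descriptor in this sketch depends only on two frequency bins and a time difference, it is unchanged when the sample is shifted in time, which is what lets a short query be looked up against descriptors computed from a longer reference.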
In a typical descriptor-based audio matching system, interest points uniquely characterize an audio signal; thus, there is likely to be little overlap between the interest points of two different segments of the audio sample. When descriptors of an audio sample are compared to tens of millions of reference descriptors in a reference index, multiple potential matches can be identified. Potential matches can be validated using a more precise measurement of interest point overlap. For example, identified interest points of the audio sample can be compared to known interest points in each potentially matching reference sample to determine whether the audio sample and the reference sample are a true match or a false positive. However, in a voluminous database of tens of millions of reference samples, there may still be multiple reference samples with a relatively high degree of interest point overlap, which hampers the validation process.
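The validation step described above can be sketched as follows, assuming interest points are represented as (time frame, frequency bin) pairs. The offset-voting approach, the function names, and the 0.6 overlap threshold are hypothetical choices made for illustration; the text does not specify how overlap is measured.

```python
from collections import Counter


def best_offset(sample_points, reference_points):
    """Vote for the time offset that aligns the most points sharing a frequency bin.

    Returns (offset, count), or (None, 0) when no frequency bin is shared.
    """
    times_by_freq = {}
    for t, f in set(reference_points):
        times_by_freq.setdefault(f, []).append(t)
    votes = Counter()
    for t, f in set(sample_points):
        for ref_t in times_by_freq.get(f, []):
            votes[ref_t - t] += 1
    if not votes:
        return None, 0
    # Prefer the highest vote count, then the smallest absolute offset.
    return max(votes.items(), key=lambda kv: (kv[1], -abs(kv[0])))


def overlap_fraction(sample_points, reference_points):
    """Fraction of distinct sample points that coincide with the reference at the best offset."""
    distinct = set(sample_points)
    if not distinct:
        return 0.0
    _, count = best_offset(distinct, reference_points)
    return count / len(distinct)


def is_match(sample_points, reference_points, min_overlap=0.6):
    """Accept a candidate only if the overlap reaches an assumed threshold."""
    return overlap_fraction(sample_points, reference_points) >= min_overlap


# Usage: one candidate that aligns at offset 10, and one that barely overlaps.
sample = [(0, 3), (2, 5), (5, 7)]
true_reference = [(10, 3), (12, 5), (15, 7), (20, 3)]
false_positive = [(1, 3), (9, 9)]
accepted = [name for name, ref in (("true_reference", true_reference),
                                   ("false_positive", false_positive))
            if is_match(sample, ref)]
```

A fixed threshold of this kind also shows the difficulty noted above: when several references in a very large index each exceed the threshold, the overlap score alone does not say which one is the true match.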