Audio matching identifies a recorded audio sample by comparing it to a set of reference samples. One example of a recorded audio sample is the audio track of a video. To make the comparison, the audio sample can be transformed into a time-frequency representation (a spectrogram) by using, for example, a short-time Fourier transform (STFT). Interest points that characterize the time and frequency locations of peaks or other distinct patterns in the spectrogram can then be extracted. Descriptors can be computed as functions of sets of interest points. Descriptors of the audio sample can then be compared to descriptors of the reference samples to determine the identity of the audio sample.
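The pipeline described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the method of any particular matching system: the function names (stft_magnitudes, interest_points, descriptors), the frame size, the hop, the threshold, and the choice of pairing peaks into (frequency, frequency, time difference) descriptors are all hypothetical and chosen only to keep the example short.

```python
import cmath
import math


def stft_magnitudes(samples, frame_size=64, hop=32):
    """Return a magnitude spectrogram as a list of frames (lists of bin magnitudes)."""
    # Periodic Hann window to reduce spectral leakage at the frame edges.
    window = [0.5 - 0.5 * math.cos(2 * math.pi * n / frame_size)
              for n in range(frame_size)]
    spectrogram = []
    for start in range(0, len(samples) - frame_size + 1, hop):
        frame = [samples[start + n] * window[n] for n in range(frame_size)]
        # Naive DFT over the non-negative frequency bins (slow, but fine for a sketch).
        mags = [abs(sum(frame[n] * cmath.exp(-2j * math.pi * k * n / frame_size)
                        for n in range(frame_size)))
                for k in range(frame_size // 2 + 1)]
        spectrogram.append(mags)
    return spectrogram


def interest_points(spectrogram, threshold=1.0):
    """Return (frame_index, bin_index) for spectral peaks at or above the threshold."""
    points = []
    for t, mags in enumerate(spectrogram):
        for f in range(1, len(mags) - 1):
            # Assumed peak rule: strictly larger than both frequency neighbours.
            if mags[f] >= threshold and mags[f - 1] < mags[f] > mags[f + 1]:
                points.append((t, f))
    return points


def descriptors(points, max_dt=3):
    """Pair each point with later points; return (descriptor, anchor_time) tuples."""
    out = []
    for t1, f1 in points:
        for t2, f2 in points:
            if 0 < t2 - t1 <= max_dt:
                # Assumed descriptor: both peak frequencies plus their time gap.
                out.append(((f1, f2, t2 - t1), t1))
    return out
```

Each descriptor is stored together with the time of its anchor point, so that a later comparison against reference descriptors can report where in the sample a shared descriptor occurred.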
In a typical large-scale, descriptor-based audio matching system, the set of reference samples can number in the millions or tens of millions. When descriptors of an audio sample are compared with descriptors of millions of reference samples, many reference candidates can contain one or more “hits” (e.g., a shared descriptor at a particular time in the audio sample and the reference candidate) between the audio sample descriptors and the reference sample descriptors. One of the reference candidates containing one or more hits is likely a true positive match; however, the other reference candidates containing hits are likely not, and many or all of them should be discarded as false positives.
Typically, a match between a probe sample and a specific reference sample is determined by examining the hits shared between the descriptors of the probe sample and the descriptors of the specific reference sample. Each hit can be associated with a time in the probe sample and a time in the reference sample. Because a hit indicates a match only at a particular point in time, additional hits can be aggregated over time by looking along a projection of hits. Generating a projection of hits for each potential match, e.g., any reference descriptor containing a hit, in a large-scale matching system can be computationally expensive; thus, there exists a need to filter out as many false positive matches as possible prior to generating a projection of hits for potential matches.
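To make the notions of a hit and of aggregating hits over time concrete, the following sketch collects hits between a probe and one reference and then groups them by time offset. It assumes that descriptors are hashable values stored with their times, and it assumes that the projection can be approximated by a histogram of (reference time minus probe time); both the helper names (find_hits, best_offset) and that interpretation are illustrative assumptions, not details taken from the text above.

```python
from collections import Counter, defaultdict


def find_hits(probe, reference):
    """Return (probe_time, ref_time) pairs for descriptors shared by both inputs.

    Each input is a list of (descriptor, time) tuples.
    """
    index = defaultdict(list)
    for desc, t_ref in reference:
        index[desc].append(t_ref)
    return [(t_probe, t_ref)
            for desc, t_probe in probe
            for t_ref in index.get(desc, [])]


def best_offset(hits):
    """Group hits by time offset; return (offset, count) for the largest group."""
    if not hits:
        return None
    # Hits from a true match tend to share one offset; stray hits scatter.
    offsets = Counter(t_ref - t_probe for t_probe, t_ref in hits)
    return offsets.most_common(1)[0]
```

For example, a probe [("a", 0), ("b", 1), ("c", 2)] compared with a reference [("x", 9), ("a", 10), ("b", 11), ("c", 12)] yields three hits that all share offset 10, whereas a reference containing only ("b", 40) yields a single isolated hit. Running this aggregation for every reference that contains any hit is the per-candidate cost that an earlier filtering step would aim to avoid.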