Audio matching identifies a recorded audio sample by comparing the audio sample to a set of reference samples. To make the comparison, the audio sample can be transformed to a time-frequency representation (a spectrogram) by using, for example, a short-time Fourier transform (STFT). Using the time-frequency representation, interest points that characterize the time and frequency locations of peaks or other distinct patterns of the spectrogram can then be extracted from the audio sample. Descriptors can be computed as functions of sets of interest points. Descriptors of the audio sample can then be compared to descriptors of reference samples to determine the identity of the audio sample.
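The pipeline described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the method of any particular system: the function names, the frame length, the hop size, the choice of the strongest frequency bin per frame as the interest point, and the (f1, f2, dt) descriptor built from pairs of interest points are all hypothetical choices made for the example.

```python
import numpy as np

def stft_magnitude(signal, frame_len=256, hop=128):
    """Magnitude spectrogram, shape (num_frames, frame_len // 2 + 1)."""
    window = np.hanning(frame_len)
    num_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack([
        signal[i * hop : i * hop + frame_len] * window
        for i in range(num_frames)
    ])
    return np.abs(np.fft.rfft(frames, axis=1))

def interest_points(spec):
    """Assumed rule: the strongest frequency bin in each frame is a peak."""
    return [(t, int(np.argmax(row))) for t, row in enumerate(spec)]

def descriptors(points, fan_out=3):
    """Assumed descriptor: (f1, f2, dt) for a point and each of the next
    `fan_out` points, stored together with the time of the first point."""
    out = []
    for i, (t1, f1) in enumerate(points):
        for t2, f2 in points[i + 1 : i + 1 + fan_out]:
            out.append(((f1, f2, t2 - t1), t1))
    return out

# Example input: one second of a 1 kHz tone sampled at 8 kHz.
sample_rate = 8000
n = np.arange(sample_rate)
tone = np.sin(2 * np.pi * 1000 * n / sample_rate)

spec = stft_magnitude(tone)
points = interest_points(spec)
descs = descriptors(points)
```

With these parameters each frequency bin spans 8000 / 256 = 31.25 Hz, so the 1 kHz tone falls in bin 32 of every frame. Real systems typically use more selective peak picking and more robust descriptors than this sketch.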
In a typical descriptor-based audio matching system, the system can match the audio of a probe sample, e.g., a user-uploaded audio clip, against a set of references, allowing for a match between any range of the probe sample and any range of a reference sample. To support such matches, conventional systems generate descriptors of the probe sample based on snapshots of the probe sample at different times, which are looked up in an index of corresponding snapshots from reference samples. When a probe sample has two matching snapshot pairs, they can be combined during matching to time align the probe sample and the reference sample. In this type of system, the size of a descriptor grows as the audio sample becomes longer. Storing descriptors associated with hundreds of millions or billions of audio clips therefore becomes difficult to scale.
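The snapshot lookup and time alignment described above can be illustrated with a short sketch. It assumes an inverted index from descriptor to (reference, time) entries and a simple offset-voting scheme in which a match requires at least two snapshot pairs that agree on the same time offset. The names build_index and match, the min_votes parameter, and the toy descriptor labels are hypothetical and introduced only for this example.

```python
from collections import Counter, defaultdict

def build_index(references):
    """Map each descriptor to the (ref_id, time) entries where it occurs.

    references: {ref_id: [(descriptor, time), ...]}
    """
    index = defaultdict(list)
    for ref_id, descs in references.items():
        for desc, t_ref in descs:
            index[desc].append((ref_id, t_ref))
    return index

def match(probe_descs, index, min_votes=2):
    """Vote for (ref_id, offset) where offset = t_ref - t_probe.

    Two or more snapshot pairs that agree on the offset time-align the
    probe with that reference. Returns (ref_id, offset, votes) or None.
    """
    votes = Counter()
    for desc, t_probe in probe_descs:
        for ref_id, t_ref in index.get(desc, ()):
            votes[(ref_id, t_ref - t_probe)] += 1
    if not votes:
        return None
    (ref_id, offset), count = votes.most_common(1)[0]
    return (ref_id, offset, count) if count >= min_votes else None

# Toy data: descriptor labels stand in for real descriptor values.
references = {
    "song_a": [("d1", 10), ("d2", 12), ("d3", 15)],
    "song_b": [("d1", 3), ("d4", 7)],
}
index = build_index(references)

probe = [("d1", 0), ("d2", 2), ("d9", 4)]
result = match(probe, index)
```

Here the probe descriptors d1 and d2 both match song_a at an offset of 10 time units, so the probe aligns with song_a starting at time 10. The single d1 hit in song_b does not reach the two-pair threshold. Note that the index holds one entry per snapshot of every reference, which is why storage grows with both clip length and collection size.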
In some audio matching systems, the system can be tuned to match the entirety of an audio clip, e.g., finding full duplicates. For example, an audio matching system may be used to identify full audio tracks in a user's collection of songs by matching them against a reference database of known songs. In another example, an audio matching system may be used to discover duplicates within a large data store or collection of audio tracks. Descriptors capable of matching any range of a probe sample to any range of a reference sample could work for these examples; however, using more compact descriptors for the purpose of matching an entire audio track can be more efficient and can allow the system to scale to billions of reference samples. Therefore, the ability to generate and use more compact descriptors can be beneficial in audio matching.