Audio matching identifies a recorded audio sample by comparing it to a set of reference samples. One example of a recorded audio sample is the audio track of a video. To make the comparison, the audio sample can be transformed into a time-frequency representation (a spectrogram) using, for example, a short-time Fourier transform (STFT). From the time-frequency representation, interest points that characterize the time and frequency locations of peaks or other distinct patterns of the spectrogram can then be extracted. Fingerprints can be computed as functions of sets of interest points. Fingerprints of the audio sample can then be compared to fingerprints of the reference samples to determine the identity of the audio sample.
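The pipeline described above (time-frequency transform, interest points, fingerprints, comparison) can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and is not taken from the source: the function names, the frame size and hop, the choice of one spectral peak per frame as the interest point, the pairing of nearby peaks into `(f1, f2, dt)` tuples as the fingerprint function, and Jaccard overlap as the comparison. A naive DFT from the standard library stands in for an optimized STFT.

```python
import cmath
import math


def stft_magnitudes(samples, frame_size=64, hop=32):
    """Magnitude spectrogram: one list of bin magnitudes per Hann-windowed frame."""
    frames = []
    for start in range(0, len(samples) - frame_size + 1, hop):
        windowed = [
            samples[start + n] * (0.5 - 0.5 * math.cos(2 * math.pi * n / frame_size))
            for n in range(frame_size)
        ]
        spectrum = []
        for k in range(frame_size // 2 + 1):
            # Naive DFT of bin k (illustrative only; a real system would use an FFT).
            acc = sum(
                windowed[n] * cmath.exp(-2j * math.pi * k * n / frame_size)
                for n in range(frame_size)
            )
            spectrum.append(abs(acc))
        frames.append(spectrum)
    return frames


def interest_points(frames):
    """Assumed interest point: (frame index, strongest frequency bin) per frame."""
    return [
        (t, max(range(len(mags)), key=mags.__getitem__))
        for t, mags in enumerate(frames)
    ]


def fingerprints(points, fan_out=3):
    """Assumed fingerprint function: pair each point with the next few points."""
    prints = set()
    for i, (t1, f1) in enumerate(points):
        for t2, f2 in points[i + 1 : i + 1 + fan_out]:
            prints.add((f1, f2, t2 - t1))
    return prints


def similarity(a, b):
    """Jaccard overlap between two fingerprint sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def tone_sequence(bins, frame_size=64, samples_per_tone=128):
    """Synthetic test signal: consecutive sine tones centered on the given bins."""
    return [
        math.sin(2 * math.pi * k * n / frame_size)
        for k in bins
        for n in range(samples_per_tone)
    ]


def fingerprint_signal(samples):
    return fingerprints(interest_points(stft_magnitudes(samples)))


reference = fingerprint_signal(tone_sequence([5, 9, 14, 7]))
same_recording = fingerprint_signal(tone_sequence([5, 9, 14, 7]))
other_recording = fingerprint_signal(tone_sequence([22, 27, 20, 30]))

score_same = similarity(reference, same_recording)
score_other = similarity(reference, other_recording)
```

In this sketch the identical signal yields an identical fingerprint set, while a signal built from different tones shares few or no fingerprints with the reference. A practical system would typically extract several interest points per frame and index the reference fingerprints for fast lookup rather than compare sets pairwise.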
Different types of fingerprints can be used for audio matching. For example, a melody fingerprint can be generated from interest points of an audio sample that relate to the musical composition. In contrast, audio-id fingerprints can be generated from interest points that relate to every aspect of the audio sample, to aid in identifying the exact same sound recording. Because audio-id is designed for high precision and exactness, audio matching that uses solely audio-id fingerprints can fail to identify a pitch-shifted version of a reference as that reference. A media sharing service relies on users to upload content. In general, the media sharing service provider has little control over what content users upload into the system beyond, for example, limiting acceptable file formats. An audio matching system that has no control over the content it is to match benefits from being resistant to pitch-shifted content uploaded by users. Therefore, there exists a need to improve audio-id matching to be more resistant to pitch-shifting.
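To make the pitch-shifting problem concrete, the following sketch shows why fingerprints that encode absolute frequency positions can stop matching after a pitch shift. It is a simplified model under stated assumptions, not the audio-id method itself: interest points are hypothetical `(time, frequency_bin)` pairs, the pitch shift is modeled as scaling every frequency bin by a constant ratio while leaving timing unchanged, and the fingerprint is an assumed `(f1, f2, dt)` tuple over nearby points.

```python
def exact_fingerprints(points, fan_out=3):
    """Assumed exact fingerprint: (f1, f2, dt) over each point and its next few neighbors."""
    prints = set()
    for i, (t1, f1) in enumerate(points):
        for t2, f2 in points[i + 1 : i + 1 + fan_out]:
            prints.add((f1, f2, t2 - t1))
    return prints


def pitch_shift(points, semitones):
    """Model a tempo-preserving pitch shift: scale each frequency bin by 2**(semitones/12)."""
    ratio = 2 ** (semitones / 12)
    return [(t, round(f * ratio)) for t, f in points]


def overlap(query, reference):
    """Fraction of query fingerprints that also occur in the reference."""
    return len(query & reference) / len(query) if query else 0.0


# Hypothetical interest points for a reference recording: (time, frequency bin).
reference_points = [(0, 40), (1, 52), (2, 47), (3, 61), (4, 55), (5, 70)]
shifted_points = pitch_shift(reference_points, semitones=1)

reference_prints = exact_fingerprints(reference_points)
unshifted_overlap = overlap(exact_fingerprints(reference_points), reference_prints)
shifted_overlap = overlap(exact_fingerprints(shifted_points), reference_prints)
```

With these example points, the unshifted query matches every reference fingerprint, while the query shifted by a single semitone matches none, because every frequency value in the tuples has moved to a different bin.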