Audio matching provides for identification of a recorded audio sample (e.g., an audio track of a video) by comparing the audio sample to a set of reference samples. To make the comparison, an audio sample can be transformed to a time-frequency representation (e.g., by employing a short-time Fourier transform). Using a time-frequency representation, interest points that characterize time and frequency locations of peaks or other distinct patterns of a spectrogram can be extracted from the audio sample. Audio fingerprints can be computed as functions of sets of interest points. Audio fingerprints of the audio sample can then be compared to audio fingerprints of reference samples to determine identity of the audio sample.
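The pipeline above can be sketched in a simplified form. The code below is an illustrative sketch, not the method described here: it computes a magnitude spectrogram with a windowed FFT, picks local time-frequency maxima as interest points, and hashes pairs of nearby interest points into fingerprints. All function names, window sizes, and the pairing scheme are assumptions chosen for illustration.

```python
import numpy as np

def stft_magnitude(signal, win=256, hop=128):
    """Magnitude spectrogram via a short-time Fourier transform."""
    window = np.hanning(win)
    frames = [signal[i:i + win] * window
              for i in range(0, len(signal) - win + 1, hop)]
    return np.abs(np.fft.rfft(frames, axis=1))

def interest_points(spec, threshold=0.5):
    """Time/frequency bins that are local maxima above a threshold."""
    points = []
    for t in range(1, spec.shape[0] - 1):
        for f in range(1, spec.shape[1] - 1):
            v = spec[t, f]
            # Keep bins that dominate their 3x3 neighborhood and are
            # loud relative to the global peak.
            if v > threshold * spec.max() and v == spec[t-1:t+2, f-1:f+2].max():
                points.append((t, f))
    return points

def fingerprints(points, fanout=3):
    """Hash pairs of nearby interest points into compact fingerprints.

    Each fingerprint is a function of two frequency bins and their
    time offset (one common pairing scheme, assumed here).
    """
    prints = set()
    for i, (t1, f1) in enumerate(points):
        for t2, f2 in points[i + 1:i + 1 + fanout]:
            prints.add(hash((f1, f2, t2 - t1)))
    return prints
```

In this sketch, matching a query against a reference amounts to counting fingerprints the two sets share; identical audio yields identical fingerprint sets.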
Video matching works similarly to audio matching in that it provides for identification of a recorded video sample by comparing video frame features of the video sample to a set of reference video features related to a set of reference videos. To make the comparison, a set of mean frames of the video sample can be identified based on a sliding time window over the video sample. Unique video features based on the set of mean frames can then be identified. Video fingerprints can be generated based on unique video features identified through the mean frames. Video fingerprints of the video sample can then be compared to video fingerprints of reference videos to determine identity of the video sample.
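A minimal sketch of this idea follows, under stated assumptions: frames are grayscale intensity arrays, a sliding window averages consecutive frames into mean frames, and each mean frame is reduced to a coarse bit-vector feature. The window size, block grid, and feature definition are illustrative choices, not the specific features described above.

```python
import numpy as np

def mean_frames(frames, window=4, step=2):
    """Average groups of consecutive frames in a sliding time window."""
    return [np.mean(frames[i:i + window], axis=0)
            for i in range(0, len(frames) - window + 1, step)]

def frame_feature(mean_frame, grid=4):
    """Coarse feature: which grid blocks are brighter than average.

    Divides the mean frame into grid x grid blocks and emits one bit
    per block, yielding a compact, comparison-friendly fingerprint.
    """
    h, w = mean_frame.shape
    bh, bw = h // grid, w // grid
    blocks = np.array([mean_frame[r*bh:(r+1)*bh, c*bw:(c+1)*bw].mean()
                       for r in range(grid) for c in range(grid)])
    return (blocks > blocks.mean()).astype(np.uint8)

def video_fingerprints(frames):
    """One bit-vector fingerprint per mean frame of the video."""
    return [frame_feature(m) for m in mean_frames(frames)]
```

Comparing two videos then reduces to comparing sequences of these bit vectors (e.g., by Hamming distance), with identical videos producing identical fingerprint sequences.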
In some content matching systems, it is desirable to identify strictly audiovisual matches, e.g., matches that occur on both the audio channel and the video channel. In most audiovisual content matching systems, the audio and video channels are treated independently, using different fingerprints and matching steps; results from the audio matching process and the video matching process are then merged to determine an audiovisual match.
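The merging step can be illustrated with a small sketch. The dictionary-of-scores format and the min-score combining policy are assumptions for illustration; real systems may align match time ranges and weight channels differently.

```python
def audiovisual_matches(audio_matches, video_matches):
    """Keep only references matched on both the audio and video channels.

    audio_matches / video_matches: dicts mapping reference id -> match
    score, produced by the independent audio and video matching steps
    (a hypothetical result format).
    """
    common = audio_matches.keys() & video_matches.keys()
    # Combine per-channel scores; taking the minimum requires both
    # channels to match strongly (one of several plausible policies).
    return {ref: min(audio_matches[ref], video_matches[ref]) for ref in common}
```

References matched on only one channel are discarded, which is what distinguishes a strictly audiovisual match from an audio-only or video-only match.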