Content-based audio recognition is the process of identifying similarities between the audio content of audio files. Performing content-based audio recognition usually involves comparing the audio content of a given audio file, called the query audio file, to the audio content of one or more other audio files, called the reference audio file(s). In many commercial applications, the number of reference audio files is very large, possibly on the order of millions.
The need for accurate, fast, and scalable content-based audio recognition is readily apparent in a wide range of practical situations. For example, the owner of a large musical catalogue may wish to determine whether a newly delivered song exists within that catalogue, even if the musical catalogue contains many millions of entries, and even if the arriving song has no associated metadata besides the audio signal.
Many different content-based audio identification methods are well known in the prior art. Generally speaking, such methods consist of four phases. In a reference fingerprint ingestion phase, one or more fingerprints, called reference fingerprints, are extracted from the audio content information in each of the reference audio files, and stored in a database, called the reference database. In a query fingerprint extraction phase, one or more fingerprints, called query fingerprints, are extracted from the audio content information in the query audio file. In a fingerprint matching phase, the query fingerprints are compared to the reference fingerprints in the reference database, to assess their similarity. In a decision-making phase, a set of decision-making rules is applied to assess whether audio content of the query audio file is similar (or identical) to audio content of one or more of the reference audio files.
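The four phases described above can be illustrated with a minimal sketch. It is not taken from any particular prior-art method: the toy fingerprint (a run of consecutive sample values), the names `extract_fingerprints`, `ReferenceDatabase`, `decide` and `recognize`, and the 0.5 decision threshold are all assumptions chosen only to make the phase boundaries visible.

```python
from collections import defaultdict


def extract_fingerprints(samples, width=4):
    """Toy fingerprint (assumption): every run of `width` consecutive values."""
    return {tuple(samples[i:i + width]) for i in range(len(samples) - width + 1)}


class ReferenceDatabase:
    """Hypothetical reference database: fingerprint -> ids of files containing it."""

    def __init__(self):
        self.index = defaultdict(set)

    def ingest(self, ref_id, samples):
        # Phase 1: reference fingerprint ingestion.
        for fp in extract_fingerprints(samples):
            self.index[fp].add(ref_id)

    def match(self, query_fps):
        # Phase 3: fingerprint matching (count shared fingerprints per reference).
        counts = defaultdict(int)
        for fp in query_fps:
            for ref_id in self.index.get(fp, ()):
                counts[ref_id] += 1
        return dict(counts)


def decide(match_counts, n_query_fps, threshold=0.5):
    # Phase 4: decision-making rule (assumed): a reference is reported when at
    # least `threshold` of the query fingerprints were found in it.
    return sorted(
        ref_id
        for ref_id, count in match_counts.items()
        if n_query_fps and count / n_query_fps >= threshold
    )


def recognize(db, query_samples, threshold=0.5):
    query_fps = extract_fingerprints(query_samples)   # Phase 2: query extraction
    counts = db.match(query_fps)
    return decide(counts, len(query_fps), threshold)


db = ReferenceDatabase()
db.ingest("song_a", [1, 2, 3, 4, 5, 6, 7, 8])
db.ingest("song_b", [9, 8, 7, 6, 5, 4, 3, 2])
print(recognize(db, [3, 4, 5, 6, 7]))
```

In a practical system each phase would be far more elaborate; the sketch only shows how ingestion, extraction, matching and decision-making hand data to one another.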
A given content-based audio recognition method's performance depends heavily on the methods it uses in each of these phases. The methods used in these phases vary considerably across content-based audio recognition methods in the prior art. Some content-based audio recognition techniques may implement these four phases in a different order from the one presented here, some may implement additional phases, and some may implement several phases in parallel. Nevertheless, these phases form the core of content-based audio recognition.
Whereas the format and information content of fingerprints may vary between applications, all content-based audio recognition methods share an important similarity. By their very nature, the fingerprints that content-based audio identification methods extract have a functional link to the sounds or events (such as rhythmic or melodic structure or timbre) in the corresponding audio files. In prior-art content-based audio identification methods, these fingerprints are typically extracted using pre-specified recipes. For example, a method for extracting fingerprints is disclosed in U.S. Pat. No. 8,586,847, by Ellis et al. Here, a music sample is filtered into a plurality of frequency bands, and inter-onset intervals are detected within each of these bands. Codes are generated by associating frequency bands and inter-onset intervals. For a given sample, all generated codes, along with the time stamps indicating when the associated onset occurred within the music sample, are combined to form a fingerprint.
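The idea of pairing frequency bands with inter-onset intervals can be sketched as follows. This is a simplified illustration and not the method disclosed in U.S. Pat. No. 8,586,847: band filtering and onset detection are assumed to have been done already, onset times are given as integer milliseconds, and the function name `onset_codes` and the parameter `max_successors` are hypothetical.

```python
def onset_codes(onsets_by_band, max_successors=1):
    """Build (code, time stamp) pairs from per-band onset times.

    onsets_by_band: dict mapping a band index to a sorted list of onset
    times in milliseconds (assumed to come from an earlier filtering and
    onset-detection step that is not shown here).
    A code is the pair (band index, inter-onset interval).
    """
    fingerprint = []
    for band, onsets in sorted(onsets_by_band.items()):
        for i, onset_time in enumerate(onsets):
            # Pair each onset with up to `max_successors` later onsets
            # in the same band.
            for later in onsets[i + 1:i + 1 + max_successors]:
                code = (band, later - onset_time)
                fingerprint.append((code, onset_time))
    # Order the codes by the time stamp of the onset that produced them.
    fingerprint.sort(key=lambda item: item[1])
    return fingerprint


example = onset_codes({0: [0, 500, 1000], 1: [250, 1250]})
print(example)
```

Because band 0 in this example has evenly spaced onsets, the same code `(0, 500)` appears twice with different time stamps, which shows how periodic material produces repeated codes.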
This functional link between fingerprints and sounds or events creates several important problems when implementing existing content-based audio recognition methods in practice. For example, in audio files with repetitive audio content, such as recorded music, audio repetitions can cause a single query fingerprint to be similar to a large number of reference fingerprints, or vice versa. This can cause content-based audio recognition methods to reach an erroneous conclusion that a query audio file and a reference audio file are similar, even in cases where these audio files share no audio content information except the repetitive audio content. Hence, an improved way of making a more accurate decision regarding such possible similarities would be advantageous.
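The effect of repetition on matching can be made concrete with a small sketch. The windowed toy fingerprint and the names `windows` and `match_multiplicity` are assumptions for illustration only; the point is simply that a looped passage makes one query fingerprint coincide with several reference fingerprints.

```python
from collections import Counter


def windows(samples, width=4):
    """Toy fingerprints (assumption): all runs of `width` consecutive values."""
    return [tuple(samples[i:i + width]) for i in range(len(samples) - width + 1)]


def match_multiplicity(query, reference, width=4):
    """For each distinct query fingerprint found in the reference,
    count how many reference fingerprints it is identical to."""
    ref_counts = Counter(windows(reference, width))
    return {
        fp: ref_counts[fp]
        for fp in set(windows(query, width))
        if ref_counts[fp] > 0
    }


loop = [1, 2, 3, 4]
repetitive_reference = loop * 4 + [7, 8, 9]      # a looped passage plus a tail
plain_reference = list(range(20, 40))            # no repetition

print(match_multiplicity(loop, repetitive_reference))
print(match_multiplicity([20, 21, 22, 23], plain_reference))
```

With the looped reference, a single query fingerprint matches four reference fingerprints, whereas with the non-repetitive reference it matches exactly one. A decision rule that merely counts matches would therefore overweight the repeated passage.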
The present inventors have therefore realized that, in addition to a reference fingerprint ingestion phase, a query fingerprint extraction phase, a fingerprint matching phase and a decision-making phase, it is also very important to implement a further phase, called a fingerprint clustering phase, in which fingerprints are clustered into groups of similar fingerprints.
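One possible way to group similar fingerprints is sketched below. The text above does not specify a clustering criterion, so everything here is an assumption made for illustration: fingerprints are represented as integers interpreted as bit strings, similarity is measured by Hamming distance, and the greedy single-pass strategy, the function names and the `max_distance` value are hypothetical.

```python
def hamming(a, b):
    """Number of differing bits between two integer-encoded fingerprints."""
    return bin(a ^ b).count("1")


def cluster_fingerprints(fingerprints, max_distance=2):
    """Greedy single-pass clustering (an assumed strategy, not the source's).

    Each fingerprint joins the first cluster whose representative (the
    cluster's first member) is within `max_distance` bits; otherwise it
    starts a new cluster.
    """
    clusters = []  # list of (representative, members)
    for fp in fingerprints:
        for representative, members in clusters:
            if hamming(representative, fp) <= max_distance:
                members.append(fp)
                break
        else:
            # No existing cluster was close enough.
            clusters.append((fp, [fp]))
    return [members for _, members in clusters]


fingerprints = [0b11110000, 0b11110001, 0b00001111, 0b00001101, 0b11110011]
print(cluster_fingerprints(fingerprints))
```

Once fingerprints are grouped in this way, near-duplicate fingerprints arising from repeated audio content could be treated as a single group rather than as many independent matches.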