Audio fingerprinting and song recognition technologies can identify a song from a user-defined audio clip, usually 30 seconds long or less. A smartphone equipped with such technology (e.g., Shazam, SoundHound, or Gracenote) can quickly identify an unknown song playing on the radio once the user captures an audio clip for analysis. Similar services can scan a music library (i.e., a collection of files, each file representing a song) on a computer to correct the metadata associated with each file and to detect duplicate songs.
While useful for identifying user-defined audio clips, these systems depend on a user-defined sound clip (i.e., user input instructing the system to begin analysis of an ongoing audio stream or of existing audio files stored in a predetermined location, such as a music library), and they cannot operate without oversight. Their detection accuracy is generally below 90%, and they cannot operate continuously and on the fly to pick songs out of an audio stream that includes sounds other than music (e.g., a live show or a radio station broadcast), such as speech.
In other words, one limitation of existing techniques is that they record only a short segment of audio, run a fingerprinting algorithm on that clip or segment, and attempt to match the resulting fingerprint against an existing database of audio fingerprints. Such systems have difficulty distinguishing between versions of the same song (including different recordings, recording artists, song edits and cuts, etc.); lack robustness against noisy audio signals; identify incorrect songs when multiple songs share very similar audio characteristics (key, tempo, instruments, rhythm, etc.); and are unable to detect song endings and beginnings, that is, to detect song boundaries within a continuous audio stream.
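The record-fingerprint-match pipeline described above can be sketched as follows. This is a simplified, hypothetical illustration in the spirit of landmark-based fingerprinting (hashing pairs of spectral peaks and voting on time offsets); the function names, window sizes, peak counts, and fan-out value are illustrative assumptions, not any vendor's actual algorithm.

```python
import numpy as np

def spectrogram(signal, win=512, hop=256):
    """Magnitude spectrogram via a Hann-windowed short-time FFT."""
    window = np.hanning(win)
    frames = [np.abs(np.fft.rfft(signal[i:i + win] * window))
              for i in range(0, len(signal) - win, hop)]
    return np.array(frames)  # shape: (n_frames, win // 2 + 1)

def peak_landmarks(spec, n_peaks=5):
    """Keep the strongest frequency bins of each frame as landmarks."""
    landmarks = []
    for t, frame in enumerate(spec):
        for f in np.argsort(frame)[-n_peaks:]:
            landmarks.append((t, int(f)))
    return landmarks

def fingerprint(signal, fan_out=3):
    """Hash pairs of nearby spectral peaks into (hash, frame_offset) tuples."""
    marks = peak_landmarks(spectrogram(signal))
    hashes = []
    for i, (t1, f1) in enumerate(marks):
        for t2, f2 in marks[i + 1:i + 1 + fan_out]:
            dt = t2 - t1
            if 0 < dt <= 10:          # only pair peaks in later, nearby frames
                hashes.append(((f1, f2, dt), t1))
    return hashes

def match(query, db):
    """Vote for the song whose hashes align at a consistent time offset."""
    index = {}
    for song, hashes in db.items():
        for h, t in hashes:
            index.setdefault(h, []).append((song, t))
    votes = {}
    for h, t_q in query:
        for song, t_db in index.get(h, []):
            key = (song, t_db - t_q)   # consistent offset => same song position
            votes[key] = votes.get(key, 0) + 1
    if not votes:
        return None
    (song, _offset), _count = max(votes.items(), key=lambda kv: kv[1])
    return song
```

Note how this sketch makes the stated limitations concrete: it assumes a bounded clip to fingerprint, its hashes degrade under noise, near-identical spectra hash alike, and nothing in it models where one song ends and the next begins.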
When attempting to detect songs within a plurality of continuous audio streams, a single computer, server, or virtual machine will often exceed the memory capacity and/or allocation of the machine on which the detection system is operating. Additionally, if too many streams are aggregated at a single data center, temporal disruption will occur and/or streams will be dropped altogether, affecting the ability to monitor and detect songs in continuous audio streams, and the performance of other latency-dependent services at the data center will be compromised.
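To make the load-distribution concern concrete, one generic way to avoid concentrating all streams on one machine is to assign each stream to a worker node via a consistent-hash ring. The passage does not prescribe this technique; the sketch below is a hypothetical illustration, and the class and node names are invented for the example.

```python
import hashlib
from bisect import bisect

class StreamShard:
    """Consistent-hash ring that assigns audio streams to worker nodes.

    Each node is placed on the ring many times (virtual nodes), so streams
    spread roughly evenly and adding or removing a node relocates only a
    fraction of the streams rather than reshuffling all of them.
    """

    def __init__(self, nodes, vnodes=64):
        self.ring = sorted(
            (self._hash(f"{node}:{v}"), node)
            for node in nodes
            for v in range(vnodes)
        )
        self.keys = [k for k, _ in self.ring]

    @staticmethod
    def _hash(s):
        # Any stable hash works; MD5 is used here only for determinism.
        return int(hashlib.md5(s.encode()).hexdigest(), 16)

    def assign(self, stream_id):
        """Return the worker node responsible for this stream."""
        i = bisect(self.keys, self._hash(stream_id)) % len(self.ring)
        return self.ring[i][1]
```

Under this scheme, each machine ingests only its share of the continuous streams, which is one generic way to keep per-machine memory use bounded as the number of monitored streams grows.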