There is a growing need for automatic recognition of broadcast signals such as videos, music or other audio or video signals generated from a variety of sources. Sources for the broadcast signals can include, but are not limited to, terrestrial radio, satellite radio, internet audio and video, cable television, terrestrial television broadcasts, and satellite television. Because of the growing number of broadcast media, owners of copyrighted works and advertisers are interested in obtaining data on how frequently their material is broadcast. Music tracking services provide playlists of major radio stations in large markets. Any sort of continual, real-time or near real-time recognition is inefficient and labor intensive when performed by humans. An automated method of monitoring large numbers of broadcast sources, such as radio stations and television stations, and recognizing the content of those broadcasts would thus provide significant benefit to copyright holders, advertisers, artists, and a variety of industries.
Traditionally, recognition of audio broadcasts, such as songs played on the radio, has been performed by matching radio stations and times at which songs were played with playlists provided either by the radio stations or by third party sources. This method is inherently limited to radio stations for which such information is available. Other methods rely on statistical sampling of broadcasts, the results of which are then used to estimate actual playlists for all broadcast stations. Still other methods rely on embedding inaudible codes within broadcast signals. The embedded signals are decoded at the receiver to extract identifying information about the broadcast signal. The disadvantage of this method is that special decoding devices are required to identify signals, and only those songs with embedded codes can be identified.
Copyright holders, such as for music or video content, are generally entitled to compensation for each instance that their song or video is played. For music copyright holders in particular, determining when their songs are played on any of thousands of radio stations, both over the air and now on the internet, is a daunting task. Traditionally, copyright holders have turned over collection of royalties in these circumstances to third party companies that charge a subscription fee to entities that play music for commercial purposes, in order to compensate their catalogue of copyright holders. These fees are then distributed to the copyright holders based on statistical models designed to compensate those copyright holders according to which songs are receiving the most play. These statistical methods have yielded only very rough estimates of actual playing instances, because they are based on small sample sizes.
Any large-scale recognition system requires content-based retrieval, in which an unidentified broadcast signal is compared with a database of known signals to identify similar or identical database signals. Content-based retrieval is different from existing audio retrieval by web search engines, in which only the metadata text surrounding or associated with audio files is searched. Also, while speech recognition is useful for converting voiced signals into text that can then be indexed and searched using well-known techniques, it is not applicable to the large majority of audio signals that contain music and sounds. Audio signals lack easily identifiable entities such as words that provide identifiers for searching and indexing. As such, current audio retrieval schemes index audio signals by computed perceptual characteristics that represent various qualities or features of the signal.
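By way of illustration only, the content-based retrieval described above can be sketched in a few lines of Python. The sketch below is a toy under stated assumptions and is not the method of any particular system: the two features (RMS energy and zero-crossing rate), the nearest-neighbor comparison, and all function and variable names (`features`, `build_index`, `identify`, `tone`) are hypothetical choices made for the example. It shows only the general idea of indexing known signals by computed characteristics and comparing an unidentified signal against that index.

```python
import math


def features(samples):
    """Compute two simple characteristics of a signal (illustrative only):
    RMS energy and zero-crossing rate."""
    n = len(samples)
    rms = math.sqrt(sum(s * s for s in samples) / n)
    # Count sign changes between consecutive samples.
    crossings = sum(
        1 for a, b in zip(samples, samples[1:]) if (a < 0) != (b < 0)
    )
    return (rms, crossings / (n - 1))


def build_index(known_signals):
    """Map each known signal's name to its feature vector."""
    return {name: features(sig) for name, sig in known_signals.items()}


def identify(index, unknown):
    """Return the name of the indexed signal whose features are closest
    (Euclidean distance) to those of the unidentified signal."""
    query = features(unknown)
    return min(index, key=lambda name: math.dist(index[name], query))


def tone(freq_hz, amplitude, rate=8000, seconds=0.25):
    """Generate a sine tone as a stand-in for real audio (assumption)."""
    n = int(rate * seconds)
    return [
        amplitude * math.sin(2 * math.pi * freq_hz * t / rate)
        for t in range(n)
    ]


# A tiny "database" of known signals, indexed by computed features.
database = {"low_tone": tone(220, 0.8), "high_tone": tone(1760, 0.3)}
index = build_index(database)

# An unidentified signal: a slightly attenuated copy of one known signal.
query = [0.9 * s for s in tone(1760, 0.3)]
match = identify(index, query)
print(match)
```

A practical system would use far more robust characteristics and an index structure suited to large databases; the two-feature comparison here serves only to make the retrieval step concrete.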
Further, existing large scale recognition systems are generally large scale only as measured by the size of the database of elements, songs for example, that have been characterized and can be matched against the incoming broadcast stream. They are not large scale from the standpoint of the number of broadcast streams that can be continually monitored or the number of simultaneous recognitions that can occur.
What is needed is a system and method for recognizing elements, either video or audio, simultaneously across a large number of broadcast media streams.