Applications for automated content recognition are experiencing considerable growth and are expected to continue to grow, fueled by demand from many new commercial opportunities including: interactive television applications providing contextually related content; targeted advertising; and tracking media consumption. To address this growth, there is a need for a comprehensive solution to the problem of creating a media database and identifying, within said database, a particular media segment in a manner that is tolerant of media content alterations, such as locally generated graphics within the client device altering the originally transmitted picture, or a user watching a standard definition broadcast while using the zoom or stretch mode of their HDTV. These alterations can occur due to user actions such as engaging an electronic program guide (EPG), requesting additional program information that then appears in a set-top-generated pop-up window, or selecting a non-standard video mode on a remote.
Automated content recognition systems typically ingest considerable quantities of data and often operate on continuous round-the-clock schedules. The amount of data consumed and managed by said systems qualifies them to be classified by the currently popular idiom of big-data systems. It is therefore imperative that said systems operate as efficiently as possible with regard to data processing and storage resources as well as data communications requirements. A fundamental means to increase operational efficiency while still achieving requisite accuracy is to utilize a method of generating a compressed representation of the data to be identified. Said compressed representations are often called fingerprints, which are generally associated with identifying data from the audio or video content. Although a diverse range of algorithms of varying complexity is used, most rely on a common set of basic principles which have several important properties, such as: the fingerprint should be much smaller than the original data; a group of fingerprints representing a media sequence or media segment should be unique such that said group can be identified in a large database of fingerprints; the original media content should not be able to be reconstructed, even in a degraded form, from a group of fingerprints; and the system should be able to identify copies of original media even when said copies are diminished or distorted intentionally or by any means of copying or otherwise reproducing said media. Examples of common media distortions include: scaling or cropping image data, such as changing from a high-definition video format to a standard definition format or vice versa; re-encoding the image or audio data to a lower quality level; or changing a frame rate of video. Other examples might include decoding digital media to an analog form and then digitally re-encoding said media.
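The properties listed above can be illustrated with a minimal sketch of a fingerprint lookup. It assumes that each fingerprint is a small hashable value and that a segment is recognized when enough of its fingerprints are found in the unknown sample, so that a sample with some distorted or missing fingerprints can still be identified. The names `build_index`, `identify` and `min_votes` are hypothetical and are not taken from any particular system.

```python
from collections import Counter, defaultdict
from typing import Dict, Hashable, Iterable, Optional, Set


def build_index(reference: Dict[str, Iterable[Hashable]]) -> Dict[Hashable, Set[str]]:
    """Map each fingerprint to the set of known segments that contain it."""
    index: Dict[Hashable, Set[str]] = defaultdict(set)
    for segment_id, fingerprints in reference.items():
        for fingerprint in fingerprints:
            index[fingerprint].add(segment_id)
    return index


def identify(index: Dict[Hashable, Set[str]],
             unknown: Iterable[Hashable],
             min_votes: int = 2) -> Optional[str]:
    """Return the segment sharing the most fingerprints with the unknown
    sample, or None when no segment reaches min_votes (assumed threshold)."""
    votes: Counter = Counter()
    for fingerprint in set(unknown):
        for segment_id in index.get(fingerprint, ()):
            votes[segment_id] += 1
    if not votes:
        return None
    segment_id, count = votes.most_common(1)[0]
    return segment_id if count >= min_votes else None


# Illustrative data only: integers stand in for real fingerprints.
reference = {"news": [11, 22, 33, 44], "ad": [55, 66, 77, 44]}
index = build_index(reference)
best = identify(index, [22, 33, 99, 44])  # one fingerprint (99) is unknown
```

In this sketch the unknown sample still resolves to the "news" segment even though one of its four fingerprints has no match, which is the kind of tolerance to distortion described above. A practical system would also need to account for the time ordering of fingerprints, which is omitted here.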
A useful example of a typical media fingerprint process can be illustrated by examining the popular mobile phone application (app) called ‘Shazam.’ The Shazam app and many similar apps are typically used to identify a song unknown to the user, particularly when heard in a public place such as a bar or restaurant. These apps sample audio from the microphone of a mobile device such as a smartphone or tablet and then generate what is known as a ‘fingerprint’ of the unknown audio to be identified. Said ‘fingerprint’ is generally constructed by detecting frequency events such as the center frequency of a particular sound event above the average of surrounding sounds. This type of acoustic event is called a ‘landmark’ in the Shazam patent U.S. Pat. No. 6,990,453. The system then proceeds to analyze the audio for another such event. When found, the first ‘landmark’ and the second ‘landmark’, along with the time interval separating them, are sent as a data unit called a ‘fingerprint’ to a remote processing means to be accumulated with additional ‘fingerprints’ for a period of time, usually twenty to thirty seconds. The series of ‘fingerprints’ is then used to search a reference database of known musical works, where said database was constructed by said fingerprinting means. The match result is then sent back to the mobile device and, when the match result is positive, identifies the unknown music playing at the location of the user.
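The landmark pairing described above can be sketched as follows. This is a simplified illustration and not the method of the cited patent: it assumes the audio has already been converted to a spectrogram (one list of frequency-bin magnitudes per time frame), treats the strongest bin of a frame as a landmark when it exceeds an assumed multiple of that frame's average, and measures the interval between landmarks in frames. The names `find_landmarks`, `pair_landmarks` and `ratio` are hypothetical.

```python
from typing import List, Sequence, Tuple

Landmark = Tuple[int, int]          # (frame index, frequency bin)
Fingerprint = Tuple[int, int, int]  # (first bin, second bin, interval in frames)


def find_landmarks(spectrogram: Sequence[Sequence[float]],
                   ratio: float = 2.0) -> List[Landmark]:
    """Keep the strongest bin of each frame when it stands out from the
    frame average by at least `ratio` (an assumed threshold)."""
    landmarks: List[Landmark] = []
    for frame_index, frame in enumerate(spectrogram):
        mean = sum(frame) / len(frame)
        peak_bin = max(range(len(frame)), key=frame.__getitem__)
        if mean > 0 and frame[peak_bin] >= ratio * mean:
            landmarks.append((frame_index, peak_bin))
    return landmarks


def pair_landmarks(landmarks: Sequence[Landmark]) -> List[Fingerprint]:
    """Combine each landmark with the next one and the interval between them."""
    return [(bin_a, bin_b, frame_b - frame_a)
            for (frame_a, bin_a), (frame_b, bin_b) in zip(landmarks, landmarks[1:])]


# Illustrative magnitudes only: four frames of four frequency bins each.
spectrogram = [
    [1, 1, 8, 1],  # clear peak in bin 2
    [1, 1, 1, 1],  # flat frame, no landmark
    [1, 9, 1, 1],  # clear peak in bin 1
    [7, 1, 1, 1],  # clear peak in bin 0
]
landmarks = find_landmarks(spectrogram)
fingerprints = pair_landmarks(landmarks)
```

Each resulting tuple carries two landmark frequencies and the interval separating them, which corresponds to the data unit that is accumulated and then used to search the reference database.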
Another service, called Viggle, identifies TV audio by means of a software app downloaded to the user's mobile device, which relays samples of audio from the user's listening location to a central server means for the purpose of identifying said audio by means of an audio matching system. The service provides means for users of the service to accumulate loyalty points upon identification of TV programs while said users watch said programs. The service user can later redeem said loyalty points for merchandise or services, similar to other consumer loyalty programs.
The identification of unknown television segments generally requires very different processes for the identification of video and for the identification of audio. This is due to the fact that video is presented in discrete frames and audio is played as a continuous signal. However, in spite of differences in presentation format, said video systems compress video segments to representative fingerprints and then search a database of known video fingerprints in order to identify said unknown segment, similar to the identification process of audio. Said video fingerprints can be generated by many means, but generally the primary function of fingerprint generation requires the identification of various video attributes, for example finding image boundaries such as light-to-dark edges in a video frame, or other patterns in the video that can be isolated and tagged and then grouped with similar events in adjacent video frames to form the video fingerprint.
In principle, systems that identify video segments should be built using the same processes to enroll known video segments into a reference database as used to process unknown video from a client means of a media matching service. However, using the example of a smart TV as said client means, several problems arise with sampling the video arriving at the television using the processing means of the smart TV. One such problem arises from the fact that the majority of television devices are connected to some form of set-top device. In the United States, 62% of households subscribe to cable television service, 27% subscribe to satellite TV and a growing number of TVs are fed from Internet-connected set-tops. Less than 10% of television receivers in the U.S. receive television signals from off-air sources. In the case of set-tops providing television signals to the television set, as opposed to viewing television from off-air transmissions via an antenna, the set-top will often overlay the received video picture with a locally generated graphic display, such as program information when a user presses an ‘Info’ button on the remote control. Similarly, when the user requests a program guide, the TV picture will typically be shrunk to a quarter-size or less and positioned in a corner of the display surrounded by the program guide grid. Likewise, alerts and other messages generated by a set-top can appear in windows overlaying the video program. Other forms of disruptive video distortion can occur when the user chooses a video zoom mode which magnifies the picture, or a stretch mode when the user is viewing a standard definition broadcast but wishes the 4:3 aspect ratio picture to fill a high-definition television 16:9 screen. In each of these cases, the video identification process will fail in matching the unknown video sampled from said set-top configurations.
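The effect of the stretch mode mentioned above can be made concrete with a short arithmetic sketch. It assumes that the picture height is unchanged and that the 4:3 picture would otherwise be shown centered at its native aspect ratio on the 16:9 screen; the function names and the 1920-pixel screen width are illustrative assumptions.

```python
from fractions import Fraction


def horizontal_stretch(source_aspect: Fraction, screen_aspect: Fraction) -> Fraction:
    """Factor by which horizontal distances grow when the picture is
    stretched to fill the screen width at unchanged height."""
    return screen_aspect / source_aspect


def stretched_column(column: int, screen_width: int, factor: Fraction) -> int:
    """Map a pixel column of the unstretched, centered picture to the
    column it occupies after stretching about the screen center."""
    center = Fraction(screen_width, 2)
    return int(round(center + (column - center) * factor))


factor = horizontal_stretch(Fraction(4, 3), Fraction(16, 9))  # equals 4/3

# On a 1920-pixel-wide screen the unstretched 4:3 picture spans columns 240 to 1680.
left_edge = stretched_column(240, 1920, factor)
right_edge = stretched_column(1680, 1920, factor)
```

Under these assumptions every horizontal distance from the screen center grows by one third, so any image feature located by its position in the frame moves away from where it sits in reference video that was enrolled without the stretch.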
Hence, existing automated content recognition systems that rely only on video identification will be interrupted when any of a number of common scenarios arises, as outlined above, in which the video program information is altered by an attached set-top device. Yet further problems arise with identifying video even when the video is not altered by a set-top device. For example, when a video picture fades to black, or even when the video image is portraying a very dark scene, prior art video identification systems can lose the ability to identify the unknown video segment.
Interestingly, the audio signal of a television program is almost never altered but is conveyed to the television system as received by a set-top device attached to said TV. In all of the above examples of graphics overlays, fades to black or dark video scenes, the program audio will continue to play, usually unaltered, and hence be available for reliable program segment identification by means of a suitable automated content recognition system for audio signals. Hence, there is a clear need for an automated content recognition system that utilizes audio identification, either alone or in addition to identifying video, for the purposes of identifying unknown television program segments. However, the technology employed by the above-mentioned music identification systems, such as Shazam, is not generally suited for identification of continuous content such as a television program. These mobile phone music identification apps are typically designed to process audio from a microphone exposed to open air, which also imports significant room noise interference such as that found in a noisy restaurant or bar. Also, the mode of operation of these above-mentioned audio identification applications is typically based on presumptive ad hoc usage and is not designed for continuous automated content recognition. Hence, because of the many technical challenges of identifying audio from high-interference sources, the technical architecture of ad hoc music ID programs is not suitable for continuous identification of audio. Said systems would suffer further from operating not only continuously but also with very large numbers of simultaneous devices, such as a national or even regional population of television set-tops or smart TVs.
Many uses exist for identifying television programming as it is displayed on a television receiver. Examples include interactive television applications where a viewer is supplied with information supplemental to the currently displayed TV program, often in the form of a pop-up window on the same TV display from which media is identified or on a secondary display of a device such as a smartphone or tablet. Such contextually related information usually requires synchronization with the primary programming currently being viewed. Another application of detecting television programming is advertisement substitution, also known as targeted advertising. Yet another use exists for media census, such as audience measurement of one or more television programs. All of these uses, and others not mentioned, benefit from timely detection of unknown program segments. Hence, continuous audio identification, alone or in concert with video identification, can provide or enhance the reliability and consistency of an automated content recognition system.