Digital watermarking is a type of signal processing in which auxiliary message signals are encoded in host content, such as image, audio or video signals, in a manner that is imperceptible to humans when the content is rendered. It is used for a variety of applications, including broadcast monitoring, device control, asset management, audience measurement, forensic tracking, and automatic content recognition. In general, a watermarking system comprises an encoder (the embedder) and a compatible decoder (often referred to as a detector, reader or extractor). The encoder transforms a host audio-visual signal to embed an auxiliary signal, whereas the decoder transforms this audio-visual signal to extract the auxiliary signal. The primary technical challenges arise from design constraints posed by real-world usage scenarios. These constraints include computational complexity, power consumption, survivability, granularity, retrievability, subjective quality, and data capacity per spatial or temporal unit of the host audio-visual signal.
Despite the level of sophistication that commercial watermarking technologies have attained, the increasing complexity of audio-visual content production and distribution, combined with more challenging use cases, continues to present significant technical challenges. Distribution of content is increasingly “non-linear,” meaning that audio-visual signals are distributed and then redistributed within the supply chain among intermediaries and consumers through a myriad of different wired and wireless transmission channels and storage media, and consumed on a variety of rendering devices. In such an environment, audio and visual signals undergo various transformations that watermark signals must survive, including format conversions, transcoding with various compression codecs and bitrates, geometric and temporal distortions of various kinds, layering of watermark signals, and mixing with other watermarked or un-watermarked content.
Encoding of watermarks at various points in the distribution path benefits from a scheme for orchestrating encoding of watermarks to avoid collision with previously embedded watermark layers. Orchestrated encoding may be implemented, for example, by including a decoder as a pre-process within an encoder to detect a previously embedded watermark layer and execute a strategy to minimize collision with it. For more background, please see our U.S. Pat. Nos. 8,548,810 and 7,020,304, which are hereby incorporated by reference.
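The orchestration just described, in which a decoder runs as a pre-process before embedding, can be sketched as follows. This is a minimal illustration, not a defined implementation: `detect` and `embed` are hypothetical callables standing in for a concrete watermark decoder and encoder, and the reduced-strength response is only one of the collision-avoidance strategies mentioned above.

```python
def orchestrated_embed(host_signal, new_payload, detect, embed):
    """Orchestrated encoding sketch: run a detector as a pre-process,
    then choose an embedding strategy that avoids colliding with any
    previously embedded watermark layer.

    `detect` returns a prior payload, or None if no layer is found.
    `embed` applies the new payload at a given embedding strength.
    """
    existing = detect(host_signal)
    if existing is None:
        # No prior layer detected: embed at full strength.
        return embed(host_signal, new_payload, strength=1.0)
    # Prior layer present: embed at reduced strength (one possible
    # strategy; others shift the new layer in time or frequency
    # away from the detected layer).
    return embed(host_signal, new_payload, strength=0.5)
```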
While such orchestration is effective in some cases, it is not always possible for a variety of reasons. As such, watermarks need to be designed to withstand overlaying of different watermarks. Additionally, they need to be designed to be layered or co-exist with other watermarks without exceeding limits on perceptual quality.
When multiple watermark layers are potentially present in content, it is more challenging to design encoders and decoders that satisfy the above-mentioned constraints. Both encoding and decoding speed can suffer, as encoding becomes more complex and the presence of multiple watermark layers may make reliable decoding more difficult. Relatedly, as computational complexity increases, so does power consumption, which is particularly problematic in battery-powered devices. Data capacity can also suffer, as there is less available channel bandwidth for watermark layers within the host audio-visual signal. Reliability can decrease, as the presence of potentially conflicting signals may lead to increases in false positives or false negatives.
The challenges are further compounded in use cases where there are stringent requirements for encoding and decoding speed. Both encoding and decoding speed are dictated by real-time processing requirements or constraints defined in terms of desired responsiveness or interactivity of the system. For example, encoding often must be performed within time constraints established by other operations of the system, such as timing requirements for transmission of content. Time consumed for encoding must be within latency limits, such as the frame rate of an audio-visual signal. Another example with stringent time constraints is encoding of live events, in which encoding is performed on an audio signal captured at a live event and then played to an audience. See U.S. Patent Application Publication 20150016661, which is hereby incorporated by reference. Another example is encoding and decoding within the time constraints of a live distribution stream, namely, as the stream is being delivered, including terrestrial broadcast, cable/satellite networks, IP (managed or open) networks, and mobile networks, or within re-distribution in consumer applications (e.g., AirPlay, WiDi, Chromecast, etc.).
The mixing of watermarks presents additional challenges in the encoder and decoder. One challenge is the ability to reliably and precisely detect a boundary between different watermarks, as well as boundaries between watermarked and un-watermarked signals. In some measurement and automatic recognition applications, it is required that the boundary between different programs be detected with a precision of under 1 second, and the processing time required to report the boundary may also be constrained to a few seconds (e.g., to synchronize functions and/or support interactivity within a time period shortly after the boundary occurs during playback). These types of boundaries arise at transitions among different audio-visual programs, such as advertisements and shows, for example, as well as within programs, such as the case for product placement, scene changes, or interactive game play synchronized to events within a program. Due to mixing of watermarked and un-watermarked content and watermark layering, each program may carry a different watermark, multiple watermarks, or none at all. It is not sufficient to merely report detection time of a watermark. Demands for precise measurement and interactivity (e.g., synchronizing an audio or video stream with other events) require more accurate localization of watermark boundaries. See, for example, U.S. Patent Application Publications 20100322469, 20140285338, and 20150168538, which are hereby incorporated by reference and which describe techniques for synchronization and localization of watermarks within host content.
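One simple way to picture the boundary-localization problem above is as a scan over successive detection windows, reporting the times at which the decoded payload changes. The sketch below is illustrative only; the function name, the window step, and the representation of detection results are assumptions, and a practical detector must also contend with noisy or missed detections near the boundary.

```python
def locate_boundaries(window_payloads, window_step_s=0.5):
    """Given payloads decoded from successive detection windows
    (None where no watermark was found), report the times at which
    the payload changes, i.e., candidate program boundaries.

    A window step of 0.5 s keeps localization within the sub-second
    precision discussed above. Boundaries include transitions between
    two different watermarks and between watermarked and
    un-watermarked content.
    """
    boundaries = []
    for i in range(1, len(window_payloads)):
        if window_payloads[i] != window_payloads[i - 1]:
            boundaries.append(i * window_step_s)
    return boundaries
```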
In some usage scenarios, mixing of watermark layers occurs through orchestrated or un-orchestrated layering of watermark signals within content as it moves through distribution. In others, design constraints dictate that a watermark be replaced by another watermark. One strategy is to overwrite an existing watermark without regard to pre-existing watermarks. Another strategy is to decode a pre-existing watermark and re-encode it with a new payload. Another strategy is to decode a pre-existing watermark, and seek to layer a subsequent watermark in the host content so as to minimize collision between the layers.
Another strategy is to reverse or partially reverse a pre-existing watermark. Reversal of a watermark is difficult in most practical use cases of robust watermarking because the watermarked audio-visual signal is typically altered through lossy compression and formatting operations that occur in distribution, which alters the watermark signal and its relationship with the host audio-visual content. If it can be achieved reliably, partial reversal of a pre-existing watermark frees additional bandwidth for further watermark layers and enables the total distortion of the audio-visual content due to watermark insertion to be maintained within subjective quality constraints, as determined through the use of a perceptual model. Even partial reversal is particularly challenging because it requires precise localization of a watermark as well as accurate prediction of its amplitude. Replacement also creates a need for real-time authorization of the replacement function, so that only authorized embedders can modify a pre-existing watermark layer.
As noted, an application of digital watermarking is to use the encoded payload to synchronize processes with the watermarked content. This application space encompasses end user applications, where entertainment experiences are synchronized with watermarked content, as well as business applications, such as monitoring and measurement of content exposure and use.
When connected with an automatic content recognition (ACR) computing service, the user's mobile device can enhance the user's experience of content by identifying the content and providing access to a variety of related services.
Digital watermarking identifies entertainment content, including radio programs, TV shows, movies and songs, by embedding digital payloads throughout the content. It enables recognition triggered services to be delivered on an un-tethered mobile device as it samples signals from its environment through its sensors.
Media synchronization of live broadcast is needed to provide a timely payoff in broadcast monitoring applications, in second screen applications, as well as in interactive content applications. In this context, the payoff is an action that is executed to coincide with a particular time point within the entertainment content. This may be rendering of secondary content, synchronized to the rendering of the entertainment content, or another function to be executed responsive to a particular event or point in time relative to the timeline of the rendering of the entertainment content.
This specification presents approaches for achieving media synchronization at the listening device by building an explicit content timeline based on timing marks embedded in the content, or at a resolver service (e.g., a software-implemented service executing on one or more servers in the cloud) based on a predetermined timeline. Also, it presents approaches for refining the timeline estimation. The resolver service executes on a server that the listening device accesses via a network connection. The listening device provides payloads and other context information to the resolver service, such as device identifier, attributes, time stamps (e.g., output by a local clock on the listening device marking time of content capture and/or time stamps extracted from sensed content) and device location (GPS, venue, theater, outdoor event location). The resolver service uses this information to determine the response to provide back to the listening device. This may be secondary content for the user's device to render, or a pointer to and/or instructions on rendering secondary content. The user device renders the secondary content, e.g., in synchronization with sensed content or in synchronization with rendering on other user devices being exposed to the same sensed content (e.g., at a theater, venue, outdoor event where users are exposed to and sense the same content).
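The payload-and-context exchange with the resolver service can be pictured with a minimal sketch. This is an assumption-laden illustration, not a defined protocol: the in-memory table, the function name `resolve`, and all field names are hypothetical stand-ins for a networked service and its response schema.

```python
import time

# Hypothetical in-memory stand-in for the resolver service's database,
# mapping extracted watermark payloads to secondary-content responses.
RESOLVER_DB = {
    "payload-123": {"action": "render", "content_url": "https://example.com/secondary"},
}

def resolve(payload, device_id, capture_ts=None, location=None):
    """Sketch of a resolver lookup: the listening device submits the
    decoded payload plus context (device identifier, capture time
    stamp, location) and receives instructions for rendering
    secondary content, or a no-op response if the payload is unknown.
    """
    response = RESOLVER_DB.get(payload)
    if response is None:
        return {"action": "none"}
    return {**response,
            "device_id": device_id,
            "resolved_at": capture_ts if capture_ts is not None else time.time(),
            "location": location}
```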
The embedded timing marks can be sequential payloads that repeat at regular intervals of time, or they can be a single payload repeating at a predetermined sequence of varying intervals of time (known to the application or to the resolver service). Along with the timing payloads, the content may also be embedded with content-identifying payloads. The listening devices and/or the resolver service use the identifying payloads combined with the content timeline to identify content and localize the content's events and to enable recognition triggered services at the listening devices.
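For the second variant above, where a single payload repeats at a predetermined sequence of varying intervals, the position on the content timeline can be inferred by matching the observed gaps between detections against the known interval sequence. The sketch below assumes noise-free detections and an unambiguous match; names and the tolerance value are illustrative.

```python
def locate_in_interval_sequence(observed_gaps, known_intervals, tol=0.05):
    """Match the gaps observed between successive detections of a
    repeating payload against the predetermined interval sequence
    (known to the application or resolver service). Returns the
    content time at the last detection, or None if no match is found.
    """
    n = len(observed_gaps)
    for start in range(len(known_intervals) - n + 1):
        if all(abs(known_intervals[start + k] - observed_gaps[k]) <= tol
               for k in range(n)):
            # Content time at the last detection equals the cumulative
            # interval time up to the end of the matched run of gaps.
            return sum(known_intervals[:start + n])
    return None
```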
Some use cases require recognition triggered services to be delivered to multiple listening devices simultaneously. In this case, the devices are connected to the resolver service. The resolver service uses the timing marks detected by the different listening devices to build a tight estimate of the content timeline and to synchronize the delivery of the recognition triggered services to the listening devices.
Timeline Reconstruction Using Dynamic Path Estimation
This specification also presents technology for constructing a program timeline in real time based on in-line or ambient detection of watermarks in host audio-video content. Audio-video content comprises audio signals, video signals, and content with both audio and video signals, such as movies, TV and like audio-visual sequences with video frames and corresponding audio frames formatted and rendered to be output in synchronization with each other. The timeline construction applies to audio-video content where audio or video watermarks or fingerprints are detected and/or matched with a database of same to provide at least coarse program timing information.
In order for watermarks to survive degradation from transformations in distribution, watermark embedders redundantly encode a watermark signal (the watermark payload) over space and time. This redundancy sacrifices time resolution for improved robustness. Though the embedder encodes a watermark payload at a fine time resolution, degradation can render instances of the payload undetectable. Yet over time, the detector aggregates detection results and provides a reliable output of the payload. However, the time resolution afforded by an individual instance is lost due to mis-detections and the need to aggregate detection over a longer time window.
To illustrate, consider an example in which a watermark payload of 64 bits is encoded in a duration of audio of 128 milliseconds. For ambient detection, the watermark must be reliably detectable, and thus, sufficiently robust to survive through digital to analog conversion, ambient transmission, detection by microphone, and analog to digital conversion. As such, the watermark payload is repeated over a longer duration of audio, such as three to six seconds. The parameters of payload size, repetition, and duration of audio vary with application requirements, yet this example illustrates that the redundant encoding of the payload provides robustness at the expense of timing resolution. The same is true for payloads redundantly encoded in a sequence of video frames.
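The aggregation of repeated payload instances can be illustrated with a bit-wise majority vote over the detections collected within the longer window. This is a simplified sketch of the idea only; practical detectors combine soft decision metrics and error correction rather than a bare majority vote, and the function name and data representation here are assumptions.

```python
def aggregate_payload(detections):
    """Bit-wise majority vote over repeated payload detections.

    Redundant repetition (e.g., a 64-bit payload embedded every
    128 ms and repeated over three to six seconds) lets the detector
    recover the payload even when individual instances are corrupted.
    `detections` is a list of equal-length bit lists, one per
    detected payload instance.
    """
    n_bits = len(detections[0])
    # A bit is decided as 1 when more than half of the instances agree.
    return [1 if sum(d[b] for d in detections) * 2 > len(detections) else 0
            for b in range(n_bits)]
```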
One aspect of the invention is a method of timeline reconstruction. A dynamic path estimation method constructs a program timeline in real time from an incoming stream of audio or visual content in which watermark payloads are redundantly encoded. A receiving device buffers a portion of the incoming signal, executes watermark detection on the contents of the buffer, presents a detection result, and then advances the incoming signal in the buffer by a step (referred to as read frequency). Each detection result corresponds to different possible detection paths, as the detector does not reveal the precise position of the watermark payload. The detection paths, in turn, correspond to possible program times. The dynamic path estimation method operates on the detection results to determine a global cost function for each possible detection path. As the incoming audio advances through a detection buffer, the method updates cost values for the possible paths, determines a global cost for the paths, and outputs a program time based on the path of the lowest global cost. The method outputs a program time at each advance of the incoming signal in the buffer, and as such, provides program timeline granularity at finer resolution than the time length of the content segment in the buffer. Further, it provides this program timeline in real time, as the receiving device receives the incoming signal from ambient exposure or in-line reception. This performance capability enables applications that synchronize the rendering of auxiliary content with the incoming signal. It also enables real time tracking of the identity and duration of content that an audience or particular consumer is exposed to.
Further features will be described with reference to the following detailed description and accompanying drawings.