Many client devices that consume online content employ an adaptive bitrate streaming protocol based on an open standard known as Dynamic Adaptive Streaming over HTTP (DASH) to request successive fragments of the content for decoding, rendering, and display. Dynamic DASH refers to the consumption of live streaming content. Dynamic manifest data is provided to the client in the form of one or more XML files that give the client the information it needs to generate properly formatted requests for the audio, video, and subtitle fragments of the content. The manifest data typically includes multiple options for video and audio streams, each including video and audio fragments at different resolutions, quality levels, bitrates, languages, etc.
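By way of illustration only, the following Python sketch shows how a client might enumerate the stream options carried in manifest data of this kind. The manifest shown is a hypothetical, heavily simplified example (the period, representation identifiers, and bitrates are assumptions made for illustration), and the helper name list_representations is likewise hypothetical; an actual dynamic manifest carries considerably more information, including the timing metadata discussed below.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily simplified manifest used only for illustration.
MPD_XML = """
<MPD xmlns="urn:mpeg:dash:schema:mpd:2011" type="dynamic">
  <Period id="p0">
    <AdaptationSet contentType="video">
      <Representation id="v-720p" bandwidth="3000000" width="1280" height="720"/>
      <Representation id="v-1080p" bandwidth="6000000" width="1920" height="1080"/>
    </AdaptationSet>
    <AdaptationSet contentType="audio" lang="en">
      <Representation id="a-en" bandwidth="128000"/>
    </AdaptationSet>
  </Period>
</MPD>
"""

# Namespace used by DASH media presentation description documents.
NS = {"mpd": "urn:mpeg:dash:schema:mpd:2011"}


def list_representations(mpd_xml):
    """Return (period id, content type, representation id, bandwidth) tuples."""
    root = ET.fromstring(mpd_xml.strip())
    options = []
    for period in root.findall("mpd:Period", NS):
        for adaptation_set in period.findall("mpd:AdaptationSet", NS):
            for rep in adaptation_set.findall("mpd:Representation", NS):
                options.append((
                    period.get("id"),
                    adaptation_set.get("contentType"),
                    rep.get("id"),
                    int(rep.get("bandwidth")),
                ))
    return options


if __name__ == "__main__":
    for option in list_representations(MPD_XML):
        print(option)
```

A client would typically select among such options based on available bandwidth and device capabilities before generating fragment requests.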
Live streaming content includes primary content that is generated in real time (e.g., live sporting events, live concerts, etc.) and often includes segments of secondary content (e.g., advertisements) that are dynamically inserted on top of the primary content. The secondary content is typically inserted in the place of so-called slates that are inserted (often manually and in real time) as placeholders in the primary content. For example, a slate might be inserted at the source of the live content (e.g., at a football stadium video capture booth) by an operator pushing a button when the slate should begin and releasing or pressing the button again when the slate should end, based on what is happening in real time at the event being broadcast (e.g., during a timeout on the field). Given the arbitrary nature of slate insertion, and that secondary content (e.g., ads) inserted in such slate periods originates from other sources (e.g., an ad exchange), it is typically the case that the inserted secondary content is not of the same duration as the slate it replaces. This may be understood with reference to FIGS. 1A and 1B.
The diagram in FIG. 1A illustrates the situation in which insertion of the secondary content (represented by video fragment V1 and audio fragment A1 of content period n) results in a gap between the end of the secondary content and the beginning of the next segment of primary content (represented by video fragment V2 and audio fragment A2 of period n+1). This gap is represented in the dynamic manifest data, which includes metadata for each fragment that specifies its presentation time relative to a media timeline associated with the overall media presentation. That is, for example, the presentation time of V2 in the manifest data is determined by the duration of the preceding slate into which the secondary content was inserted. However, because the inserted secondary content is shorter in duration than that slate, there is a corresponding gap between the end of V1 and the beginning of V2, as well as between the end of A1 and the beginning of A2. Further, note that because of differences between audio and video encoding techniques, corresponding fragments of content are not identical in length, as illustrated by the different ending and starting points of corresponding fragments V1 and A1, and V2 and A2, respectively. Media players employing the Dynamic DASH protocol are expected to handle such offsets.
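The relationship between fragment presentation times and the resulting discontinuity may be illustrated with the following Python sketch. The Fragment type, the discontinuity_ms helper, and all numeric values (a 30-second slate filled by 28 seconds of secondary content) are assumptions chosen for illustration and are not taken from any particular manifest.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Fragment:
    """Hypothetical per-fragment metadata, in milliseconds on the media timeline."""
    name: str
    start_ms: int      # presentation time from the manifest metadata
    duration_ms: int   # actual duration of the fragment's media

    @property
    def end_ms(self):
        return self.start_ms + self.duration_ms


def discontinuity_ms(prev, nxt):
    """Positive result: gap. Negative result: overlap. Zero: contiguous."""
    return nxt.start_ms - prev.end_ms


# Assumed example: a 30 s slate beginning at t = 0 is filled by a 28 s ad.
v1 = Fragment("V1", start_ms=0, duration_ms=28_000)
# V2's presentation time is fixed by the slate duration, not the ad duration.
v2 = Fragment("V2", start_ms=30_000, duration_ms=2_000)

if __name__ == "__main__":
    print(discontinuity_ms(v1, v2))  # 2000 ms gap between V1 and V2
```

If the inserted content were instead longer than the slate, the same calculation would yield a negative value, corresponding to an overlap rather than a gap.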
The video renderer of a media player relies on the presentation times in the manifest metadata, while the audio renderer does not. Therefore, when a media player encounters such a gap, the frames of the succeeding video fragment, e.g., V2, will not be displayed until their corresponding presentation times in the media timeline. This might show up on the display as a “freezing” of the video on the last frame of the preceding fragment, e.g., V1, or as presentation of a blank screen until the presentation time of V2 arrives. By contrast, audio renderers typically employ a “free fall” model that does not pay attention to the presentation times associated with audio samples, simply decoding and playing them back in sequence as they become available according to the audio encoding scheme's sample rate, bit rate, etc. The time stamps for video frames are used by the video renderer and are matched against the audio renderer's playhead to determine when to render, hold, or drop a video frame. But because rendering and playback of the samples of fragment A2 begins immediately following the last sample of fragment A1, audio fragment A2 is effectively shifted earlier in the media timeline, i.e., to the left in FIG. 1A by an amount represented by the duration of gap g, causing the audio to lead the video.
The diagram in FIG. 1B illustrates the situation in which insertion of the secondary content (represented by video fragment V3 and audio fragment A3 of period m) results in an overlap of the end of the secondary content and the beginning of the next segment of primary content (represented by video fragment V4 and audio fragment A4 of period m+1). In this example, because of the “free fall” model employed by the audio renderer, audio fragment A4 is effectively shifted to the right relative to video fragment V4 by the duration of overlap o, causing the audio to lag the video.
As will be appreciated with reference to these examples, if the media player is not equipped to handle these gaps or overlaps at the transitions between primary and secondary content, the synchronization between audio and video can be lost, and the loss of synchronization can potentially grow over time as the effects of successive gaps and overlaps accumulate.
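The accumulation of such offsets may be illustrated with the following Python sketch, which models a simplified "free fall" audio renderer that plays fragments back to back while video is assumed to follow the presentation times exactly. The helper name audio_offsets_ms and all fragment timings are hypothetical and chosen only for illustration; an actual media player involves buffering and renderer interactions that this model ignores.

```python
def audio_offsets_ms(fragments):
    """Model a free-fall audio renderer.

    fragments: list of (presentation_time_ms, duration_ms) in playback order.
    Returns, for each fragment, the actual start time minus the presentation
    time. Negative values mean the audio leads the video; positive values
    mean the audio lags the video.
    """
    offsets = []
    playhead = fragments[0][0]  # playback begins at the first presentation time
    for presentation_time, duration in fragments:
        offsets.append(playhead - presentation_time)
        playhead += duration    # the next fragment starts immediately after
    return offsets


# Assumed timeline: two 30 s slates, each filled by an ad that is too short.
two_gaps = [
    (0, 28_000),        # ad, 2 s shorter than its slate
    (30_000, 60_000),   # primary content
    (90_000, 29_000),   # ad, 1 s shorter than its slate
    (120_000, 60_000),  # primary content
]

# Assumed timeline: a 2 s gap followed by a 1.5 s overlap.
gap_then_overlap = [
    (0, 28_000),
    (30_000, 60_000),
    (90_000, 31_500),   # ad, 1.5 s longer than its slate
    (120_000, 60_000),
]

if __name__ == "__main__":
    print(audio_offsets_ms(two_gaps))          # [0, -2000, -2000, -3000]
    print(audio_offsets_ms(gap_then_overlap))  # [0, -2000, -2000, -500]
```

In the first timeline the audio lead grows from 2 seconds to 3 seconds after the second boundary, while in the second timeline the overlap partially offsets the earlier gap. In both cases the offset persists unless the media player corrects it.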
One approach to handling this is to simply flush the renderer stack each time a boundary between primary and secondary content is encountered. However, this is not an optimal solution because it can increase the chance of re-buffers as the renderer stacks are replenished. It also causes the media player to drift away from the live playhead of the primary content because of the additional time it takes to fill the renderer buffer after a flush. If the playback of the media player, i.e., the client playhead, lags too far behind the live playhead, this can result in a negative viewer experience. Another approach avoids the need to handle such discontinuities by using two media players, one to handle playback of the primary content and one to handle playback of the secondary content, and switching between the two. However, running two media players can be wasteful of processing resources and may be characterized by unacceptable latencies when switching between the two players.