When delivering audio and video content over a transmission channel with either fixed or variable bit rate, one goal is to ensure audio-video synchronization and to enable advanced use cases such as splicing.
Audio and video synchronization and alignment have always been crucial when building audio-video systems. Normally, audio and video codecs do not use the same frame duration. For this reason, the frames of today's audio codecs are not aligned with video frames. This is also true for the widely used AAC family. The following example is based on the DVB standard, where a frame size of 1024 samples and a sampling frequency of 48 kHz are used. This leads to audio frames with a duration of
$$\frac{1024\ \text{samples}}{48000\ \text{Hz}} \approx 0.0213\ \text{sec}.$$
In contrast, the common DVB refresh rate for video is either 25 Hz or 50 Hz, which leads to video frame durations of 0.04 sec or 0.02 sec, respectively.
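The mismatch can be checked with exact rational arithmetic. The following is a minimal sketch, not part of any codec or standard: the helper names `frame_duration` and `common_period` are hypothetical, and only the numbers (1024 samples, 48 kHz, 25 Hz, 50 Hz) come from the text above. It computes the shortest interval after which an audio frame boundary and a video frame boundary coincide again.

```python
from fractions import Fraction
from math import gcd


def frame_duration(samples_per_frame: int, sample_rate_hz: int) -> Fraction:
    """Exact duration of one audio frame in seconds."""
    return Fraction(samples_per_frame, sample_rate_hz)


def common_period(a: Fraction, b: Fraction) -> Fraction:
    """Smallest positive duration that is a whole multiple of both a and b.

    For fractions in lowest terms: lcm(numerators) / gcd(denominators).
    """
    num_lcm = a.numerator * b.numerator // gcd(a.numerator, b.numerator)
    return Fraction(num_lcm, gcd(a.denominator, b.denominator))


AUDIO = frame_duration(1024, 48000)  # 8/375 s, about 0.0213 s

for video_rate_hz in (25, 50):
    video = Fraction(1, video_rate_hz)
    period = common_period(AUDIO, video)
    print(
        f"{video_rate_hz} Hz video: boundaries coincide every {float(period)} s "
        f"({period / AUDIO} audio frames, {period / video} video frames)"
    )
```

Under these figures the boundaries line up only every 0.32 seconds (15 audio frames against 8 video frames at 25 Hz, or 16 video frames at 50 Hz); at every other video frame boundary an audio frame is still in progress.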
Especially when the configuration of the audio stream or the program changes, the video and audio need to be aligned again. Today's systems change the audio configuration slightly before or after the corresponding video, because human beings are not able to recognize small differences in audio and video synchronization.
Unfortunately, this increases the complexity of splicing, where a national advertisement gets replaced by a local one, since the replaced video stream also has to begin with this small offset. In addition, new standards are asking for more accurate video and audio synchronization to improve the overall user experience.
Therefore, recent audio codecs can deal with a wide range of possible frame sizes to match the video frame size. The problem here is that this, besides solving the alignment problem, has a big impact on coding efficiency and performance.
Streaming in broadcast environments imposes special problems.
Recent developments have shown that “adaptive” streaming is being considered as a transport layer even for linear broadcast. Adaptive streaming has been optimized to meet all requirements, which differ slightly between over-the-top and over-the-air applications. Here we will focus on one concrete adaptive streaming technology, but all given examples will also work for other file-based technologies like MMT.
FIG. 7 shows a proposal for the ATSC 3.0 standard, which is currently under development. In this proposal, an optimized version of MPEG-DASH is considered for use over a fixed-rate broadcast channel. Since DASH was designed for a variable-rate unicast channel, such as LTE, 3G or broadband Internet, some adjustments were needed, which are covered by the proposal. The main difference from the regular DASH use case is that the receiver of a broadcast channel has no backchannel and receives a broadcast rather than a unicast stream. Normally, the client can extract the location of the initialization segment after receiving and parsing the MPD. After that, the client is able to decode one segment after the other or can seek to a given timestamp. As shown in FIG. 7, this approach is not possible at all in a broadcast environment. Instead, the MPD and the initialization segment(s) are repeated on a regular basis. The receiver is then able to tune in as soon as it receives the MPD and all needed initialization segments.
This involves a tradeoff between short tune-in time and small overhead. For a regular broadcaster, a segment length of approx. 1 second seems to be feasible. This means that between two MPDs there is one audio segment and one video segment (if the program contains only audio and video), both with a length of approx. one second.
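The tradeoff can be illustrated with a simple model. This is only a sketch under stated assumptions: the function `tune_in_tradeoff` is hypothetical, the signaling size (4000 bytes for the MPD plus initialization segments) and the channel rate (2 Mbit/s) are made-up illustration values that do not come from the proposal, and the model assumes the signaling is sent exactly once per segment.

```python
def tune_in_tradeoff(segment_s: float, signaling_bytes: int, channel_bps: int):
    """Return (worst_case_wait_s, overhead_fraction) for one repetition interval.

    Assumed model: the MPD and initialization segments are repeated once per
    segment, so a receiver that just missed them waits one full segment.
    """
    worst_case_wait_s = segment_s
    overhead_fraction = (signaling_bytes * 8) / (channel_bps * segment_s)
    return worst_case_wait_s, overhead_fraction


# Hypothetical values chosen only for illustration.
SIGNALING_BYTES = 4000
CHANNEL_BPS = 2_000_000

for segment_s in (0.5, 1.0, 2.0):
    wait_s, overhead = tune_in_tradeoff(segment_s, SIGNALING_BYTES, CHANNEL_BPS)
    print(f"segment {segment_s} s: wait up to {wait_s} s, overhead {overhead:.2%}")
```

In this model, halving the segment length halves the worst-case wait but doubles the share of the channel spent on repeated signaling.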
The audio and video alignment issue described above also applies when using DASH. In addition, audio segments have to be slightly longer or shorter than the video segments to maintain audio and video alignment. This is shown in FIG. 8.
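The varying audio segment length can be sketched numerically. The example below assumes the 1024-sample, 48 kHz frames from the earlier example and a nominal segment length of exactly one second. The function `audio_segment_plan` and its policy (end each audio segment at the last audio frame boundary that does not pass the video segment boundary) are assumptions made for illustration, not the scheme of FIG. 8.

```python
from fractions import Fraction

AUDIO_FRAME = Fraction(1024, 48000)  # seconds per audio frame
SEGMENT = Fraction(1)                # nominal segment length in seconds


def audio_segment_plan(num_segments: int):
    """Return a list of (frames_in_segment, drift_seconds) per segment.

    drift_seconds is the audio segment end minus the video segment end,
    so a negative value means the audio segment ends early.
    """
    plan = []
    frames_so_far = 0
    for k in range(1, num_segments + 1):
        video_end = k * SEGMENT
        # Whole audio frames that fit before the video segment boundary.
        total_frames = video_end // AUDIO_FRAME
        frames = total_frames - frames_so_far
        frames_so_far = total_frames
        drift = frames_so_far * AUDIO_FRAME - video_end
        plan.append((frames, drift))
    return plan


for index, (frames, drift) in enumerate(audio_segment_plan(8), start=1):
    print(f"segment {index}: {frames} audio frames, drift {float(drift) * 1000:.2f} ms")
```

With these numbers one second holds 46.875 audio frames, so the segments contain either 46 frames (about 0.981 s) or 47 frames (about 1.003 s), and the drift between audio and video segment boundaries returns to zero only every eighth segment.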
If an audio or video configuration change is triggered, this change has to happen at a segment boundary, since there is no other way to transmit an updated initialization segment. For that, video and audio are padded (with either black frames or silence) to fill a full segment. However, this does not solve the issue of misalignment of video and audio. For splicing and program changes, there can be a small audio and video mismatch, depending on the current segment duration drift.