Splicing video and audio has been practiced for decades in the video broadcasting industry. Splicing allows advertisements, or other locality-specific content, to be inserted into the broadcast stream for consumption by a discrete (often regional) audience. With the advent of digital broadcast signals, such splicing has become more complex because digital video and audio are conventionally carried as separate and distinct streams prior to modulation for delivery to end users. Further, of the two types of streams (audio and video), audio is the more problematic, as discussed below.
Traditional digital audio splicing is done in the uncompressed domain. Thus, in order to complete an audio insertion process (also referred to as “Ad insertion”), the packets of the primary audio stream must be (1) compressed for transmission, (2) decompressed at the point where audio insertion of the second audio stream (“Ad audio stream”) occurs, and (3) re-encoded and re-compressed, and typically modulated, after Ad insertion for propagation of the resultant signal to one or more receiver (end-user) devices.
To avoid the inefficient and cumbersome steps recited above, a simple approach to audio splicing in the compressed domain was developed. The approach utilizes an audio decoder buffer, which is standard in most receiver devices. In the normal course, the audio decoder buffer receives the primary audio stream from a transcoder-multiplexer (head end equipment) (hereinafter “transmux”). The primary audio stream may be akin to a national feed for a television network, for example. After the transmux transmits the last frame of the primary audio stream needed prior to the Ad insertion, the transmux stops sending the primary stream. At that point the audio decoder buffer is allowed to underflow, providing a sufficient gap such that, when the transmux begins to transmit the Ad audio stream, the audio decoder buffer has sufficient temporary storage space to receive it without dropping any packets. The packets are then presented to the end user by the receiver device in accordance with the Presentation Time Stamp (“PTS”) carried in each Packetized Elementary Stream (“PES”) packet header, as is commonly known. After the last packet in the Ad audio stream is transmitted, the transmux begins transmitting the primary stream again.
The above approach, however, is prone to producing unpredictable results, such as audio distortion, due to the underflow in the audio decoder buffer. In order to avoid the underflow, the Ad audio stream must be transmitted to the audio decoder buffer before underflow occurs. However, the Ad audio stream is typically delivered by the Ad server too early relative to the splice time. As a result, the Ad audio stream reaches the audio decoder buffer before the normal arrival time of the last frame from the primary program, and too many Ad audio packets accumulate in the audio decoder buffer before the last frame from the primary program has been pulled out and presented to the end user. In this instance, the audio decoder buffer would not have enough space for the Ad audio stream and would overflow and begin dropping packets of the Ad audio stream. Then, when the Ad audio stream is presented to the end user in accordance with the PTS, the dropped packets will cause various types of undesirable audio distortion.
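The overflow mechanism described above can be illustrated with a minimal simulation sketch. All names and numbers here are assumptions chosen for illustration and are not taken from any particular system: a 3,584-byte decoder buffer, a constant 192 kb/s delivery rate (24 bytes per millisecond), and 768-byte audio frames presented every 32 milliseconds. The sketch models only the Ad audio stream arriving at an otherwise empty buffer; primary frames still waiting in the buffer would only make the overflow worse.

```python
def simulate_ad_delivery(lead_ms, capacity=3584, bytes_per_ms=24,
                         frame_bytes=768, frame_ms=32, duration_ms=1000):
    """Step a hypothetical decoder buffer in 1 ms ticks.

    Ad data starts arriving lead_ms before the first Ad frame is presented
    (t = 0). Returns (peak_fullness, dropped_bytes, underflows).
    """
    fullness = 0      # bytes currently held in the decoder buffer
    dropped = 0       # bytes that arrived while the buffer was full
    underflows = 0    # presentation instants with no complete frame available
    peak = 0
    for t in range(-lead_ms, duration_ms):
        # At each presentation instant the decoder pulls one whole frame.
        if t >= 0 and t % frame_ms == 0:
            if fullness >= frame_bytes:
                fullness -= frame_bytes
            else:
                underflows += 1
        # Bytes arriving during this millisecond; anything that does not fit is lost.
        accepted = min(bytes_per_ms, capacity - fullness)
        dropped += bytes_per_ms - accepted
        fullness += accepted
        peak = max(peak, fullness)
    return peak, dropped, underflows


on_time = simulate_ad_delivery(lead_ms=100)   # lead time within the buffer's capacity
too_early = simulate_ad_delivery(lead_ms=300) # lead time exceeding the buffer's capacity
print("100 ms lead:", on_time)
print("300 ms lead:", too_early)
```

Under these assumed values, a 100 millisecond lead never fills the buffer, while a 300 millisecond lead fills it well before the first presentation instant and every subsequent byte is dropped until the decoder starts removing frames.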
Unfortunately, this early delivery is quite common in conventional digital broadcast systems. DVS 380, a digital video/broadcast delivery standard, specifies that the Ad audio stream should arrive at the splicer 300 to 600 milliseconds earlier than the splice time (insertion time). The early audio stream delivery forces the Ad server to send the first several frames of the audio stream at a transmission bit rate lower than the normal audio transmission rate (i.e., at reduced bandwidth) in order to prevent the audio buffer from overflowing. However, this will only prevent the audio buffer from overflowing if the audio buffer is allowed to empty prior to the start of the Ad server audio stream. Unfortunately, an early delivery of 300 milliseconds (or more) is too large for a fixed bit rate audio stream if decoder buffer underflow is not allowed prior to the start of the Ad audio stream. For example, an audio stream transmitted at 192 kilobits per second corresponds to a maximum frame transmission delay of about 149 milliseconds, given an audio buffer size of 3,584 bytes. A frame transmission delay is defined as the time duration from the transmission time of the current frame to its presentation time. It is also equal to the decoder buffer delay as defined later. Thus, delivery of an Ad audio stream 300 milliseconds early is far above the maximum frame transmission delay and, as a result, the audio decoder buffer may overflow. Further, if the Ad audio stream is delayed by a fixed time to avoid the overflow, an underflow may occur because the start of the audio stream is transmitted too slowly. The above-described audio decoder buffer underflow and overflow result in audio distortion.
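The 149 millisecond figure follows directly from the buffer size and the bit rate: it is the time a constant 192 kb/s stream takes to fill a 3,584-byte buffer. The short sketch below reproduces that arithmetic and compares it with the 300 millisecond early delivery; the function names are hypothetical and introduced only for this illustration.

```python
def max_frame_transmission_delay_ms(buffer_bytes: int, bit_rate_bps: int) -> float:
    """Time (ms) for a constant-rate stream to fill a buffer of buffer_bytes."""
    return buffer_bytes * 8 / bit_rate_bps * 1000.0


def bytes_delivered_early(early_ms: int, bit_rate_bps: int) -> int:
    """Bytes that arrive in early_ms milliseconds at a constant bit_rate_bps."""
    return bit_rate_bps * early_ms // 8000


BUFFER_BYTES = 3584      # audio buffer size given in the text
BIT_RATE_BPS = 192_000   # 192 kilobits per second, as in the text

delay_ms = max_frame_transmission_delay_ms(BUFFER_BYTES, BIT_RATE_BPS)
early_bytes = bytes_delivered_early(300, BIT_RATE_BPS)

print(f"maximum frame transmission delay: {delay_ms:.1f} ms")  # 149.3 ms
print(f"bytes arriving 300 ms early: {early_bytes}")           # 7200
print(f"exceeds buffer: {early_bytes > BUFFER_BYTES}")         # True
```

At the full bit rate, 300 milliseconds of early delivery amounts to 7,200 bytes, roughly twice the 3,584-byte buffer, which is why the buffer may overflow unless the initial frames are sent more slowly.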
Thus, what is needed is a system and method to eliminate buffer underflow or overflow when an Ad audio stream is delivered at a variable time prior to the insertion time and/or at a variable bit rate.