Digital compressed video based on the Moving Picture Experts Group (MPEG) standards has become the predominant choice for delivering and storing video content. The International Organization for Standardization (ISO)/International Electrotechnical Commission (IEC) 13818 suite specifies the widely deployed MPEG-2 standard, while ISO/IEC 14496 specifies the increasingly popular MPEG-4 standard, which provides much improved coding efficiency and enhanced error robustness. In general, MPEG distinguishes between the compression layer, responsible for coding the raw video and associated audio signals, and the systems layer, responsible for the carriage, synchronization and timing of multiple such compressed signals. It is also common to find MPEG-4 compressed signals carried on top of the MPEG-2 systems layer.
In a typical MPEG-2 encoder, the compression layer receives a periodic sequence of frames of uncompressed digital video and converts it into a video “elementary stream” of compressed frames. Most commonly, the “frame period” is chosen to be approximately 33 milliseconds to generate a presentation sequence of 30 video frames per second. The elementary stream retains the same fixed frame rate as the input sequence; however, the frame sizes (in bits) may vary widely.
Broadly speaking, MPEG-2 compression proceeds in two stages, the first in the pixel domain and the second in the frequency domain. In the first stage, a technique referred to as “motion compensation” is applied to the uncompressed video frames in order to remove temporal redundancy across frames. Specifically, the frames are classified into three types: intra-coded frames (I frames), forward predicted frames (P frames), and bi-directionally predicted frames (B frames). An I frame is self-contained and may be decoded independently by the receiver, whereas P and B frames only contain differential information with respect to other frames. Consequently, the latter are expected to be significantly smaller in size than I frames. In the next stage, blocks of pixels of each frame are converted to the “frequency domain” using the Discrete Cosine Transform (DCT) operation in order to remove the spatial redundancy within the frame. The DCT is an extremely compute-intensive step, as it involves matrix multiplication operations. The result of this step is a set of DCT coefficients for each block of pixels. To further reduce the number of bits required to represent each frame, the coefficients are quantized (by dividing each value by a pre-determined scale, with finer-grained quantization corresponding to higher video quality) and the quantized set is subjected to “run length” coding and “entropy” coding operations. Relatively speaking, the quantization and run length coding operations are less compute-intensive. The final output of the compression layer is a periodic sequence of variable-sized I, P and B frames, efficiently encoded in the frequency domain and represented using the syntax specified by the MPEG-2 standard.
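The second stage described above can be illustrated with a minimal sketch. It is not the MPEG-2 reference procedure: it assumes 8x8 blocks, a single uniform quantization scale (rather than a quantization matrix), and a row-major scan (rather than a zigzag scan), and the function names dct_2d, quantize and run_length are hypothetical. The DCT is written in its direct, unoptimized form to make the multiply-accumulate cost visible.

```python
import math

N = 8  # block dimension assumed for this sketch

def dct_2d(block):
    """Orthonormal 2-D DCT-II of an N x N block (direct, unoptimized form)."""
    def c(k):
        return math.sqrt(1.0 / N) if k == 0 else math.sqrt(2.0 / N)
    out = [[0.0] * N for _ in range(N)]
    for u in range(N):
        for v in range(N):
            s = 0.0
            for x in range(N):
                for y in range(N):
                    s += (block[x][y]
                          * math.cos((2 * x + 1) * u * math.pi / (2 * N))
                          * math.cos((2 * y + 1) * v * math.pi / (2 * N)))
            out[u][v] = c(u) * c(v) * s
    return out

def quantize(coeffs, scale):
    """Divide every coefficient by one scale and round (a simplifying assumption)."""
    return [[round(value / scale) for value in row] for row in coeffs]

def run_length(values):
    """Encode a flat list as (zero_run, nonzero_value) pairs; trailing zeros are dropped."""
    pairs, run = [], 0
    for v in values:
        if v == 0:
            run += 1
        else:
            pairs.append((run, v))
            run = 0
    return pairs

# A flat gray block has no spatial detail, so only the DC coefficient survives.
flat_block = [[128] * N for _ in range(N)]
quantized = quantize(dct_2d(flat_block), scale=16)
symbols = run_length([v for row in quantized for v in row])
```

For the flat block, the 64 pixel values collapse to a single run-length symbol, which is the kind of spatial redundancy the frequency-domain stage is meant to remove. A larger scale zeroes more coefficients and shortens the symbol list further, at the cost of quality.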
Separate compression layers are responsible for generating one or more (e.g., one per dubbed language) corresponding audio elementary streams. In addition, a video program may also contain one or more data elementary streams, e.g., to carry program-specific information.
The MPEG-2 systems layer multiplexes several compressed elementary streams (e.g., audio, video and data streams), belonging to one or more video programs, into a single “transport stream” (TS), suitable for storage and network transmission of the program or programs. In addition to multiplexing, the systems layer performs several roles including packetization of the compressed signals, clocking, stream synchronization, and timing control.
The MPEG-2 encoder communicates a time-base, referred to as a Program Clock Reference (PCR), to the receiver via a field in the TS packet header. Not all TS packets carry a PCR value; however, we may assume an implicit PCR associated with each packet. A PCR value denotes the relative departure time of the packet at the sender. The systems layer assumes a constant-delay transmission network and relies on independent means to compensate for delay jitter in the network, if any. Consequently, the PCR also denotes the relative arrival time of the packet at the receiver. As MPEG-2 is primarily designed for an open-loop network, the sequence of incoming PCR values is used by the receiver to lock its clock to that of the sender, so as to maintain the same frame period as at the input to the encoder, and thereby also to avoid underflow and overflow with respect to the incoming stream. In order to control and synchronize (e.g., to maintain lip sync) the presentation time of each audio and video frame in the multiplex, the encoder communicates a Presentation Time-Stamp (PTS) with each frame. In addition, to provide for correct decoding of bi-directionally predicted frames, the systems layer at the encoder sends frames in decoding order (as opposed to presentation order), and communicates a Decode Time-Stamp (DTS) with each frame. A compliant MPEG-2 receiver essentially receives TS packets belonging to a frame at their indicated (or implicit) PCR values and buffers them temporarily. A frame is removed from the buffer and decoded at its specified DTS value, and is presented to the viewer at its PTS value.
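The relationship between decoding order, DTS and PTS can be sketched as follows. This is a simplified model, not the encoder's actual timestamp logic: it assumes a stream that starts with an I frame, a fixed 30 frames per second, timestamps counted in 90 kHz ticks, and a presentation offset of exactly one frame period to absorb the reordering. The function names are hypothetical.

```python
TICKS_PER_SECOND = 90_000  # PTS/DTS are expressed in units of a 90 kHz clock

def decode_order(presentation):
    """Reorder frames so that every B frame follows both of its reference frames."""
    ordered, pending_b = [], []
    for index, frame_type in enumerate(presentation):
        if frame_type == "B":
            pending_b.append((index, frame_type))
        else:
            # An I or P frame is a reference: send it first, then the B frames
            # that precede it in presentation order.
            ordered.append((index, frame_type))
            ordered.extend(pending_b)
            pending_b = []
    ordered.extend(pending_b)
    return ordered

def timestamps(presentation, frame_rate=30):
    """Return (type, presentation_index, dts, pts) tuples in decoding order."""
    period = TICKS_PER_SECOND // frame_rate
    result = []
    for slot, (index, frame_type) in enumerate(decode_order(presentation)):
        dts = slot * period          # decoded one per frame period, in sending order
        pts = (index + 1) * period   # assumed one-period reordering offset
        result.append((frame_type, index, dts, pts))
    return result

stamped = timestamps("IBBPBBP")
```

In this model a B frame is presented as soon as it is decoded (its PTS equals its DTS), while an I or P frame waits in the receiver until the B frames that precede it in presentation order have been shown.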
While the goal of the compression layer is to minimize the number of bits required to represent the audio/video frames, one of the goals of the systems layer is to efficiently utilize the capacity of a communications channel. An encoder achieves this goal by smoothing the transmission of the variable-sized I, P and B frames through a buffer, and constraining the peak bit-rate of the transport stream. Larger frames are transmitted over a longer time interval than smaller ones, yielding a variable frame rate of departure (and arrival at the receiver). In order to help the receiver re-construct the fixed frame rate of presentation, while maintaining an open-loop network model, the encoder maintains and controls a model of the receiver buffer, called the Video Buffering Verifier (VBV). Typically, the receiver buffer is controlled by assigning a DTS value to a frame such that the sum of the delays experienced by each frame in the sender and receiver buffers is a constant. The size of the receiver buffer is referred to as the VBV buffer size, which is communicated in the sequence header of the elementary stream, while the amount of time each frame spends in the buffer is referred to as its VBV delay, which equals the difference between the arrival time of its first bit and its DTS value. As long as a receiver adheres to the VBV delay of the first frame of a sequence (initial VBV delay), the presentation can proceed at the frame rate of the original video source without danger of underflow or overflow at the receiver until the end of that sequence.
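A minimal sketch of the receiver buffer model follows. It assumes that bits arrive back to back at a constant rate starting at time zero, and that each frame is removed instantaneously at its DTS; the function name vbv_delays and the numeric values are illustrative assumptions, not values taken from the standard. Exact fractions are used so that the timing comparisons are not affected by rounding.

```python
from fractions import Fraction

def vbv_delays(frame_bits, rate_bps, frame_period, initial_vbv_delay, vbv_size_bits):
    """Return the VBV delay of each frame; raise ValueError on underflow or overflow.

    Assumes constant-rate, gap-free arrival starting at t = 0 (a simplification).
    """
    total_bits = sum(frame_bits)
    delays = []
    bits_before = 0  # bits of all frames preceding the current one
    for k, size in enumerate(frame_bits):
        dts = initial_vbv_delay + k * frame_period
        first_bit_arrival = Fraction(bits_before, rate_bps)
        last_bit_arrival = Fraction(bits_before + size, rate_bps)
        if last_bit_arrival > dts:
            raise ValueError(f"underflow: frame {k} is incomplete at its decode time")
        # Occupancy just before frame k is removed: bits arrived minus bits removed.
        occupancy = min(dts * rate_bps, total_bits) - bits_before
        if occupancy > vbv_size_bits:
            raise ValueError(f"overflow before frame {k} is removed")
        delays.append(dts - first_bit_arrival)
        bits_before += size
    return delays

# One large frame followed by three smaller ones at 3 Mb/s and 30 frames per second.
delays = vbv_delays(
    frame_bits=[300_000, 50_000, 50_000, 100_000],
    rate_bps=3_000_000,
    frame_period=Fraction(1, 30),
    initial_vbv_delay=Fraction(1, 10),
    vbv_size_bits=400_000,
)
```

With these inputs the delays are 1/10, 1/30, 1/20 and 1/15 of a second. Under the stated assumptions, each delay equals the previous delay plus the frame period minus the transmission time of the previous frame, so the initial VBV delay fixes every later delay in the sequence.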
Splicing refers to the back-to-back concatenation of two streams in order to create a single continuous stream. The last frame of the first stream is referred to as the “out-point” frame, while the first frame of the second stream is referred to as the “in-point” frame. In terms of bits of an MPEG TS (and as used in this application), the last bit of the first stream can be referred to as the out-point, and the first bit of the second stream as the in-point. A splice is “seamless” if the resultant stream is both syntactically correct (i.e., adheres to the MPEG stream syntax in the case of compressed MPEG video) and free of noticeable visual glitches. At a minimum, a visually seamless splicing operation must ensure that the VBV buffer does not overflow or underflow, the stream does not violate the capacity of the communications channel, and a fixed frame rate of presentation can be maintained in the resultant stream. Note that if bits are lost either due to a VBV overflow or a violation of channel capacity, the result may be a long-lasting glitch if the affected frame or frames are referenced by other prediction-based frames. If the VBV buffer underflows, the result is a “freeze-frame,” wherein a single frame is presented for more than one frame period. Moreover, any underflow represents lost time, which may be critical in the case of live video streams.
A common and increasingly important application of splicing is Digital Program Insertion (DPI). FIG. 1 shows a network 100 with DPI, as illustrated in the art. The network 100 includes a network encoder 102, an ad server 104, a DPI system 106, and a decoder 108. The ad server 104 includes one or more storage devices (e.g., storage device 110 and storage device 112). The DPI system 106 includes a splicer 114. The splicer 114 can be, for example, a splicing apparatus belonging to the operator of network 100 (e.g., a regional telco service provider or a cable multiple systems operator). The DPI system 106 receives a network stream 116 for distribution to its customers (e.g., through the decoder 108). The network stream 116 can be a broadcast channel. The network stream 116 includes embedded “cue messages” that indicate opportunities for the operator to insert advertisements or other local programs. In response to the cue messages, the splicer 114 sends an ad request 118 to the ad server 104 instructing it to stream an advertisement at a specified time instant. The ad request 118 contains the necessary information (e.g., channel number, program identifier, opportunity identifier, etc.) for the ad server 104 to determine the exact video stream to serve. The ad server 104 transmits an advertisement (ad) stream 120 to the splicer 114. At the appropriate time, the splicer 114 switches from the network stream 116 to the ad stream 120 provided by the ad server 104, and back to the network stream 116 at the end of the advertisement (e.g., the end of the ad stream 120). The spliced stream 122 is thus a back-to-back concatenation of multiple video sequences. Commercial splicers typically support several concurrent splicing operations, but do not scale very well.
Splicing uncompressed digital video streams or analog video streams (e.g., National Television System Committee (NTSC), or Phase Alternating Line (PAL)) is fairly straightforward. A splicer can switch between the out-point frame of the first stream and the in-point frame of the second stream during the vertical blanking interval between consecutive frames. One of the most difficult problems associated with splicing compressed video streams is related to the variable frame rate of arrival at the receiver and the corresponding VBV buffer management.
FIG. 2 shows an arrival sequence 200 for splicing as illustrated in the art. The arrival sequence 200 includes an arrival time of stream one 210, an arrival time of stream two 220, and a decode time of the spliced stream 230. Stream one 212 includes frame N-2 212A, frame N-1 212B, and frame N 212C. Stream two 214 includes frame 1 214A, frame 2 214B, and frame 3 214C. The decode time of spliced stream 230 includes a decode time of stream 1 frame N-2 216A, a decode time of stream 1 frame N-1 216B, a decode time of stream 1 frame N 216C, a decode time of stream 2 frame 1 216D, and a decode time of stream 2 frame 2 216E. The arrival sequence 200 includes a frame N VBV delay 218, a frame 1 VBV delay 220, a frame period 222, and an arrival overlap time 224. The arrival overlap time 224 is the difference between the arrival time of the first bit 228 of frame 1 214A and the arrival time of the last bit 226 of frame N 212C. The frame N VBV delay 218 is the time difference between the arrival time of the first bit 230 of frame N 212C and the decode time of stream 1 frame N 216C. The frame 1 VBV delay 220 is the difference between the decode time of stream 2 frame 1 216D and the arrival time of the first bit 228 of frame 1 214A. Stream one and stream two are transmitted at or below peak rate R 232. Time 234 progresses from the left to the right of the arrival sequence 200.
Each frame (e.g., frame N-2 212A, frame N-1 212B, and frame 1 214A) must adhere to its VBV delay (e.g., the VBV delay of frame N 212C is frame N VBV delay 218), controlled by the encoder, in order to maintain a fixed frame rate of presentation. The VBV delays of consecutive frames in a sequence are related by the initial VBV delay, the frame period, the sizes of the frames and the transmission bit rate. However, in general, the VBV delay of the out-point frame is completely unrelated to the VBV delay of the in-point frame. For example, frame N 212C is the out-point frame and frame 1 214A is the in-point frame, where the frame N VBV delay 218 is completely unrelated to the frame 1 VBV delay 220. To achieve seamless splicing, the in-point frame 1 214A must nevertheless be removed from the decoder buffer (not shown) exactly one frame period (e.g., frame period 222) after the removal of the out-point frame N 212C. In order to eliminate future underflow or overflow possibilities, the receiver must also adhere to the VBV delay of the in-point (e.g., frame 1 VBV delay 220). To provide for the latter, the splicer (e.g., the splicer 114 from FIG. 1) must ensure that the in-point frame 1 214A arrives at the receiver (e.g., the decoder 108) at precisely the decode time of stream 2 frame 1 216D minus frame 1 VBV delay 220. Specifically, the difference in arrival times of the first bits of the out-point and in-point frames must exactly equal the difference in their VBV delays plus the frame period. This may cause an arrival overlap time 224 of some of the trailing bits of the first stream and the initial bits of the second stream, resulting in a violation of the bit-rate of the channel and an overflow of the receiver buffer.
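The timing constraint above can be worked through numerically. The sketch below assumes that the out-point frame is transmitted back to back at the peak rate R, and uses illustrative values that are not taken from FIG. 2; the function name splice_timing is hypothetical.

```python
from fractions import Fraction

def splice_timing(out_first_bit_arrival, out_bits, out_vbv_delay,
                  in_vbv_delay, frame_period, rate_bps):
    """Return (required arrival time of the in-point's first bit, arrival overlap).

    Assumes the out-point frame is sent at the peak rate without gaps.
    """
    out_decode = out_first_bit_arrival + out_vbv_delay
    in_decode = out_decode + frame_period            # exactly one frame period later
    in_first_bit_arrival = in_decode - in_vbv_delay  # honor the in-point VBV delay
    out_last_bit_arrival = out_first_bit_arrival + Fraction(out_bits, rate_bps)
    overlap = max(Fraction(0), out_last_bit_arrival - in_first_bit_arrival)
    return in_first_bit_arrival, overlap

# The in-point frame carries a long VBV delay, so it must start arriving early.
arrival, overlap = splice_timing(
    out_first_bit_arrival=Fraction(0),
    out_bits=150_000,
    out_vbv_delay=Fraction(1, 10),
    in_vbv_delay=Fraction(1, 10),
    frame_period=Fraction(1, 30),
    rate_bps=3_000_000,
)

# A shorter in-point VBV delay pushes the required arrival later and removes the overlap.
arrival_short, overlap_short = splice_timing(
    Fraction(0), 150_000, Fraction(1, 10), Fraction(1, 20), Fraction(1, 30), 3_000_000
)
```

In the first case the in-point frame must begin arriving at 1/30 s, while the out-point frame is still arriving until 1/20 s, giving an overlap of 1/60 s. The required arrival time equals the out-point arrival time plus the difference of the two VBV delays plus the frame period, as stated above.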
There are several approaches in the art to deal with the above problem. A common solution is to delay the arrival of the first bit of the second stream (e.g., the first bit of frame 1 214A of stream two 214), if necessary, so as to eliminate the arrival overlap (e.g., the arrival overlap time 224). However, this may cause an underflow in the VBV buffer and a freeze-frame artifact, thereby rendering the splice non-seamless. A second approach is to uncompress both streams at the splicer, concatenate in the pixel domain, and compress the resultant stream, essentially re-creating the VBV buffer model for the spliced stream. This solution is extremely compute-intensive, as it involves full decode and encode operations, and does not scale very well. A third approach, commonly found in high-end commercial splicers, is to perform partial decode and encode operations in the frequency domain via a technique known as “transrating.” Specifically, the DCT coefficients around the splice point are retrieved and re-quantized in order to reduce the number of bits and eliminate any potential overlap in arrival times. While less compute-intensive than the second approach, transrating remains a bottleneck in scaling a compressed video splicer. Moreover, transrating compromises video quality around the splice point. A final approach, which is the basis of the Society of Motion Picture and Television Engineers (SMPTE) 312M standard, is to pre-condition the splice points so as to prevent any possibility of arrival overlap. This approach has found limited favor in the industry due to the difficulty in pre-conditioning streams so as to allow for the worst-case overlap scenario.
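The drawback of the first approach can be made concrete with a small check. The sketch assumes that the in-point frame is sent at the peak rate and that the splicer delays its first bit by exactly the arrival overlap; the function name and the numeric values are hypothetical.

```python
from fractions import Fraction

def delayed_splice_underflows(overlap, in_vbv_delay, in_bits, rate_bps):
    """Return True if delaying the in-point by `overlap` makes it miss its decode time."""
    # Time remaining between the delayed first bit and the unchanged decode time.
    effective_delay = in_vbv_delay - overlap
    transmit_time = Fraction(in_bits, rate_bps)  # assumes peak-rate, gap-free sending
    return transmit_time > effective_delay

# A large in-point frame no longer fits in the shortened window: freeze-frame.
large_frame_late = delayed_splice_underflows(
    overlap=Fraction(1, 60), in_vbv_delay=Fraction(1, 10),
    in_bits=270_000, rate_bps=3_000_000,
)

# A smaller in-point frame still arrives in time despite the delay.
small_frame_late = delayed_splice_underflows(
    overlap=Fraction(1, 60), in_vbv_delay=Fraction(1, 10),
    in_bits=150_000, rate_bps=3_000_000,
)
```

Whether the delay is harmless therefore depends on the size of the in-point frame relative to the VBV delay that remains after the shift, which the splicer does not control.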