Digital compressed video based on the MPEG set of standards has become the predominant choice for delivering and storing broadcast-quality digital video content. The ISO/IEC 13818 suite specifies the widely deployed MPEG-2 standard, while ISO/IEC 14496 part 10 (also known as ITU-T H.264) specifies the increasingly popular MPEG-4 AVC video coding standard which provides for a much improved coding efficiency. In general, MPEG distinguishes between the compression layer, responsible for coding the raw video and associated audio signals, and the systems layer, responsible for the carriage, synchronization and timing of multiple such compressed audio/video signals. It is common to find MPEG-4 AVC compressed video carried with the MPEG-2 systems layer. We describe this invention in terms of MPEG-2 systems constructs with the knowledge that it is equally applicable to other similarly layered digital video.
In a typical MPEG-2 encoder, the compression layer receives a periodic sequence of frames (also referred to as pictures) of uncompressed digital video, and its associated audio, along with other system information such as program information data commonly known as EPG (Electronic Program Guide). Audio and video signals are first compressed at the source coding level into elementary streams, which are then converted into packetized elementary streams (PES) where decoding and presentation timestamps are inserted for ease of system level multiplexing and presentation.
Broadly speaking, an MPEG-2 compressed video stream is generated in two stages, the first in the so-called pixel domain and the next in the frequency domain. In the first stage, a technique referred to as “motion compensated prediction” is applied to the video frames in order to remove temporal redundancy across frames. Specifically, the frames are divided into three types: intra-coded or I frames, forward predicted or P frames, and bi-directionally predicted or B frames. An I frame is self-contained and may be decoded independently by the receiver, whereas P and B frames only contain differential information with respect to other frames. Consequently, the latter are expected to be significantly smaller in size than I frames. In the next stage, blocks of pixels of each frame are transformed into the “frequency domain” using the Discrete Cosine Transform (DCT) operation in order to remove the spatial redundancy within the frame. The result of this step is a set of DCT coefficients. To further reduce the number of bits required to represent each frame, the DCT coefficients are quantized (by dividing each value by a pre-determined scale, with a finer-grain quantization corresponding to a higher video quality) and the quantized set is subject to a run-length coding operation. The result of run-length coding, along with coding modes and motion vectors, is further coded with entropy coding techniques such as Huffman coding and arithmetic coding.
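By way of illustration only, the quantization and run-length coding steps described above may be sketched as follows. This is a simplified sketch and not the MPEG-2 algorithm itself: it assumes a single scalar scale, truncation toward zero, and coefficients that are already in scan order, whereas a real encoder uses quantization matrices, a zig-zag scan and standardized code tables. The function names and the sample coefficients are hypothetical.

```python
def quantize(coeffs, scale):
    """Divide each DCT coefficient by the scale, truncating toward zero."""
    return [int(c / scale) for c in coeffs]


def run_length_encode(levels):
    """Return (zero_run, level) pairs for each nonzero level, then an end marker."""
    pairs = []
    run = 0
    for level in levels:
        if level == 0:
            run += 1  # count zeros preceding the next nonzero level
        else:
            pairs.append((run, level))
            run = 0
    pairs.append("EOB")  # trailing zeros are implied by the end-of-block marker
    return pairs


# Hypothetical coefficients of one block, already in scan order.
coeffs = [52, -31, 0, 7, 0, 0, -3, 1]

fine = run_length_encode(quantize(coeffs, 2))    # small scale: more symbols kept
coarse = run_length_encode(quantize(coeffs, 8))  # large scale: fewer symbols kept
print(fine)
print(coarse)
```

In this sketch the coarser scale drives more coefficients to zero, so fewer (run, level) symbols remain to be entropy coded, at the cost of a larger reconstruction error.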
Separate compression layers are responsible for generating one or more corresponding audio elementary streams (e.g., one per dubbed language). In addition, a video program may also contain one or more data elementary streams, e.g., to carry program-specific information.
The MPEG-2 systems layer multiplexes several compressed elementary streams (audio, video and data), belonging to one or more video programs, into a single “transport stream” (TS), suitable for storage and network transmission of the program(s). In addition to multiplexing, the systems layer performs several roles including packetization of the compressed signals, stream synchronization and timing control, all of which are relevant to the present invention. As shown in FIG. 1, compressed audio and video frames 105A, 105B and 110, along with other data such as program association table (PAT) data and program map table (PMT) data, are carried in transport packets of fixed length, each consisting of 188 bytes (e.g., video transport packet 120, audio transport packet 130, non-audio/video transport packet 140). For each of the transport packets 120, 130 and 140, the first 4 bytes are the fixed transport header bytes. The first byte of the header is hard coded (0x47) as a synchronization byte. The other three bytes include a 13-bit packet identification (PID) field. This field is used for multiplexing different audio, video and other types of data such as the PAT and PMT. The MPEG-2 standard pre-allocates some PIDs for fixed purposes. PID 0 is used for the PAT. PID 8191 is used for NULL packets, which are used for padding. One of the fields in the TS header is the Payload Unit Start Indicator (PUSI), which identifies the start of a video frame (e.g., video frame 105A) or audio frame (e.g., audio frame 110). An MPEG-2 transport stream can contain multiple programs, with the PAT specifying each program's PMT PID. The PMT of each program specifies which PIDs are used for that program, and each PID's usage (audio, video, and others). Note that a program can contain multiple audio streams but typically only one video stream. Since there is only one NULL PID, NULL packets do not belong to any particular program but to the whole transport stream.
Of course, if there is only one program contained in a bitstream, then the NULL packets can be considered to belong to that program.
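For illustration, the transport header fields discussed above (the synchronization byte, the 13-bit PID and the PUSI) may be extracted from a 188-byte packet roughly as follows. The bit positions are those of the MPEG-2 systems transport packet syntax; the function name, the returned dictionary and the sample packets are hypothetical and serve only this sketch.

```python
TS_PACKET_SIZE = 188
SYNC_BYTE = 0x47
PAT_PID = 0
NULL_PID = 8191


def parse_ts_header(packet: bytes) -> dict:
    """Extract the PID and PUSI from the 4-byte transport header."""
    if len(packet) != TS_PACKET_SIZE:
        raise ValueError("transport packets are exactly 188 bytes")
    if packet[0] != SYNC_BYTE:
        raise ValueError("missing 0x47 synchronization byte")
    pusi = bool(packet[1] & 0x40)                # payload_unit_start_indicator bit
    pid = ((packet[1] & 0x1F) << 8) | packet[2]  # 5 high bits + 8 low bits = 13 bits
    return {"pid": pid, "pusi": pusi, "is_null": pid == NULL_PID}


# Hypothetical sample packets: a NULL (padding) packet and the start of a PAT.
null_packet = bytes([0x47, 0x1F, 0xFF, 0x10]) + bytes([0xFF]) * 184
pat_packet = bytes([0x47, 0x40, 0x00, 0x10]) + bytes(184)

print(parse_ts_header(null_packet))
print(parse_ts_header(pat_packet))
```

A demultiplexer built along these lines would first read the PAT on PID 0 to find each program's PMT PID, and then use the PMT to learn which PIDs carry that program's audio and video.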
The MPEG-2 encoder communicates a time-base, referred to as a Program Clock Reference (PCR), to the receiver via a field in the TS packet header. Not all TS packets carry a PCR value; however, we may assume an implicit PCR associated with each packet. A PCR value denotes the relative departure time of the packet at the sender. The systems layer assumes a constant-delay transmission network and relies on independent means to compensate for delay jitter in the network, if any. Consequently, the PCR also denotes the relative arrival time of the packet at the receiver. As MPEG-2 is primarily designed for an open-loop network, the sequence of incoming PCR values is used by the receiver to lock its clock to that of the sender, so as to maintain an identical frame period as at the input to the encoder, and thereby also to avoid buffer underflow and overflow with respect to the incoming stream. In order to control and synchronize (e.g., to maintain lip sync) the presentation time of each audio and video frame in the multiplex, the encoder communicates a Presentation Time-Stamp (PTS) with each frame. In addition, to provide for correct decoding of bi-directionally predicted frames, the systems layer at the encoder sends frames in decoding order (as opposed to presentation order), and communicates a Decode Time-Stamp (DTS) with each frame, when the DTS is different from the PTS of that frame. A compliant MPEG-2 receiver essentially receives TS packets belonging to a frame at their indicated (or implicit) PCR values and buffers them temporarily. A frame is removed from the buffer and decoded at its specified DTS value, and is presented to the viewer at its PTS value. Note that the standard assumes an idealized model where decoding time is zero, which is not true in practice. Hence in a real world implementation, more buffering is required to compensate for the non-zero decoding time.
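The relationship between decoding order, DTS and PTS described above may be illustrated with the following sketch. It assumes a hypothetical four-frame sequence with a 40 ms frame period and timestamps expressed in milliseconds for readability; the frame names, values and helper functions are assumptions made for this example only. A DTS of `None` models the case in which no DTS is sent because it equals the PTS.

```python
# Frames in decoding (transmission) order: (name, dts_ms or None, pts_ms).
DECODE_ORDER = [
    ("I0", 0, 40),
    ("P3", 40, 160),     # sent early because B1 and B2 are predicted from it
    ("B1", None, 80),    # DTS omitted: equal to PTS
    ("B2", None, 120),
]


def effective_dts(frame):
    """Return the DTS, falling back to the PTS when no DTS was sent."""
    _, dts, pts = frame
    return pts if dts is None else dts


def presentation_order(frames):
    """Reorder frames from decoding order into presentation order by PTS."""
    return [name for name, _, pts in sorted(frames, key=lambda f: f[2])]


def timing_is_consistent(frames):
    """Check that DTS never decreases in stream order and never exceeds PTS."""
    dts_values = [effective_dts(f) for f in frames]
    in_order = all(a <= b for a, b in zip(dts_values, dts_values[1:]))
    decodable = all(effective_dts(f) <= f[2] for f in frames)
    return in_order and decodable


print(presentation_order(DECODE_ORDER))
print(timing_is_consistent(DECODE_ORDER))
```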
Due to the nature of video signals and the usage of I, P and B frame coding, the size of a compressed video frame can vary significantly. On the other hand, a typical communication channel has fixed bandwidth. In order to carry the variable sized compressed video frames over a fixed bandwidth channel, a buffer is typically used between the output of the video encoder and the input of the channel to smooth the bitrate variation. Larger frames are transmitted over a longer time interval as opposed to the smaller ones, yielding a variable frame rate of departure (and arrival at the receiver). In order to help the receiver re-construct the fixed frame rate of presentation, while maintaining an open-loop network model, the encoder maintains and controls a model of the receiver buffer, called the Video Buffering Verifier (VBV). Typically, the receiver buffer is controlled by assigning a DTS value to a frame such that the sum of the delays experienced by each frame in the sender and receiver buffers is a constant. The size of the receiver buffer is referred to as the VBV buffer size, which is communicated in the sequence header of the elementary stream, while the amount of time each frame spends in the buffer is referred to as its VBV delay, which equals the difference between the arrival time of its first bit and its DTS value. As long as a receiver adheres to the VBV delay of every frame of a sequence, the presentation can proceed at the frame rate of the original video source without danger of underflow or overflow at the receiver until the end of that sequence.
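As an illustration of the VBV delay defined above (the DTS of a frame minus the arrival time of its first bit), the following sketch models frames of varying size delivered back to back over a constant-rate channel. It is a simplified model under stated assumptions, not the normative VBV: times are in milliseconds, the channel rate is in bits per millisecond, overflow is not checked, and the function name, frame sizes and delays are hypothetical.

```python
def vbv_delays(frame_bits, channel_bits_per_ms, frame_period_ms, initial_delay_ms):
    """Return a (vbv_delay_ms, underflow) pair for each frame.

    The first frame is decoded initial_delay_ms after its first bit arrives,
    and each later frame is decoded one frame period after the previous one.
    """
    results = []
    arrival_start = 0.0
    for n, bits in enumerate(frame_bits):
        dts = initial_delay_ms + n * frame_period_ms
        arrival_end = arrival_start + bits / channel_bits_per_ms
        vbv_delay = dts - arrival_start   # time the frame's first bit waits
        underflow = arrival_end > dts     # last bit arrives after decode time
        results.append((vbv_delay, underflow))
        arrival_start = arrival_end       # next frame follows immediately
    return results


# Hypothetical sizes: one large I frame, one P frame and two small B frames.
sizes = [100_000, 40_000, 10_000, 10_000]

ok = vbv_delays(sizes, 1000, 40, 120)
too_short = vbv_delays(sizes, 1000, 40, 80)
print(ok)
print(too_short[0])
```

In this example the large first frame takes 100 ms to arrive, so an initial delay of 80 ms causes an underflow, whereas 120 ms lets every frame arrive before its DTS.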
On the other hand, each audio frame is typically compressed to a constant number of bits and hence there is no need to buffer data between the encoder and the decoder. In order to achieve synchronized presentation of the audio and video, the buffering delays for audio and video are significantly different. As a result, the audio and video data at the same stream position have PTS values which are far apart (on the order of hundreds of milliseconds) as shown in FIG. 2. In the Arrival Time (which is the same as transmission time) program 201, video frame N 210A is adjacent to audio frame M−2 220A and audio frame M−1 220B. However, in the Presentation Time program 202, video frame N 240 is presented at approximately the same time as audio frame M 250. Hence in the bitstream, there is a time lag between video frame N 210A and audio frame M 220B. This time lag is referred to as the audio-video lag 230. The resulting effect is commonly referred to as the audio-video lag problem, and, as described in more detail below, this causes problems for video stream splicing. To make things worse, there is typically more than one audio stream associated with a given video stream to support multiple languages. These multiple audio streams are not necessarily aligned with each other.
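One simple way to estimate the audio-video lag from a multiplex is to compare the PTS of each video frame with the PTS of the audio frame that most recently preceded it in arrival order. The following sketch assumes frames are given as hypothetical (kind, PTS in milliseconds) pairs in arrival order; the function name and sample values are illustrative only.

```python
def audio_video_lags(frames):
    """Return, for each video frame, its PTS minus the PTS of the latest audio frame.

    frames is a list of (kind, pts_ms) pairs in arrival order, where kind is
    "audio" or "video". Video frames seen before any audio frame are skipped.
    """
    lags = []
    last_audio_pts = None
    for kind, pts in frames:
        if kind == "audio":
            last_audio_pts = pts
        elif kind == "video" and last_audio_pts is not None:
            lags.append(pts - last_audio_pts)
    return lags


# Hypothetical multiplex: video is buffered far longer than audio, so a video
# frame's PTS runs well ahead of the audio frames that arrive next to it.
multiplex = [
    ("audio", 1000),
    ("video", 1400),
    ("audio", 1024),
    ("audio", 1048),
    ("video", 1440),
]

print(audio_video_lags(multiplex))
```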
Splicing refers to the back-to-back concatenation of two digital video streams in order to create a single continuous stream. The last frame of the first stream is referred to as the “out-point” frame, while the first frame of the second stream is referred to as the “in-point” frame. In terms of bits of an MPEG TS, we may refer to the last bit of the first stream as the out-point, and the first bit of the second stream as the in-point. A splice is said to be “seamless” if, for example, the resultant stream is both syntactically correct (i.e., adheres to the MPEG stream syntax in the case of compressed MPEG video) and free of audio and visual glitches. A visually seamless splicing operation ensures that the VBV buffer does not overflow or underflow, the stream does not violate the capacity of the communications channel, a fixed frame rate of presentation can be maintained in the resultant stream, and audio and video synchronization is not lost during the transition. Note that if bits are lost due to either a VBV overflow or a violation of channel capacity, such loss may result in a long-lasting glitch if the affected frame(s) are referenced by other prediction-based frames. If the VBV buffer underflows, such underflow results in a “freeze-frame,” wherein a single frame is presented for more than one frame period. Moreover, any underflow represents lost time, which may be critical in the case of live video streams.
A common and increasingly important application of splicing is Digital Program Insertion (DPI). FIG. 3 shows a network 300 with DPI, as illustrated in the art. The network 300 includes a network encoder 302, an ad server 304, a DPI system 306, and a decoder 308. The ad server 304 includes one or more storage devices (e.g., storage device 310 and storage device 312). The DPI system 306 includes a splicer 314. The splicer 314 can be, for example, a splicing apparatus belonging to the operator of network 300 (e.g., a regional telecommunications service provider or a cable multiple system operator (MSO)). The DPI system 306 receives a network stream 316 for distribution to its customers (e.g., through the decoder 308). The network stream 316 can be a broadcast channel. The network stream 316 includes embedded “cue messages” that indicate opportunities for the operator to insert advertisements or other local programs. In response to the cue messages, the splicer 314 sends an advertisement (ad) request 318 to the ad server 304 instructing it to stream an advertisement at a specified time instant. The ad request 318 contains all the necessary information (e.g., channel number, program identifier, opportunity identifier, etc.) for the ad server 304 to determine the exact video stream to serve. The ad server 304 transmits an advertisement (ad) stream 320 to the splicer 314. At the appropriate instant, the splicer 314 switches from the network stream 316 to the ad stream 320 provided by the ad server 304, and back to the network stream 316 at the end of the advertisement (e.g., the end of the ad stream 320). The spliced stream 322 is thus a back-to-back concatenation of multiple video sequences. Commercial splicers typically support several concurrent splicing operations, but do not scale very well.
As the number of customer-specific personalized video streams increases due to the increasing diversity of content and newer video applications such as time-shifted television, and due to the need to better monetize broadcast content by inserting customer-specific advertisements, there remains an unmet need in the industry for splicers that can scale (in terms of concurrent splices) in a cost-effective manner.
Splicing uncompressed digital video streams or analog video streams (e.g., NTSC, PAL) is fairly straightforward. A splicer can easily switch between the out-point frame of the first stream and the in-point frame of the second stream during the vertical blanking interval between consecutive frames. One of the most difficult problems associated with splicing compressed video streams is related to the audio-video lag problem.
FIG. 4 shows an example of a presentation sequence 400 for stream splicing which highlights the audio-video lag problem, as illustrated in the art. The presentation sequence 400 includes a stream one 410, a presentation time of stream one 420, a stream two 430, a presentation time of stream two 440, a spliced stream 450 and a presentation time of the spliced stream 460. Stream one 410 includes video frame N−1 412A, video frame N 412B, video frame N+1 412C, video frame N+2 412D, video frame N+3 412E, audio frame M−2 414A, audio frame M−1 414B, audio frame M 414C, and audio frame M+1 414D. Stream two 430 includes video frame P−1 432A, video frame P 432B, video frame P+1 432C, video frame P+2 432D, video frame P+3 432E, audio frame Q−3 434A, audio frame Q−2 434B, audio frame Q−1 434C, audio frame Q 434D, and audio frame Q+1 434E. Stream One 410 is switched out from video frame N 412B and audio frame M 414C, as they have the same approximate presentation time. Similarly Stream Two 430 is spliced in from video frame P 432B and audio frame Q 434D, as they have the same approximate presentation time. In a simple splicing operation, Stream Two 430 at the beginning of frame P 432B would just be attached to the end of Stream One 410 at the end of frame N 412B. However, due to the audio-video lag issue mentioned above, the packets from the two streams 410 and 430 have to be interleaved (470) during the transition (shown in the figure as the rectangle shaded by forward slashes), in order to maintain the relative packet arrival time for both streams 410 and 430. To make things worse, the two streams 410 and 430 will typically have different audio-video lags, and hence the spliced stream 450 will either have an audio gap or overlapped audio data if the video is made to be seamless. FIG. 4 shows the case of audio gap 480 (shown as the rectangle shaded with back slashes).
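The dependence of the audio gap or overlap on the two audio-video lags may be sketched as follows. This is a simplified model and an assumption of this example, not a statement of how any particular splicer operates: the video of the two streams is taken to be concatenated at a common instant, the audio of the first stream is taken to continue arriving for that stream's lag after the out-point, and the audio of the second stream is taken to begin arriving after that stream's lag from the in-point. The function name and the lag values are hypothetical.

```python
def audio_transition(lag_out_ms, lag_in_ms):
    """Classify the audio transition at a video-seamless splice.

    lag_out_ms: audio-video lag of the stream being switched out.
    lag_in_ms:  audio-video lag of the stream being spliced in.
    Returns ("gap", ms), ("overlap", ms) or ("aligned", 0).
    """
    difference = lag_in_ms - lag_out_ms
    if difference > 0:
        return ("gap", difference)       # new audio starts after old audio ends
    if difference < 0:
        return ("overlap", -difference)  # new audio starts before old audio ends
    return ("aligned", 0)


print(audio_transition(300, 450))
print(audio_transition(450, 300))
```

Under these assumptions, a larger lag in the incoming stream produces a gap like the audio gap 480, while a smaller lag produces overlapped audio data.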
Another problem with splicing is a VBV delay mismatch. The out-point VBV delay and in-point VBV delay typically are different from each other, which can lead to either decoder buffer overflow or decoder buffer underflow. Yet another issue with splicing is the identification of in-point and out-point. Not every video frame can be an in-point and/or out-point in order to achieve seamless splicing. An in-point frame has to be a random access point, and all frames after the in-point frame, in decoding order, cannot use data before the in-point frame for prediction. An out-point frame has to be a frame such that the presentation of video will have no gap before the splicing point.
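The in-point condition stated above may be illustrated with the following sketch, which scans a sequence of frames in decoding order and keeps only those I frames after which no frame references earlier data. Treating every I frame as a random access point is a simplification made for this example, and the frame representation (a type plus a list of referenced frame indices) and the function name are hypothetical.

```python
def in_point_candidates(frames):
    """Return indices of frames that could serve as in-points.

    frames is a list of (frame_type, reference_indices) pairs in decoding
    order, where reference_indices are positions in the same list.
    """
    candidates = []
    for i, (frame_type, _) in enumerate(frames):
        if frame_type != "I":
            continue  # only intra-coded frames are considered access points here
        later_refs = [ref for _, refs in frames[i + 1:] for ref in refs]
        if all(ref >= i for ref in later_refs):
            candidates.append(i)
    return candidates


# Hypothetical sequence: the I frame at index 3 is followed by a B frame that
# still predicts from the P frame at index 1, so it is not a clean in-point.
sequence = [
    ("I", []),
    ("P", [0]),
    ("B", [0, 1]),
    ("I", []),
    ("B", [1, 3]),
    ("P", [3]),
    ("I", []),
    ("P", [6]),
]

print(in_point_candidates(sequence))
```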
There are several different approaches in the prior art to deal with the above problems. As shown in FIG. 5, a common solution is to decompress both streams at the splicer, concatenate them in the pixel domain for video and in the sample domain for audio, and recompress the resultant stream. While this solution can address all of the above problems with great flexibility, it is extremely compute intensive as it involves full decode and encode operations, and does not scale in a cost-effective manner. In order to achieve splicing at a relatively large scale, streams have to be concatenated in the bitstream domain. A simple stream concatenation will cut off the audio by the audio-video lag amount in the old stream, while in the new stream, there will be no audio for the audio-video lag amount. Hence depending on the decoder's handling of such spliced streams, two audio-video presentation scenarios can happen. In one scenario, the trailing audio of the old stream will be lost, while the new stream will start with audio-only playback. In the other scenario, the trailing audio of the old stream will be played back along with the starting part of the new stream. Neither scenario is ideal in terms of audio-video quality. To solve the VBV mismatch problem, a dynamic transmission rate approach can be used to make the VBV delay match without modifying the bitstream itself. While this approach maintains the best picture quality, it cannot be applied in all network configurations. Another approach to solve the VBV delay mismatch issue, commonly found in high-end commercial splicers, is to perform partial decode and encode operations in the frequency domain via a technique known as “transrating.” Specifically, the DCT coefficients around the splice point are retrieved and re-quantized in order to reduce the number of bits and eliminate any potential overlap in arrival times.
While less compute intensive than the full-decoding/re-encoding approach, transrating remains a bottleneck in scaling a compressed video splicer if it is done for all sessions. A final approach, which is the basis of the SMPTE 312M standard, is to pre-condition the splice points at the source of the bitstream so as to prevent any possibility of arrival overlap. This approach found limited favor in the industry due to the difficulty in pre-conditioning streams so as to allow for the worst-case overlap scenario.