The present invention relates to a method and system for editing video bit streams. In particular, the present invention relates to a method and system for encoding transitions between encoded video segments so that the ending buffer fullness of one segment is large enough that no buffer underflow occurs while the following segment is decoded.
A video encoder system 10 is illustrated in FIG. 1. The system 10 includes a source of video 12, a preprocessor 14, a video encoder 16, a rate buffer 18 and a controller 20. The source 12 of video is, for example, a video camera, or a telecine machine which converts a sequence of film images into a sequence of video frames, or other device which outputs a sequence of video frames. The frames may be interlaced or progressive. An interlaced frame picture comprises top and bottom field pictures. The preprocessor 14 performs a variety of functions to place the sequence of video frames into a format in which the frames can be compressed by the encoder. For example, in the case where the video source is a telecine machine which outputs 30 frames per second, the preprocessor converts the video signal into 24 frames per second for compression in the encoder 16 by detecting and eliminating duplicate fields produced by the telecine machine. In addition, the preprocessor may spatially scale each picture of the source video so that it has a format which meets the parameter ranges specified by the encoder 16.
The video encoder 16 is preferably an encoder which utilizes a video compression algorithm to provide an MPEG-2 compatible bit stream. The MPEG-2 bit stream has six layers of syntax. These are the sequence layer (random access unit, context), the Group of Pictures layer (random access unit, video coding), the picture layer (primary coding layer), the slice layer (resynchronization unit), the macroblock layer (motion compensation unit), and the block layer (DCT unit). A group of pictures (GOP) is a set of frames which starts with an I-picture and includes a certain number of P and B pictures. The number of pictures in a GOP may be fixed.
The encoder distinguishes between three kinds of pictures, I, P, and B. The coding of I pictures results in the most bits. In an I-picture, each macroblock is coded as follows. Each 8×8 block of pixels in a macroblock undergoes a DCT transform to form an 8×8 array of transform coefficients. The transform coefficients are then quantized with a variable quantizer matrix. The resulting quantized DCT coefficients are scanned using, e.g., zig-zag scanning, to form a sequence of DCT coefficients. The DCT coefficients are then organized into (run, level) pairs. The (run, level) pairs are then entropy encoded. In an I-picture, each macroblock is encoded according to this technique.
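By way of illustration only, the scanning and (run, level) steps described above may be sketched as follows. The function names `zigzag_order` and `run_level_pairs` are hypothetical, the block is assumed to hold already-quantized coefficients, and details of an actual MPEG-2 encoder (such as separate handling of the intra DC coefficient and the end-of-block code) are deliberately omitted.

```python
def zigzag_order(n=8):
    """Return the (row, col) positions of an n x n block in zig-zag scan order."""
    # Positions on one anti-diagonal share row + col; the walking direction
    # alternates from one anti-diagonal to the next.
    return sorted(
        ((r, c) for r in range(n) for c in range(n)),
        key=lambda rc: (rc[0] + rc[1], rc[0] if (rc[0] + rc[1]) % 2 else rc[1]),
    )


def run_level_pairs(block):
    """Convert a square block of quantized coefficients into (run, level) pairs.

    'run' is the number of zero coefficients preceding a non-zero 'level'
    in scan order. Trailing zeros produce no pair (simplifying assumption).
    """
    pairs = []
    run = 0
    for r, c in zigzag_order(len(block)):
        level = block[r][c]
        if level == 0:
            run += 1
        else:
            pairs.append((run, level))
            run = 0
    return pairs


if __name__ == "__main__":
    block = [[0] * 8 for _ in range(8)]
    block[0][0], block[0][1], block[2][0] = 12, 3, -1
    print(run_level_pairs(block))  # [(0, 12), (0, 3), (1, -1)]
```

Because most high-frequency coefficients quantize to zero, the sequence of pairs is typically much shorter than the 64 coefficients of the block, which is what makes the subsequent entropy coding effective.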
In a P-picture, a decision is made to code the macroblock as an I macroblock, which is then encoded according to the technique described above, or to code the macroblock as a P macroblock. For each P macroblock, a prediction of the macroblock in a previous video picture is obtained. The prediction is identified by a motion vector which indicates the translation between the macroblock to be coded in the current picture and its prediction in a previous picture. (A variety of block matching algorithms can be used to find the particular macroblock in the previous picture which is the best match with the macroblock to be coded in the current picture. This “best match” macroblock becomes the prediction for the current macroblock.) The predictive error between the predictive macroblock and the current macroblock is then coded using the DCT, quantization, scanning, (run, level) pair encoding, and entropy encoding.
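One of the many possible block matching algorithms is an exhaustive search that minimizes the sum of absolute differences (SAD). The sketch below is offered only as an example of such a search and is not asserted to be the method used by any particular encoder; the names `sad` and `best_motion_vector`, the block size, and the search range are all assumptions chosen for illustration.

```python
def sad(cur, ref, bx, by, dx, dy, size):
    """Sum of absolute differences between the current block at (bx, by)
    and the reference block displaced by (dx, dy)."""
    total = 0
    for y in range(size):
        for x in range(size):
            total += abs(cur[by + y][bx + x] - ref[by + dy + y][bx + dx + x])
    return total


def best_motion_vector(cur, ref, bx, by, size=16, search=7):
    """Exhaustively search displacements within +/- search pixels.

    Returns (sad, dx, dy) for the best match, or None if no candidate
    block lies entirely inside the reference picture.
    """
    height, width = len(ref), len(ref[0])
    best = None
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            inside = (0 <= by + dy and by + dy + size <= height
                      and 0 <= bx + dx and bx + dx + size <= width)
            if not inside:
                continue
            cost = sad(cur, ref, bx, by, dx, dy, size)
            if best is None or cost < best[0]:
                best = (cost, dx, dy)
    return best


if __name__ == "__main__":
    # Synthetic pictures: the current picture is the reference shifted
    # by 2 pixels horizontally and 1 pixel vertically.
    ref = [[37 * y + 11 * x for x in range(12)] for y in range(12)]
    cur = [[37 * (y + 1) + 11 * (x + 2) for x in range(12)] for y in range(12)]
    print(best_motion_vector(cur, ref, 4, 4, size=4, search=2))  # (0, 2, 1)
```

The returned displacement plays the role of the motion vector, and the per-pixel differences at that displacement form the predictive error that is subsequently transformed and quantized.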
In the coding of a B-picture, a decision has to be made as to the coding of each macroblock. The choices are (a) intracoding (as in an I macroblock), (b) unidirectional backward predictive coding using a subsequent picture to obtain a motion compensated prediction, (c) unidirectional forward predictive coding using a previous picture to obtain a motion compensated prediction, and (d) bidirectional predictive coding wherein a motion compensated prediction is obtained by interpolating a backward motion compensated prediction and a forward motion compensated prediction. In the cases of forward, backward, and bidirectional motion compensated prediction, the predictive error is encoded using DCT, quantization, zig-zag scanning, run, level pair encoding, and entropy encoding.
B pictures have the smallest number of bits when encoded, then P pictures, with I pictures having the most bits when encoded. Thus, the greatest degree of compression is achieved for B pictures. For each of the I, B, and P pictures, the number of bits resulting from the encoding process can be controlled by controlling the quantizer step size. A macroblock of pixels or pixel errors which is coded using a large quantizer step size results in fewer bits than if a smaller quantizer step size is used. Other techniques may also be used to control the number of encoded bits.
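The effect of the quantizer step size on the number of coded values can be seen in a toy example. The sketch below assumes a single uniform step size and truncation toward zero, which is a simplification of the matrix-based quantization actually used in MPEG-2; the names `quantize` and `count_nonzero` and the coefficient values are hypothetical.

```python
def quantize(coefficients, step):
    """Uniformly quantize coefficients, truncating toward zero (simplified)."""
    return [int(c / step) for c in coefficients]


def count_nonzero(values):
    """Number of non-zero quantized values, a rough proxy for coded bits."""
    return sum(1 for v in values if v != 0)


if __name__ == "__main__":
    coefficients = [120, -33, 14, 9, -6, 3, 2, -1]
    fine = quantize(coefficients, 4)     # [30, -8, 3, 2, -1, 0, 0, 0]
    coarse = quantize(coefficients, 16)  # [7, -2, 0, 0, 0, 0, 0, 0]
    print(count_nonzero(fine), count_nonzero(coarse))  # 5 2
```

With the larger step size, fewer coefficients survive and the surviving levels are smaller, so fewer bits are needed after (run, level) and entropy coding, at the cost of a larger reconstruction error.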
After encoding by the video encoder, the bit stream is stored in the encoder buffer 18. Then, the encoded bits are transmitted via a channel 21 to a decoder, where the encoded bits are received in a buffer of the decoder or stored in a recording medium for later transmission to a decoder.
A decoder system 30 is shown in FIG. 2. An encoded video bit stream arrives via the transmission channel 21 and is stored in the decoder buffer 32. The size of the decoder buffer 32 is specified in the MPEG-2 specification. The encoded video is decoded by the video decoder 34 which is preferably an MPEG-2 compliant decoder. The decoded video sequence is then displayed using the display 36.
The purpose of rate control is to maximize the perceived quality of the encoded video sequence when it is decoded at a decoder by intelligently allocating the number of bits used to encode each picture. The sequence of bit allocations to successive pictures preferably ensures that an assigned channel bit rate is maintained and that decoder buffer exceptions (overflow or underflow) are avoided. The allocation process takes into account the picture type (I, P or B) and scene dependent coding complexity. To accomplish rate control at the encoder, the controller 20 maintains a model of the decoder buffer. This model is known as the video buffer verifier (vbv). The vbv is described in detail in Annex C of the MPEG-2 video standard the contents of which are incorporated herein by reference. (MPEG-2 Video Specification, Annex C). Bits are entered into the vbv in a manner which models the transmission channel. Bits are removed from the vbv in a manner which models removal of the bits by the decoder. That is, all the bits belonging to a picture are removed instantaneously. Based on the occupancy or fullness of the vbv, the controller 20 executes a rate control algorithm and feeds back control signals to the encoder 16 (and possibly to the preprocessor 14, as well) to control the number of bits generated by the encoder for succeeding pictures, and for succeeding macroblocks within each picture.
The rate control algorithm executed by the controller 20 controls the encoder 16 by controlling the overall number of bits allocated to each picture. The controller allocates bits to successive pictures to be encoded in the future so that the occupancy of the vbv is controlled thereby preventing exceptions at the decoder buffer 32. The predicted occupancy of the vbv buffer at any time depends on the number of bits which enter the vbv based on a model of the transmission channel and the number of bits removed from the vbv based on the predicted number of bits used to encode each picture.
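The vbv model described above can be sketched for the constant bit rate case as follows. This is a minimal illustration under stated assumptions, not the normative procedure of Annex C: the function name `simulate_cbr_vbv` and its parameters are hypothetical, bits are assumed to arrive at exactly `bit_rate / picture_rate` per picture period, and each picture is removed instantaneously.

```python
def simulate_cbr_vbv(picture_bits, bit_rate, picture_rate, vbv_size,
                     initial_fullness):
    """Return the vbv fullness just before each picture is removed.

    picture_bits     -- coded size of each picture, in decode order
    initial_fullness -- fullness just before the first picture is removed
    Raises ValueError on a buffer exception (overflow or underflow).
    """
    bits_per_period = bit_rate / picture_rate
    fullness = initial_fullness
    trace = []
    for index, size in enumerate(picture_bits):
        if fullness > vbv_size:
            raise ValueError(f"vbv overflow before picture {index}")
        if size > fullness:
            raise ValueError(f"vbv underflow at picture {index}")
        trace.append(fullness)
        # Remove the whole picture at once, then admit one period of bits.
        fullness = fullness - size + bits_per_period
    return trace


if __name__ == "__main__":
    trace = simulate_cbr_vbv(
        picture_bits=[400_000, 100_000, 100_000],
        bit_rate=3_000_000, picture_rate=30,
        vbv_size=1_000_000, initial_fullness=500_000,
    )
    print(trace)  # [500000, 200000.0, 200000.0]
```

A controller of the kind described could run such a model on predicted picture sizes and adjust its allocations whenever the predicted trace approaches either limit.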
One conventional rate control algorithm is the MPEG-2 Test Model (TM). The TM rate control is designed to expend a fixed average number of bits per group of pictures (GOP). If too many bits are spent on one GOP, then the excess will be remedied by allocating fewer bits to the next GOP.
From the perspective of vbv buffer occupancy (or fullness), the TM rate control attempts to force the vbv occupancy to the same level at the beginning of each GOP. The vbv buffer occupancy level is pulled to a predetermined level periodically at the beginning of a GOP. The controller receives an indication of vbv buffer occupancy and allocates bits to the succeeding frames such that the desired vbv buffer occupancy is predicted to occur at the end of the GOP. This often means that only a relatively small number of bits can be allocated to code frames which occur near the end of a GOP. To make the bit allocations, the controller 20 assumes that all frames of the same type (I, P or B) have the same number of bits.
The actual number of bits used by the encoder 16 to code a frame generally differs from the number of bits allocated by the controller 20. The deviation may be small or may be large, if, for example, there is a scene change and predictive coding cannot be used. The bit allocations provided by the controller for a set of frames are viewed by the encoder as targets which are updated frequently rather than hard and fast requirements. For example, an encoder may respond to an allocation by the controller by increasing or decreasing a quantization step size to increase or decrease the number of encoded bits for a frame. After each particular frame is actually encoded, the allocations for succeeding frames are updated by the controller, based on how many bits are actually used to encode the particular frame.
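The interplay between allocated targets and actual bit counts can be illustrated with a deliberately simplified allocator. The sketch below is not the Test Model algorithm; it merely assumes that the bits remaining in a GOP are divided among the remaining frames in proportion to fixed per-type weights, and that the targets are recomputed after each frame is actually encoded. The names `allocate_targets` and `encode_gop`, the weights, and the bit counts are all hypothetical.

```python
def allocate_targets(remaining_bits, remaining_types, weights):
    """Split the remaining GOP budget among the remaining frames
    in proportion to a fixed weight per picture type."""
    total_weight = sum(weights[t] for t in remaining_types)
    return [remaining_bits * weights[t] / total_weight for t in remaining_types]


def encode_gop(gop_types, gop_budget, weights, actual_bits):
    """Return (type, target, actual) per frame, updating the remaining
    budget with the bits actually used after each frame."""
    remaining = gop_budget
    history = []
    for i, frame_type in enumerate(gop_types):
        target = allocate_targets(remaining, gop_types[i:], weights)[0]
        history.append((frame_type, target, actual_bits[i]))
        remaining -= actual_bits[i]
    return history


if __name__ == "__main__":
    weights = {"I": 4, "P": 2, "B": 1}
    history = encode_gop(
        gop_types=["I", "P", "B", "B"], gop_budget=800_000,
        weights=weights, actual_bits=[500_000, 150_000, 75_000, 75_000],
    )
    for row in history:
        print(row)
    # The I frame overshoots its 400,000-bit target by 100,000 bits, so the
    # P target falls from 200,000 (2/8 of the budget) to 150,000.
```

The example shows the behavior noted in the text: an overshoot on an early frame is recovered by shrinking the targets of the frames that follow it in the GOP.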
A non-linear video editing system 40 is shown in FIG. 2A. Video is inputted through a Video Tape Recorder (VTR) 42 or other device to a video compressor 44, compressed, and stored to a device 46 which can be accessed in a non-linear fashion, such as a magnetic disk drive. The compressor 44 may be implemented by the Encoder of FIG. 1. Multiple video clips are stored in the non-linear storage device 46. The clips can be edited together into a video sequence which is decoded using the decompressor 45 and outputted to a VTR or other storage device. Alternatively, a compressed version of the edited video may be outputted for decompression and display on a separate system.
The editing of the video is typically performed by an operator using a computer 48 which receives decompressed video from the decompressor 45. Editing operations include adding effects to single clips (such as fade up from black, fade to black, blur, warp, etc.), adding effects to multiple clips (dissolve from one clip to another, wipe from one clip to another, picture-in-picture, etc.), cutting between clips, etc.
Most non-linear editors today use intra-frame only compression techniques, such as motion JPEG. However, because of the greater compression efficiency and interoperability of inter-frame compression standards such as MPEG, the industry is moving towards using inter-frame non-linear editors.
However, using inter-frame compression in a non-linear editor provides the system designer with many challenges. One is limited random access: a compressed video stream can only be decompressed starting from a random access point (a sequence header followed by an I frame), and typically random access points are not placed at every frame. Also, a video clip can only begin at a so-called “closed GOP”, that is, a random access point where all B frames immediately following the first I frame use backwards-only prediction or intra motion compensation. Moreover, if a bit stream is truncated before a B frame (in encode order), then the decoded bit stream will contain gaps in the display sequence of frames, so a bit stream can only be terminated before a reference frame (in encode order). (Note that the condition that a bit stream can only be terminated before a reference frame in encode order is equivalent to the condition that a bit stream can only be terminated after a reference frame in display order.) Finally, when two video clips are spliced together, the resulting clip will in general cause decoder buffer exceptions.
The first two problems are mitigated by placing closed GOPs at regular intervals in a compressed bit stream. Because these are typically not placed at every frame, a non-linear editing system will typically re-encode several frames at each splice point. An example is shown in FIG. 2B, where a transition is created between two video streams. In this example, the end of the video stream “A” and the beginning of video stream “B” are decoded, and a transition (such as a dissolve) is performed on the decoded frames. The processed frames (the result of the dissolve) are then compressed, and a bit stream is formed by concatenating the compressed version of video A prior to the transition, the compressed version of the transition, and the compressed version of video B after the transition. As shown in FIG. 2B, the compressed transition area will in general include several frames from video A that are not truly part of the transition (because the compressed video stream A can only be truncated before a reference frame in encode order, i.e., after a reference frame in display order), and several frames from video B that are not truly part of the transition (because the compressed part of video B that is kept must start at a closed GOP).
Recompression is also required when an effect is added to part of a video sequence, as shown in FIG. 3. Again, extra frames are taken away from the pre-effect part of the sequence A0 and the post-effect part of the sequence A1 and added to the compressed effect clip due to the conditions on ending and starting a bit stream.
Even when bit streams are concatenated so that the last display frame in a first bit stream is a reference frame and the second bit stream begins with a closed GOP, and even if each individual bit stream would play back smoothly, the concatenated bit stream may not play back smoothly. The reason is that the decoder buffer may underflow or overflow when playing back the concatenated bit stream.
To understand how a decoder buffer may underflow or overflow, consider an MPEG-2 bit stream. Annex C of the MPEG-2 video specification describes how a hypothetical decoder buffer (the “vbv” described above) behaves when decoding a bit stream, and requires that bit streams be constructed so that when this hypothetical decoder decodes them its buffer will neither underflow nor overflow. It is understood that actual decoders will behave slightly differently, but the constraints on the bit streams (that the hypothetical decoder neither underflow nor overflow) will be used to make sure that the buffers in real decoders neither overflow nor underflow.
The hypothetical decoder described by the MPEG-2 specification periodically removes compressed pictures from its buffer and decodes them. The first picture is removed a fixed amount of time after it begins to enter the decoder buffer. If the bit stream is a constant bit rate bit stream, bits enter at a constant rate. This situation is illustrated in FIG. 4 which is a plot of vbv fullness (i.e., vbv occupancy) as a function of time. The size of the hypothetical decoder buffer and the bit rate are both coded into the compressed bit stream. The amount of time that the decoder waits from when it starts putting a picture into this buffer until it removes the picture from its buffer, called in MPEG-2 syntax the vbv_delay, is also coded into the bit stream for each picture. Note that for a constant bit rate stream, the number of bits in the decoder buffer before a picture is removed is proportional to the vbv_delay for that picture (the bit rate is the constant of proportionality). Many techniques are known to make sure that encoders produce bit streams that do not result in decoder buffer underflows or overflows; these techniques generally control the number of bits used by varying the quantization level between pictures and within pictures. Note that for a reasonably large decoder buffer, this model allows for wide variations in the number of bits in each frame, so I frames can use many more bits than P and B frames (as typically occurs), and unexpected variations in sequence content can be accommodated. For example, if the video suddenly gets more complex and the encoder uses more bits than planned on a particular picture (but not so many as to cause a decoder buffer underflow), then the encoder slightly reduces the number of bits used in the next several pictures to make up for these “lost” bits.
The buffer fullness just before a picture is removed from the decoder can be calculated from its vbv_delay; it is equal to the vbv_delay times the bit rate. Thus, if two bit streams are to be spliced, the decoder buffer can be prevented from underflowing or overflowing if the decoder buffer fullness one picture time after the last picture of the first bit stream is removed from the decoder buffer equals the buffer fullness just before the first picture from the second bit stream is removed from the buffer (which can be determined from its vbv_delay). Thus, the encoder matches buffer fullness to make a seamless splice.
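The matching condition for a seamless constant bit rate splice may be sketched as below. The sketch assumes that vbv_delay is expressed in periods of a 90 kHz clock, as in the MPEG-2 syntax; the function names, the optional `tolerance_bits` parameter, and the numeric values are hypothetical and serve only to make the arithmetic concrete.

```python
SYSTEM_CLOCK_HZ = 90_000  # vbv_delay is counted in 90 kHz clock periods


def fullness_from_vbv_delay(vbv_delay_ticks, bit_rate):
    """Buffer fullness (bits) just before a picture is removed: delay x rate."""
    return vbv_delay_ticks * bit_rate / SYSTEM_CLOCK_HZ


def ending_fullness(fullness_before_last, last_picture_bits, bit_rate,
                    picture_rate):
    """Fullness one picture time after the last picture is removed."""
    return fullness_before_last - last_picture_bits + bit_rate / picture_rate


def is_seamless_splice(end_fullness_first, vbv_delay_second, bit_rate,
                       tolerance_bits=0):
    """True if the first stream's ending fullness matches the fullness
    implied by the vbv_delay of the second stream's first picture."""
    expected = fullness_from_vbv_delay(vbv_delay_second, bit_rate)
    return abs(end_fullness_first - expected) <= tolerance_bits


if __name__ == "__main__":
    bit_rate = 3_000_000
    end = ending_fullness(400_000, 200_000, bit_rate, picture_rate=30)
    print(end)                                        # 300000.0
    print(fullness_from_vbv_delay(9_000, bit_rate))   # 300000.0
    print(is_seamless_splice(end, 9_000, bit_rate))   # True
```

Here a vbv_delay of 9,000 clock periods (0.1 second) at 3,000,000 bits/sec implies 300,000 bits in the buffer, which is exactly what the first stream leaves behind one picture time after its last picture is removed.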
Conventional methods for controlling the bit rate of encoded material are rather imprecise, which is acceptable when not splicing. For example, if the encoder allocates a number of bits to a picture, and if it spends those bits at a given quantization step size, the picture might wind up using many more bits than allocated, but because of the large decoder buffer, these bits can be slowly made up (as mentioned). The quickness of the reaction can be increased, but this can lead to poor quality video (due to big variations in quantization step size).
When matching decoder buffer fullness, conventional rate control techniques will often lead to very poor quality video. Because the ending buffer fullness is set, reactions must be very fast (both between pictures and within pictures), so that too many bits will not be used (which would cause the resulting decoder buffer fullness to be too low). Using too few bits is not a problem, because zero stuffing can be used to “make” more bits. Accordingly, if even slightly more bits than planned are used on part of a picture, the encoder will start to rapidly increase quantization step size. If slightly fewer bits than planned are used on another part, the quantization step size will be reduced, but the damage to picture quality will have been done. (The same situation occurs between pictures.)
The MPEG-2 syntax allows for variable bit rate encoding as well, which is signaled by setting each vbv_delay to a particular value (i.e., 65535/90000 sec.) which is not used for constant bit rate. When variable bit rate is used, the decoder buffer fills up completely, and then each picture is removed periodically. If the decoder buffer is not full, bits enter the decoder buffer at the maximum bit rate (specified in the bit stream), but when the decoder buffer is full no bits enter. With this mode, the encoder does not have to worry about decoder buffer overflow, but it does have to worry about decoder buffer underflow. The decoder buffer behavior for variable bit rate is illustrated in FIG. 5.
When this MPEG-2 variable bit-rate syntax is used, if two bit streams are spliced together the resulting bit stream will not cause decoder buffer overflows. This is true because no bit streams can cause decoder buffer overflows. However, if the ending buffer fullness of the first stream is too low, the resulting spliced stream may cause a decoder buffer to underflow. Unlike the case for constant bit rate, there is no indication in the bit stream of what the decoder buffer fullness will actually be; indeed, there cannot be, because the actual buffer fullness before a picture is removed depends on when decoding began. Consider a bit stream with three pictures; picture 1 is 500,000 bits, and pictures 2 and 3 are 200,000 bits, and assume a bit rate of 100,000 bits/sec and a decoder buffer size of 1,000,000 bits. If we start decoding from picture 1, the buffer fullness just before picture 3 is removed is 1,000,000 − 500,000 + 100,000 − 200,000 + 100,000 = 500,000. But if we start decoding from picture 2 it is 1,000,000 − 200,000 + 100,000 = 900,000. Therefore, splicing into a variable bit rate bit stream requires either ensuring that the ending buffer fullness of the stream before the splice is full (which can cause poor quality because few bits are used) or calculating, from the size of every picture remaining in the bit stream (after the splice), what ending buffer fullness will not cause underflows (which can be time consuming or impractical).
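The three-picture example can be reproduced with a small model of the variable bit rate buffer. The sketch assumes, as the arithmetic of the example implies, that 100,000 bits enter the buffer during each picture time, that decoding begins with a full buffer, and that no bits enter while the buffer is full; the function name `vbr_fullness_trace` and its parameters are hypothetical.

```python
def vbr_fullness_trace(picture_bits, vbv_size, bits_per_picture_time,
                       start_fullness=None):
    """Return the buffer fullness just before each picture is removed.

    Decoding starts from a full buffer unless start_fullness is given.
    Raises ValueError if a picture is larger than the bits available.
    """
    fullness = vbv_size if start_fullness is None else start_fullness
    trace = []
    for index, size in enumerate(picture_bits):
        if size > fullness:
            raise ValueError(f"buffer underflow at picture {index}")
        trace.append(fullness)
        # Bits enter at the maximum rate, but never beyond a full buffer.
        fullness = min(vbv_size, fullness - size + bits_per_picture_time)
    return trace


if __name__ == "__main__":
    pictures = [500_000, 200_000, 200_000]
    from_picture_1 = vbr_fullness_trace(pictures, 1_000_000, 100_000)
    from_picture_2 = vbr_fullness_trace(pictures[1:], 1_000_000, 100_000)
    print(from_picture_1[-1])  # 500000
    print(from_picture_2[-1])  # 900000
```

The last entry of each trace is the fullness just before picture 3 is removed, and the two values differ solely because decoding started at different pictures.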
In view of the foregoing, it is an object of the invention to provide a method and system for editing video which overcomes the above-described problems. In particular, it is an object of the invention to provide a method for processing a video bit stream so that its ending vbv fullness is above a threshold, which threshold is chosen sufficiently large so that when a subsequent video stream is concatenated to the original video bit stream, the subsequent video bit stream can be decoded without a vbv exception.
In accordance with a first embodiment of the invention, a method for editing video is provided. In accordance with the invention, a previously compressed first digital video bit stream is decoded to obtain a decoded digital video signal. In response to statistical values which characterize the previously compressed first digital video bit stream, the decoded digital video signal is re-encoded to form a second digital video bit stream such that an ending fullness of a vbv does not fall below a predetermined threshold. Optionally, an effect may be added to the decoded digital video signal before re-encoding. The statistics are preferably gathered while encoding the previously compressed digital video bit stream or while decoding the previously compressed digital video bit stream.
A second embodiment of the invention is directed to a method for splicing a first compressed digital video bit stream and a second compressed digital video bit stream. The first compressed digital video bit stream has a plurality of entry points. Each of the entry points has an associated threshold buffer fullness, such that if an actual vbv fullness just before removal of the bits of a first picture following the entry point equals or exceeds the associated threshold fullness, the portions of the first compressed digital bit stream following the entry point may be decoded without causing the vbv to underflow. Using an encoder, the second compressed digital video bit stream is generated. The second compressed digital video bit stream results in an ending fullness of a vbv one picture time after removal of the bits corresponding to a last picture of the second compressed digital video bit stream. This ending fullness equals or exceeds the threshold fullness associated with one of the entry points. The first and second compressed digital video bit streams are then spliced so that the last picture of the second compressed digital bit stream is immediately followed by said one of the entry points in the first compressed digital video bit stream.
A third embodiment of the invention provides a method and system for determining and recording a minimal ending vbv fullness at each of a plurality of entry points in a compressed variable bit rate video bit stream.
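One way such minimal fullness values might be computed, offered purely as an illustrative sketch and not as the claimed method, is a single backward pass over the picture sizes of a variable bit rate stream. The sketch assumes a fixed number of bits entering per picture time and a stream that is decodable when started from a full buffer; the names `minimal_entry_fullness` and `can_splice` and the way entry points are indexed are hypothetical.

```python
def minimal_entry_fullness(picture_bits, vbv_size, bits_per_picture_time,
                           entry_points):
    """Return {entry_index: minimal fullness just before that picture is
    removed} such that the rest of the stream decodes without underflow.

    Works backward: the fullness needed before picture i is the larger of
    its own size and what the following pictures need, less one refill.
    """
    entries = set(entry_points)
    thresholds = {}
    need = 0
    for i in range(len(picture_bits) - 1, -1, -1):
        size = picture_bits[i]
        need = max(size, need + size - bits_per_picture_time)
        if need > vbv_size:
            raise ValueError(f"stream not decodable from picture {i}")
        if i in entries:
            thresholds[i] = need
    return thresholds


def can_splice(ending_fullness, threshold):
    """True if the preceding stream ends with at least the threshold."""
    return ending_fullness >= threshold


if __name__ == "__main__":
    thresholds = minimal_entry_fullness(
        [500_000, 200_000, 200_000], vbv_size=1_000_000,
        bits_per_picture_time=100_000, entry_points=[0, 1],
    )
    print(thresholds)                              # {1: 300000, 0: 700000}
    print(can_splice(650_000, thresholds[0]))      # False
    print(can_splice(650_000, thresholds[1]))      # True
```

If such thresholds are recorded alongside the stream, a later splice needs only a comparison against the ending fullness of the preceding stream rather than a pass over every remaining picture.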