1. Field of the Invention
The present invention relates to an editing system, method and apparatus for editing images and, more particularly, an editing system, method and apparatus for seamlessly splicing a plurality of bit streams of video data.
2. Related Art
Recording/reproducing systems have recently been introduced which record/reproduce high quality audio/video data utilizing compression schemes. High quality recording/reproducing systems compression-encode/decode the audio/video data utilizing the MPEG (Moving Picture Experts Group) standard. One example of such a system is the DVD (Digital Versatile Disk or Digital Video Disk), which provides a powerful means by which unprecedented quantities of high quality audio/video are compressed on an optical disk.
FIG. 1 illustrates the general recording/reproducing system. The video encoder 111 of the encoding-side apparatus 110 encodes input video data DV in accordance with the MPEG standard to thereby produce a video elementary stream (video ES). The packetizer 112 packetizes the video elementary stream into a video packetized elementary stream (video PES) comprising access units, each access unit representing a picture in a group of pictures making up a portion of the video program. The audio encoder 113 of the encoding-side apparatus encodes input audio data DA to thereby produce an audio elementary stream (audio ES). The packetizer 114 formats the audio elementary stream into an audio packetized elementary stream (audio PES) comprising access units, each access unit representing a decodable segment of an audio bit stream. The transport stream multiplexer 115 multiplexes the audio and video packetized elementary streams to thereby produce a transport stream packet. A Video Buffer Verifier (VBV) buffer (not shown) stores/retrieves the multiplexed streams at a variable target rate which is controlled in accordance with the number of bits to be encoded and the capacity of the VBV buffer. An illustration of the Video Buffer Verifier is provided with reference to FIG. 2.
The decoding-side apparatus 120 of FIG. 1 stores in a decoding-side Video Buffer Verifier (VBV) buffer (not shown) the received transport stream which is transmitted via the transmission medium 116. The transport stream demultiplexer 121 demultiplexes the received transport stream fetched from the decoding buffer at a timing determined by a decoding time stamp (DTS) to thereby reproduce the video packetized elementary stream (video PES) and the audio packetized elementary stream (audio PES). The video packetized elementary stream is depacketized by depacketizer 122 and decoded by video decoder 123 thereby reproducing the video data DV. The audio packetized elementary stream is depacketized by depacketizer 124 and decoded by audio decoder 125 thereby reproducing the audio data DA. For DVD applications, the transport stream multiplexer 115 and the transport stream demultiplexer 121 are respectively replaced with a program stream multiplexer and demultiplexer which DVD format/unformat the encoded bit streams.
In the recording/reproducing system of FIG. 1, it is desirable to seamlessly splice a plurality of bit streams by concatenating at the transport level two or more different elementary streams representing the merger of different video programs. In digital broadcasting, for example, editors at a broadcasting station splice a plurality of bit streams from different video sources such as, for example, live video feeds received from local stations for generating a spliced broadcast video program. In DVD applications, the director splices movie scenes to be recorded on the DVD optical disk. In another DVD application, the DVD decoder splices multiple bit streams reproduced from the DVD optical disk in response to user-entered actions, which is particularly useful for generating alternate scenes for interactive movies and video games.
There are, however, unforeseen difficulties in splicing a plurality of bit streams using the MPEG compression standard. In order to illuminate the problem, a closer look at MPEG is warranted. In summary, the MPEG standard implements a compression process which includes motion-compensated predictive coding in conjunction with adaptive Discrete Cosine Transform (DCT) quantization. The motion-compensated predictive coding predicts motion in each image frame/field using both unidirectional and bidirectional motion prediction. The DCT quantization adaptively compresses each frame/field in accordance with the motion-compensated prediction. The term “frames” hereinafter refers to pictures in general including frames as well as fields.
As illustrated in FIG. 3(a), motion-compensated prediction of the MPEG compression standard classifies the frames into one of three types: intracoded-frames (I-frames), predictively coded frames (P-frames) and bi-directionally coded frames (B-frames). MPEG establishes the I-frames as the reference by which the B- and P-frames are encoded and, thus, preserves the I-frames as complete frames. The I-frames are considered “intra-coded” since they proceed as complete frames, having bypassed the motion-compensated prediction, to the DCT quantization whereupon each I-frame is compression encoded with reference only to itself. P-frames, which rely on forward temporal prediction, are coded using the previous I- or P-frame. B-frames are coded using bi-directional (forward and/or backward) motion compensated predictive encoding using the two adjacent I- and/or P-frames. B- and P-frames are considered “inter-coded” since they are motion-prediction encoded with reference to other frames. FIG. 7 illustrates an example of the direction of prediction for each I-, B- and P-frame in a group of pictures (GOP) as indicated by the arrows in the figure.
In accordance with the MPEG standard, frames are arranged in ordered groups of pictures (GOP), each group of pictures comprising a closed set of I-, B- and P-frames which are encoded with reference to only those frames within that group. FIG. 3(a) illustrates the natural presentation order (1 to 15) of the GOP in which the pictures are naturally presented to the viewer. Since the B- and P-frames within the GOP are encoded with reference to other frames, the MPEG standard dictates that the natural presentation order shown in FIG. 3(a) be rearranged into the decoding order shown in FIG. 3(b) in which the frames are to be decoded and transmitted in the coded order shown in FIG. 3(c). With this arrangement, the frames necessary for decoding other frames are first decoded to provide the basis upon which the following inter-coded frames are decoded. For example, an I-frame which forms the reference by which the following frames in the GOP are motion-compensation predicted is positioned first in the decoding order. Once decoded, the pictures are rearranged in their natural presentation order for display to the viewer.
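The rearrangement from presentation order to coded order described above can be sketched in a few lines. This is a minimal illustration only, not the encoder of FIG. 1: the function name `to_coded_order` and the frame labels are hypothetical, and the sketch assumes that every B-frame in the input is eventually followed by an I- or P-frame that serves as its backward reference.

```python
def to_coded_order(presentation):
    """Reorder frame labels such as "B1", "I3", "P6" from presentation
    order into coded order.

    Assumption (not from the source text): a B-frame is coded only after
    the next I- or P-frame in presentation order, since that frame is
    needed as a reference when the B-frame is decoded.
    """
    coded = []
    pending_b = []  # B-frames waiting for their following reference frame
    for frame in presentation:
        if frame[0] == "B":
            pending_b.append(frame)
        else:
            # An I- or P-frame goes out first, then the B-frames that
            # precede it in presentation order.
            coded.append(frame)
            coded.extend(pending_b)
            pending_b = []
    coded.extend(pending_b)  # trailing B-frames with no later reference
    return coded


presentation = ["B1", "B2", "I3", "B4", "B5", "P6", "B7", "B8", "P9"]
print(to_coded_order(presentation))
# ['I3', 'B1', 'B2', 'P6', 'B4', 'B5', 'P9', 'B7', 'B8']
```

In this sketch the I-frame I3 moves ahead of B1 and B2, which is consistent with the statement that the reference I-frame is positioned first in the decoding order.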
Motion-compensated predictive coding divides each I-, B- and P-frame into 16×16 pel macroblocks. The motion vectors for a present frame are motion-compensation predicted with reference to the motion vectors of another frame which is selected in accordance with the direction of prediction of the type of frame (e.g., I-, B- or P-frame). For example, P-frame macroblocks are motion-predicted with reference to the macroblocks in a previous I- or P-frame; B-frame macroblocks are motion-predicted with reference to the previous/successive I- and/or P-frames. The I-frames, which are not inter-coded, bypass motion compensation and are directly DCT quantized.
The process for motion-predicting a current picture in a GOP is illustrated in FIGS. 4(a)–(e). The GOP are input in the natural presentation order shown in FIG. 4(a), rearranged in accordance with the decoding order shown in FIG. 4(b), motion-predicted utilizing two frame memories (FM1, FM2) as shown in FIGS. 4(c) and (d) and output in the form of the encoding stream (ES) shown in FIG. 4(e). For example, the I-frame (I3) of FIG. 4(b) is intra-coded and, therefore, output directly to the encoding stream (ES); the B-frame (B1) of FIG. 4(b) is motion predicted with reference to the I-frame (I3) stored in the first frame memory (FM1) of FIG. 4(c) and the P-frame (P) stored in the second frame memory (FM2) of FIG. 4(d); the P-frame (P6) of FIG. 4(b) is motion predicted with reference to the I-frame (I3) stored in the first frame memory (FM1) of FIG. 4(c). From the foregoing illustration, it is apparent that a minimum of two frame memories are needed for bi-directional motion prediction.
After the motion vectors are calculated, each macroblock is Discrete Cosine Transform (DCT) encoded. More particularly, the macroblocks are transformed from pixel domain to the DCT coefficient domain. Next, adaptive quantization is performed on each block of DCT coefficients in accordance with a variable quantization step size. After adaptive quantization is applied to the DCT coefficients, the coefficients undergo further compression involving such techniques as differential coding, run-length coding or variable length coding. The encoded data is stored/retrieved to/from the Video Buffer Verifier (VBV) buffer at a controlled target bit rate in the form of a serial bit stream.
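The transform and quantization steps can be illustrated with a small sketch. It is an assumption-laden example rather than the MPEG-specified procedure: it applies a textbook orthonormal two-dimensional DCT-II to a single 8×8 block and divides every coefficient by one uniform step size, whereas MPEG uses quantization matrices and an adaptively varied scale. The names `dct_2d`, `quantize` and the step size of 16 are hypothetical.

```python
import math

N = 8  # block size used for the DCT in this sketch


def dct_2d(block):
    """Orthonormal 2-D DCT-II of an N x N block (list of lists)."""
    def scale(k):
        return math.sqrt(1.0 / N) if k == 0 else math.sqrt(2.0 / N)

    coeffs = [[0.0] * N for _ in range(N)]
    for u in range(N):
        for v in range(N):
            total = 0.0
            for x in range(N):
                for y in range(N):
                    total += (block[x][y]
                              * math.cos((2 * x + 1) * u * math.pi / (2 * N))
                              * math.cos((2 * y + 1) * v * math.pi / (2 * N)))
            coeffs[u][v] = scale(u) * scale(v) * total
    return coeffs


def quantize(coeffs, step):
    """Uniform quantization: a larger step discards more precision."""
    return [[round(value / step) for value in row] for row in coeffs]


# A flat block (every pel equal to 100) has energy only in the DC term.
flat_block = [[100] * N for _ in range(N)]
levels = quantize(dct_2d(flat_block), step=16)
print(levels[0][0])  # 50
print(sum(1 for row in levels for value in row if value != 0))  # 1
```

Because all but one quantized coefficient is zero for this block, the subsequent run-length and variable length coding mentioned above can represent it very compactly.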
FIG. 2 illustrates a locus of the data occupancy of the VBV buffer wherein the bits (ordinate) of the I-, B- and P-frames are stored in the VBV buffer along a time axis (presentation time Tp-abscissa) at a transmission bit rate (inclination 131) and output from the VBV buffer as indicated by the vertical lines. The VBV buffer is considered a “virtual” buffer because it emulates the buffer on the decoding side. By controlling the amount of bits 132 of the VBV buffer on the encoding side, it can be assured that the appropriate amount of bits per decoding time stamp (DTS), i.e. target bit rate, is transmitted to the decoding side. This is important in MPEG where the number of bits for a particular frame varies depending upon the motion-prediction type. The I-frames in FIG. 2, for example, require four times the amount of storage time (VBV buffer delay) as the P-frames and twice the B-frames. For that matter, care must be taken that the varied amount of bits in a GOP does not cause an overflow when the number of bits exceeds the buffer capacity (upper-hatched line) or an underflow when the number of bits drops below a predetermined minimum number (lower-hatched line) which will sustain an efficient encoding/decoding process.
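The sawtooth locus of buffer occupancy can be approximated with a simple model. The following is a sketch under stated assumptions, not the normative VBV model: bits arrive at a constant amount per frame period, each frame's bits are removed instantaneously at its decoding time, and all figures (capacity, fill amount, frame sizes) are arbitrary values chosen for illustration. The name `simulate_vbv` is hypothetical.

```python
def simulate_vbv(frame_bits, fill_per_period, capacity, initial, min_level=0):
    """Return (trace, events) for a simplified VBV model.

    trace  : occupancy after each frame has been removed for decoding
    events : (frame index, "overflow" | "underflow") tuples
    """
    occupancy = initial
    trace, events = [], []
    for index, bits in enumerate(frame_bits):
        occupancy += fill_per_period        # inclined segment of the locus
        if occupancy > capacity:
            events.append((index, "overflow"))
            occupancy = capacity            # excess bits are lost in this model
        occupancy -= bits                   # vertical drop at decoding time
        if occupancy < min_level:
            events.append((index, "underflow"))
        trace.append(occupancy)
    return trace, events


# Hypothetical GOP in coded order: one large I-frame, then B- and P-frames.
gop = [600, 100, 100, 200, 100, 100]
trace, events = simulate_vbv(gop, fill_per_period=200, capacity=1000, initial=500)
print(trace)   # [100, 200, 300, 300, 400, 500]
print(events)  # []
```

In this example the large I-frame pulls the occupancy down sharply and the smaller frames let it recover, so the GOP as a whole stays between the two limits. An encoder-side rate control would adjust the frame sizes to keep the trace inside those limits.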
Referring to FIGS. 5(a)–(c), the decoding process for decoding the transmitted group of pictures (GOP) is explained. The coded order shown in FIG. 5(a) is received by the decoding side apparatus 120 (FIG. 1) and stored in the decoding-side VBV buffer. The transport stream demultiplexer 121 demultiplexes the stream into the packetized elementary stream illustrated in FIG. 5(b). The GOP are decoded by fetching the compressed picture data from the decoding-side buffer at a timing determined by the decoding time stamp (DTS), de-compressing the fetched picture data and reconstructing each I-, B- and P-frame from the decompressed picture data. It will be appreciated that the I-frames are complete upon decompression. The B- and P-frames are reconstructed by motion estimating the previously decoded frames based on the decompressed motion vectors of the current B- or P-frame. Afterwards, the decoded frames are rearranged in their original presentation order for display as shown in FIG. 5(c).
When it is considered that the decoding-side apparatus requires relatively less hardware complexity than the encoding-side, the wisdom of the MPEG encoding/decoding scheme will be immediately recognized. To explain, the complex hardware necessary to perform motion prediction is not a part of the decoding-side apparatus since the decoder need only apply the motion vectors to the encoded pictures. The high quality audio/video is, thus, generated by a high-end encoder for distribution en masse to numerous, considerably less-complex (and less-expensive) decoders.
The motion decoding process is illustrated in FIGS. 6(a)–(d) wherein FIG. 6(a) shows the coded video elementary stream (ES) which is supplied to the decoder. A first frame memory (FM1) as illustrated in FIG. 6(b) stores a first previously-decoded picture for decoding the current picture. A second frame memory (FM2) as illustrated in FIG. 6(c) stores a second previously-decoded picture for decoding the current picture. For example, the decoded I-frame (I3) (first picture in the ES of FIG. 6(a)) is stored in the first frame memory (FM1) and the P-frame (previous ES) is stored in the second frame memory (FM2). In this example, the B-frame (B1) is decoded by motion estimating the frames in the frame memories (FM1, FM2) based on the motion vectors of B1. The decoded GOP are output in the presentation order illustrated in FIG. 6(d).
With the rudiments of the MPEG standard explained, the difficulties confronted when splicing coded streams will be better appreciated. In the conventional editing system for splicing bit streams, it is recognized that the bit streams must be decoded. This is because the prediction direction of the first stream may be inconsistent with that of the second. To explain, the selected direction of prediction (forward/backward) for the B-frames mutually affects the prediction direction of other B-frames and, for that matter, defines which frames are selected for the motion prediction throughout the GOP. When two coded bit streams are spliced arbitrarily, for example, the prediction direction for a frame in the first coded bit stream may be decoded with reference to a frame with an inconsistent prediction direction in the second coded bit stream. For this reason, motion estimation upon decoding in the area of the splicing point will result in reconstructing an incorrect picture. The error, referred to as a discontinuity, migrates to other frames in motion estimation, consequently affecting the motion estimation decoding of the GOP as a whole. This discontinuity manifests as visible macroblocks on the display when, for example, the channel of a digital television is changed.
In order to prevent discontinuity, it is suggested to decode the bit streams before splicing. When the bit streams are decoded, the frames thereof are not motion predicted, i.e., not encoded with reference to other frames and thus are not subject to the discontinuity of the foregoing method. However, the spliced bit stream must be re-encoded. Since MPEG coding is not a 100% reversible process, the signal quality is deteriorated when re-encoding is performed. The problem is compounded because the re-encoding process encodes a decoded signal, i.e., a degraded version of the original audio/video signal.
A splicing technique which addresses signal deterioration selectively decodes the bit streams at a splicing point. However, such a splicing technique produces unsatisfactory results. The first problem arises in the presentation order of the spliced stream which may be understood with reference to FIGS. 8(a)–(d) to 9(a)–(d). FIGS. 8(a)–(d) illustrate the ideal case where no problems arise in the presentation order of the spliced stream STSP. In this case, stream STA of FIG. 8(a) is spliced at the splicing point SPA with stream STB of FIG. 8(b) at the splicing point SPB. Thus, the spliced bit stream STSP of FIG. 8(c) presents the pictures of stream STA followed by the pictures of stream STB without problem.
FIGS. 9(a) to (d) illustrate the problem where the decoder rearranges the presentation order of the spliced bit stream. Stream STA of FIG. 9(a) is bit-spliced with stream STB of FIG. 9(b) at respective splicing positions (SPA, SPB). Unlike the ideal case, the decoder on the decoding-side rearranges the order of presentation of the frames of the spliced bit stream STSP (FIG. 9(c)) such that, in this example, the last frame (P-frame) in bit stream STA is inserted at the third-picture position of stream STB. This appears visually as an arbitrary picture inserted in the video program.
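One way such a misplaced picture could arise is sketched below. The scenario is hypothetical and is not taken from FIGS. 9(a) to (d): it assumes stream STA is cut immediately after a P-frame, stream STB is cut in the middle of a GOP so that it begins with B-frames, and the decoder uses a simple reordering rule in which each I- or P-frame is held back until the next I- or P-frame arrives. The function name `to_presentation_order` and the frame tuples are illustrative assumptions.

```python
def to_presentation_order(coded):
    """Reorder (stream, type, number) tuples from coded order to
    presentation order using a simple decoder-side rule (an assumption
    of this sketch): B-frames are shown as soon as they are decoded,
    while an I- or P-frame is held until the next I- or P-frame arrives.
    """
    shown = []
    held = None  # the most recent I- or P-frame, not yet displayed
    for frame in coded:
        if frame[1] == "B":
            shown.append(frame)
        else:
            if held is not None:
                shown.append(held)
            held = frame
    if held is not None:
        shown.append(held)
    return shown


# Hypothetical splice: STA ends right after its P-frame, STB starts mid-GOP.
sta = [("A", "I", 3), ("A", "B", 1), ("A", "B", 2), ("A", "P", 6)]
stb = [("B", "B", 4), ("B", "B", 5), ("B", "P", 9), ("B", "B", 7), ("B", "B", 8)]

for stream, kind, number in to_presentation_order(sta + stb):
    print(stream, kind + str(number))
```

Under these assumptions the held P-frame of STA is released only after the first two pictures of STB have been displayed, so it appears as the third picture of the STB portion of the program.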
The second problem, hereinafter termed “crossover”, arises in motion estimation upon decoding of the spliced bit stream. In the ideal case illustrated in FIGS. 10(a), (b) the motion estimation reconstructs the pictures of stream STA of the spliced bit stream STSP of FIG. 10(a) with reference to only those frames from that stream. This is indicated by the arrows in FIG. 10(b) which represent the motion estimation direction. Likewise, stream STB is motion estimated with reference to only those pictures in that stream.
FIGS. 11(a) and (b) illustrate the problem of crossover motion estimation. For example, the P-frame in stream STB is based on frames in stream STA as illustrated by the hatched arrows labeled “NG” in FIG. 11(b). Thus, the P-frame in stream STB is reconstructed from the wrong picture which appears visually as a distorted image. This problem is propagated through the GOP as shown in FIGS. 12(a), (b) when the incorrectly-estimated P-frame of stream STB is utilized by the decoder to motion estimate other frames. This results in a number of distorted pictures which are quite noticeable.
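The propagation of a crossover error can be traced with a small model of the decoder's reference selection. This is a sketch under assumptions, not the apparatus of the figures: a P-frame is taken to reference the most recently decoded I- or P-frame, a B-frame the two most recently decoded I- or P-frames, and a frame is counted as distorted if any of its references belongs to the other stream or is itself distorted. The function name `find_distorted` and the example splice are hypothetical.

```python
def find_distorted(coded):
    """coded: list of (stream, frame_type) tuples in coded order.
    Returns the indices of frames reconstructed from a wrong reference.
    """
    anchors = []        # indices of the last two decoded I-/P-frames
    distorted = set()
    for index, (stream, frame_type) in enumerate(coded):
        if frame_type == "I":
            references = []             # intra-coded, no reference needed
        elif frame_type == "P":
            references = anchors[-1:]   # previous I- or P-frame
        else:
            references = anchors[-2:]   # both frame memories
        for ref in references:
            if coded[ref][0] != stream or ref in distorted:
                distorted.add(index)
        if frame_type in ("I", "P"):
            anchors = (anchors + [index])[-2:]
    return sorted(distorted)


# Hypothetical splice: STA is cut after a P-frame, STB is cut mid-GOP.
spliced = [("A", "I"), ("A", "B"), ("A", "B"), ("A", "P"),
           ("B", "P"), ("B", "B"), ("B", "B"), ("B", "I"), ("B", "P")]
print(find_distorted(spliced))  # [4, 5, 6]
```

In this example the first P-frame of STB is predicted from a picture of STA, and the two B-frames that depend on it inherit the error. The distortion stops only at the next I-frame, which is decoded without reference to other frames.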
FIGS. 13(a) to 18(b) illustrate the third problem of underflow/overflow related to splicing bit streams. The ideal case is illustrated in FIGS. 13(a)–(d) wherein three streams (STA, STB, STC) are spliced at splicing points SPV and a buffer occupancy VOC. FIG. 13(a) illustrates the locus of the data occupancy of the video buffer verifier (VBV) buffer on the decoding side wherein I-, B- and P-frames are stored in the VBV buffer. FIG. 13(b) illustrates the spliced stream STSP, FIG. 13(c) the timing at which each of the pictures is generated after rearrangement and FIG. 13(d) the order of the pictures after the decoding operation. As will be appreciated from FIG. 13(a), the instant case does not present a problem of overflow (upper-hatched line) or underflow (lower-hatched line).
The problematic case is illustrated in FIGS. 14(a)–16(b). By themselves, bit streams STA, STB do not pose an overflow/underflow problem as will be appreciated from FIGS. 14(a), 15(a). However, when the bit streams STA, STB are spliced as illustrated in FIGS. 16(a), (b) at a splicing point SPV an overflow/underflow condition occurs. The overflow condition which is illustrated in FIGS. 17(a), (b) occurs when bit stream STB continues to fill the VBV buffer to a point where the VBV buffer overflows as indicated at 141 in FIG. 17(a). The underflow case which is illustrated in FIGS. 18(a) and (b) occurs when stream STB does not thereafter fill the VBV buffer by a sufficient amount thereby resulting in an underflow 142 shown in FIG. 18(b). In the decoding-side apparatus (IRD), either an overflow or underflow of the VBV buffer consequently results in a failure in decoding pictures on the decoding-side. It is not atypical to see the effects of overflow/underflow manifesting as the skipping, freezing or interruption of the images.
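The reason two individually well-behaved streams can misbehave once spliced may be seen in a toy calculation. The sketch below rests on assumptions that are not in the figures: a constant number of bits arrives per frame period, each stream was encoded for its own starting buffer occupancy, and an underflow is modeled as the picture's bits not having fully arrived at its decoding time. All constants and names (`CAPACITY`, `FILL`, `run`, `STB_LOW`) are hypothetical.

```python
CAPACITY = 1000  # hypothetical VBV buffer size, in bits
FILL = 200       # hypothetical bits arriving per frame period


def run(frame_bits, start):
    """Feed one stream segment through the buffer model.
    Returns (final occupancy, list of (problem, frame index))."""
    occupancy, problems = start, []
    for index, bits in enumerate(frame_bits):
        occupancy += FILL
        if occupancy > CAPACITY:
            problems.append(("overflow", index))
            occupancy = CAPACITY      # excess bits are lost in this model
        if bits > occupancy:
            problems.append(("underflow", index))
            occupancy = 0             # picture cannot be decoded on time
        else:
            occupancy -= bits
    return occupancy, problems


STA = [600, 100, 100, 200, 100, 100]      # encoded for a start level of 500
STB_LOW = [100, 100, 100, 100, 600, 200]  # encoded for a start level of 200

print(run(STA, 500))      # (500, [])  fine on its own
print(run(STB_LOW, 200))  # (200, [])  fine on its own

# Splice 1: all of STA, then STB_LOW, which inherits an occupancy of 500.
level, _ = run(STA, 500)
print(run(STB_LOW, level)[1])  # [('overflow', 4)]

# Splice 2: STA cut right after its large first frame, then STA again.
level, _ = run(STA[:1], 500)
print(run(STA, level)[1])      # [('underflow', 0)]
```

In both splices the second stream starts from an occupancy other than the one it was encoded for, which is the mismatch at the splicing point that the overflow and underflow cases above describe.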
Heretofore, there has been no solution for providing a seamlessly-spliced bit stream from a plurality of bit streams without the serious defects illustrated in the foregoing examples.