The present invention relates to encoding of digital video signals. In particular, a method and apparatus are presented for encoding stereoscopic digital video signals to optimize image quality while maintaining bandwidth limitations. A method and apparatus for improving image quality when editing features such as fast-forward and reverse are invoked is also presented.
Digital technology has revolutionized the delivery of video and audio services to consumers since it can deliver signals of much higher quality than analog techniques and provide additional features that were previously unavailable. Digital systems are particularly advantageous for signals that are broadcast via a cable television network or by satellite to cable television affiliates and/or directly to home satellite television receivers. In such systems, a subscriber receives the digital data stream via a receiver/descrambler that decompresses and decodes the data in order to reconstruct the original video and audio signals. The digital receiver includes a microcomputer and memory storage elements for use in this process.
However, the need to provide low cost receivers while still providing high quality video and audio requires that the amount of data which is processed be limited. Moreover, the available bandwidth for the transmission of the digital signal may also be limited by physical constraints, existing communication protocols, and governmental regulations. Accordingly, various intra-frame data compression schemes have been developed that take advantage of the spatial correlation among adjacent pixels in a particular video picture (e.g., frame).
Moreover, inter-frame compression schemes take advantage of temporal correlations between corresponding regions of successive frames by using motion compensation data and block-matching motion estimation algorithms. In this case, a motion vector is determined for each block in a current picture of an image by identifying a block in a previous picture which most closely resembles the particular current block. The entire current picture can then be reconstructed at a decoder by sending data which represents the difference between the corresponding block pairs, together with the motion vectors that are required to identify the corresponding pairs. Block-matching motion estimation algorithms are particularly effective when combined with block-based spatial compression techniques such as the discrete cosine transform (DCT).
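By way of illustration only, the block-matching step described above can be sketched as an exhaustive search that minimizes the sum of absolute differences (SAD) between a current block and candidate blocks in the previous picture. The function names, the 4x4 block size, the search range of two pixels, and the synthetic test pictures are assumptions chosen for this sketch; practical encoders use larger blocks and faster search strategies.

```python
def sad(cur, ref, cx, cy, rx, ry, n):
    """Sum of absolute differences between two n x n blocks."""
    return sum(
        abs(cur[cy + j][cx + i] - ref[ry + j][rx + i])
        for j in range(n)
        for i in range(n)
    )


def best_motion_vector(cur, ref, cx, cy, n=4, search=2):
    """Full-search block matching for the block at (cx, cy) in `cur`.

    Returns ((dx, dy), cost), where (dx, dy) is the offset of the
    best-matching block in `ref` relative to (cx, cy).
    """
    height, width = len(ref), len(ref[0])
    best = None
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            rx, ry = cx + dx, cy + dy
            # Skip candidates that fall outside the reference picture.
            if rx < 0 or ry < 0 or rx + n > width or ry + n > height:
                continue
            cost = sad(cur, ref, cx, cy, rx, ry, n)
            if best is None or cost < best[1]:
                best = ((dx, dy), cost)
    return best


def residual(cur, ref, cx, cy, mv, n=4):
    """Difference block that would be sent along with the motion vector."""
    dx, dy = mv
    return [
        [cur[cy + j][cx + i] - ref[cy + dy + j][cx + dx + i] for i in range(n)]
        for j in range(n)
    ]


# Synthetic 8x8 pictures (hypothetical data): the current picture is the
# previous picture with its content moved one pixel to the right.
ref = [[x * x + 10 * y for x in range(8)] for y in range(8)]
cur = [[ref[y][x - 1] if x > 0 else 0 for x in range(8)] for y in range(8)]

mv, cost = best_motion_vector(cur, ref, cx=3, cy=2)
diff = residual(cur, ref, 3, 2, mv)
```

In this synthetic case the search recovers the one-pixel shift exactly, so the difference block is all zeros and only the motion vector carries information.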
However, an even greater challenge is posed now by proposed stereoscopic transmission formats such as the Moving Picture Experts Group (MPEG) MPEG-2 Multi-view Profile (MVP) system, described in document ISO/IEC JTC1/SC29/WG11 N1088, entitled "Proposed Draft Amendment No. 3 to 13818-2 (Multi-view Profile)," November 1995, incorporated herein by reference. Stereoscopic video provides slightly offset views of the same image to produce a combined image with greater depth of field, thereby creating a three-dimensional (3-D) effect. In such a system, dual cameras may be positioned about two inches apart to record an event on two separate video signals. The spacing of the cameras approximates the distance between left and right human eyes. Moreover, with some stereoscopic video camcorders, the two lenses are built into one camcorder head and therefore move in synchronism, for example, when panning across an image. The two video signals can be transmitted and recombined at a receiver to produce an image with a depth of field that corresponds to normal human vision. Other special effects can also be provided.
The MPEG MVP system includes two video layers which are transmitted in a multiplexed signal. First, a base layer represents a left view of a three dimensional object. Second, an enhancement (e.g., auxiliary) layer represents a right view of the object. Since the right and left views are of the same object and are offset only slightly relative to each other, there will usually be a large degree of correlation between the video images of the base and enhancement layers. This correlation can be used to compress the enhancement layer data relative to the base layer, thereby reducing the amount of data that needs to be transmitted in the enhancement layer to maintain a given image quality. The image quality generally corresponds to the quantization level of the video data.
The MPEG MVP system includes three types of video pictures; specifically, the intra-coded picture (I-picture), predictive-coded picture (P-picture), and bi-directionally predictive-coded picture (B-picture). Furthermore, while the base layer accommodates either frame or field structure video sequences, the enhancement layer accommodates only frame structure. An I-picture completely describes a single video picture without reference to any other picture. For improved error concealment, motion vectors can be included with an I-picture. An error in an I-picture has the potential for greater impact on the displayed video since both P-pictures and B-pictures in the base layer are predicted from I-pictures. Moreover, pictures in the enhancement layer can be predicted from pictures in the base layer in a cross-layer prediction process known as disparity prediction. Prediction from one frame to another within a layer is known as temporal prediction.
In the base layer, P pictures are predicted based on previous I or P pictures. The reference is from an earlier I or P picture to a future P-picture and is known as forward prediction. B-pictures are predicted from the closest earlier I or P picture and the closest later I or P picture.
In the enhancement layer, a P-picture can be predicted from the most recently decoded picture in the enhancement layer, regardless of picture type, or from the most recent base layer picture, regardless of type, in display order. Moreover, with a B-picture in the enhancement layer, the forward reference picture is the most recently decoded picture in the enhancement layer, and the backward reference picture is the most recent picture in the base layer, in display order. Since B-pictures in the enhancement layer may be reference pictures for other pictures in the enhancement layer, the bit allocation for the P and B-pictures in the enhancement layer must be adjusted based on the complexity (e.g., activity) of the images in the pictures. In an optional configuration, the enhancement layer has only P and B pictures, but no I pictures.
The reference to a future picture (i.e., one that has not yet been displayed) is called backward prediction. There are situations where backward prediction is very useful in increasing the compression rate. For example, in a scene in which a door opens, the current picture may predict what is behind the door based upon a future picture in which the door is already open.
B-pictures yield the most compression but also incorporate the most error. To eliminate error propagation, B-pictures may never be predicted from other B-pictures in the base layer. P-pictures yield less error and less compression. I-pictures yield the least compression, but are able to provide random access.
Thus, in the base layer, to decode P pictures, the previous I-picture or P-picture must be available. Similarly, to decode B pictures, the previous P or I and future P or I pictures must be available. Consequently, the video pictures are encoded and transmitted in dependency order, such that all pictures used for prediction are coded before the pictures predicted therefrom. When the encoded signal is received at a decoder, the video pictures are decoded and re-ordered for display. Accordingly, temporary storage elements are required to buffer the data before display.
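The re-ordering described above can be illustrated with a minimal sketch for a single base layer. It assumes that pictures are labeled with strings whose first character gives the picture type, and that every B-picture references the nearest earlier and later I or P picture; the function names are hypothetical and are not taken from any standard.

```python
def display_to_coding_order(display):
    """Move each I or P picture ahead of the B-pictures that precede it
    in display order, so that both references of a B-picture are coded first."""
    coded, pending_b = [], []
    for pic in display:
        if pic[0] == "B":
            pending_b.append(pic)      # wait for the later reference picture
        else:
            coded.append(pic)          # I or P picture goes out first
            coded.extend(pending_b)    # then the B-pictures that depend on it
            pending_b = []
    coded.extend(pending_b)            # trailing B-pictures with no later reference
    return coded


def coding_to_display_order(coded):
    """Decoder-side re-ordering: hold each I or P picture until the next
    I or P picture arrives, and output B-pictures immediately."""
    display, held = [], None
    for pic in coded:
        if pic[0] == "B":
            display.append(pic)
        else:
            if held is not None:
                display.append(held)
            held = pic                 # this buffering is the temporary storage
    if held is not None:
        display.append(held)
    return display


display = ["I1", "B1", "B2", "P1", "B3", "B4", "I2"]
coded = display_to_coding_order(display)
restored = coding_to_display_order(coded)
```

Here the coded sequence is I1, P1, B1, B2, I2, B3, B4, and the decoder-side function restores the original display order while holding one I or P picture at a time.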
The MPEG-2 standard for non-stereoscopic video signals does not specify any particular distribution that I-pictures, P-pictures and B-pictures must take within a sequence in a layer, but allows different distributions to provide different degrees of compression and random accessibility. One common distribution in the base layer is to have two B-pictures between successive I or P pictures. The sequence of pictures can be, for example, I.sub.1, B.sub.1, B.sub.2, P.sub.1, B.sub.3, B.sub.4, I.sub.2, B.sub.5, B.sub.6, P.sub.2, B.sub.7, B.sub.8, I.sub.3, and so on. In the enhancement layer, a P-picture may be followed by three B-pictures, with an I-picture being provided for every twelve P and B-pictures, for example, in the sequence I.sub.1, B.sub.1, B.sub.2, P.sub.1, B.sub.3, B.sub.4, P.sub.2, B.sub.5, B.sub.6, P.sub.3, B.sub.7, B.sub.8, I.sub.2. Further details of the MPEG-2 standard can be found in document ISO/IEC JTC1/SC29/WG11 N0702, entitled "Information Technology--Generic Coding of Moving Pictures and Associated Audio, Recommendation H.262," Mar. 25, 1994, incorporated herein by reference.
FIG. 1 shows a conventional temporal and disparity video picture prediction scheme of the MPEG MVP system. The arrow heads indicate the prediction direction such that the picture which is pointed to by the arrow head is predicted based on the picture which is connected to the tail of the arrow. With a base layer (left view) sequence 150 of I.sub.b 155, B.sub.b1 160, B.sub.b2 165, P.sub.b 170, where the subscript "b" denotes the base layer, temporal prediction occurs as shown. Specifically, B.sub.b1 160 is predicted from I.sub.b 155 and P.sub.b 170, B.sub.b2 165 is predicted from I.sub.b 155 and P.sub.b 170, and P.sub.b 170 is predicted from I.sub.b 155. With an enhancement layer (right view) sequence 100 of P.sub.e 105, B.sub.e1 110, B.sub.e2 115, and B.sub.e3 120, where the subscript "e" denotes the enhancement layer, temporal and/or disparity prediction occurs. Specifically, P.sub.e 105 is disparity-predicted from I.sub.b 155. B.sub.e1 110 is both temporally-predicted from P.sub.e 105 and disparity-predicted from B.sub.b1 160. B.sub.e2 115 is temporally-predicted from B.sub.e1 110 and disparity-predicted from B.sub.b2 165. B.sub.e3 120 is temporally-predicted from B.sub.e2 115 and disparity-predicted from P.sub.b 170.
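The prediction relationships of FIG. 1 can also be written down as a small dependency table, which makes it straightforward to check whether a given coding order is valid and to see which pictures are affected by an error in a reference picture. The following is an illustrative sketch only; the short picture labels (for example "Bb1" for B.sub.b1 160) and the helper names are assumptions introduced here.

```python
from graphlib import TopologicalSorter  # Python 3.9 or later

# Each picture maps to the pictures it is predicted from, per FIG. 1.
refs = {
    "Ib": [],
    "Pb": ["Ib"],
    "Bb1": ["Ib", "Pb"],
    "Bb2": ["Ib", "Pb"],
    "Pe": ["Ib"],            # disparity prediction
    "Be1": ["Pe", "Bb1"],    # temporal + disparity prediction
    "Be2": ["Be1", "Bb2"],
    "Be3": ["Be2", "Pb"],
}


def is_valid_coding_order(order, refs):
    """True if every picture appears after all of its reference pictures."""
    seen = set()
    for pic in order:
        if any(dep not in seen for dep in refs[pic]):
            return False
        seen.add(pic)
    return len(seen) == len(refs)


def affected_by(bad, refs):
    """Pictures predicted, directly or indirectly, from the picture `bad`."""
    hit = set()
    changed = True
    while changed:
        changed = False
        for pic, deps in refs.items():
            if pic != bad and pic not in hit:
                if any(dep == bad or dep in hit for dep in deps):
                    hit.add(pic)
                    changed = True
    return hit


# One dependency order among several possible ones.
order = list(TopologicalSorter(refs).static_order())
display_order = ["Ib", "Pe", "Bb1", "Be1", "Bb2", "Be2", "Pb", "Be3"]
```

In this table an error in I.sub.b 155 reaches every other picture, whereas an error in B.sub.b1 160 reaches only the three enhancement layer B-pictures, and the display order is not a valid coding order because B.sub.b1 160 would precede P.sub.b 170.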
Generally, the base layer in the MPEG MVP system is coded according to the Main Profile (MP) protocol, while the enhancement layer is coded according to the MPEG-2 Temporal Scalability tools.
For fixed bandwidth stereoscopic video services, the output bitstream comprising the multiplex of the base and enhancement layers must not exceed a given bit rate or corresponding bandwidth. This result can be achieved with separate rate control schemes in the base and enhancement layers such that the bit rate for each layer does not exceed a given threshold, and the sum of the two bit rates satisfies the overall bandwidth requirement. Alternatively, the bit rate in each layer can be allowed to vary as long as the combined bit rate meets overall bandwidth requirements.
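The difference between the two approaches can be shown with a short numerical sketch. The bit rates and thresholds below are hypothetical values chosen only for illustration, as are the function names.

```python
def fits_separate_caps(base_bps, enh_bps, base_cap, enh_cap, total_cap):
    """Separate rate control: each layer stays under its own threshold,
    and the two thresholds together fit within the overall bandwidth."""
    return (
        base_bps <= base_cap
        and enh_bps <= enh_cap
        and base_cap + enh_cap <= total_cap
    )


def fits_joint_cap(base_bps, enh_bps, total_cap):
    """Joint rate control: only the combined bit rate is constrained."""
    return base_bps + enh_bps <= total_cap


# Hypothetical example: 6 Mbit/s channel split as 4 Mbit/s + 2 Mbit/s.
TOTAL, BASE_CAP, ENH_CAP = 6_000_000, 4_000_000, 2_000_000

# A demanding base layer picture sequence paired with a quiet enhancement layer.
separate_ok = fits_separate_caps(4_500_000, 1_000_000, BASE_CAP, ENH_CAP, TOTAL)
joint_ok = fits_joint_cap(4_500_000, 1_000_000, TOTAL)
```

With these numbers the separate scheme rejects the allocation because the base layer exceeds its own threshold, while the joint scheme accepts it because the combined rate of 5.5 Mbit/s is still within the channel.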
Moreover, the rate control scheme should also provide a relatively constant video signal quality over all picture types (e.g., I, P and B pictures) in the enhancement layer and coincide with the Video Buffering Verifier (VBV) model in the MPEG MVP system. The VBV is a hypothetical decoder which is conceptually connected to the output of an encoder. Coded data is placed in the buffer at the constant bit rate that is being used, and is removed according to which data has been in the buffer for the longest period of time. It is required that the bitstream produced by an encoder or editor does not cause the VBV to either overflow or underflow.
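The overflow and underflow conditions can be illustrated with a simplified buffer model. This sketch assumes that data arrives at a constant bit rate, that one complete picture is removed instantaneously at each picture period in first-in, first-out order, and that the buffer starts at a given fullness; it is not the normative VBV definition, and all names and numbers are hypothetical.

```python
def simulate_vbv(picture_bits, bit_rate, frame_rate, buffer_size, initial_fullness):
    """Return ("ok", None), ("underflow", index) or ("overflow", index).

    picture_bits: coded size in bits of each picture, in bitstream order.
    """
    fullness = initial_fullness
    bits_per_period = bit_rate / frame_rate
    for index, bits in enumerate(picture_bits):
        # Underflow: the picture is not fully in the buffer at decode time.
        if bits > fullness:
            return ("underflow", index)
        fullness -= bits                # picture removed for decoding
        fullness += bits_per_period     # data arriving until the next removal
        # Overflow: arriving data no longer fits in the buffer.
        if fullness > buffer_size:
            return ("overflow", index)
    return ("ok", None)


RATE, FPS = 3_000_000, 30  # 100,000 bits arrive per picture period

ok = simulate_vbv([300_000, 50_000, 50_000, 100_000], RATE, FPS,
                  buffer_size=1_000_000, initial_fullness=500_000)
under = simulate_vbv([600_000], RATE, FPS,
                     buffer_size=1_000_000, initial_fullness=500_000)
over = simulate_vbv([10_000] * 10, RATE, FPS,
                    buffer_size=600_000, initial_fullness=500_000)
```

The second case underflows because a single picture is larger than the data buffered so far, and the third case overflows because the pictures are so small that data arrives faster than it is removed.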
With conventional systems, the quality of a P-picture in the enhancement layer can vary depending on whether it is temporally-predicted or disparity-predicted. For example, for a scene with the cameras panning to the right, with a constant quantization level, a P-picture temporally-predicted from a B-picture in the enhancement layer may have a lower quality than if it were disparity-predicted from an I-picture in the base layer. This is because, as mentioned, B-pictures yield the most compression but also incorporate the most error. In contrast, the quality of a base layer P-picture is maintained since a B-picture may not be used as a reference picture in the base layer. The quality of the P-picture image corresponds to the average quantization step size of the P-picture data.
Moreover, editing operations such as fast-forward and reverse may be performed at a decoder terminal in response to commands provided by a consumer. Such editing operations can result in an encoding error since the group of pictures (GOP) or refresh period frames may be different in the base and enhancement layers, and their respective starting points may be temporally offset. The GOP consists of one or more consecutive pictures. The order in which the pictures are displayed usually differs from the order in which the coded versions appear in the bitstream. In the bitstream, the first frame in a GOP is always an I-picture. However, in display order, the first picture in a GOP is either an I-picture, or the first B-picture of the consecutive series of B-pictures which immediately precedes the first I-picture. Furthermore, in display order, the last picture in a GOP is always an I or P-picture.
Furthermore, a GOP header is used immediately before a coded I-frame in the bitstream to indicate to the decoder whether the first consecutive B-pictures immediately following the coded I-frame in the bitstream can be properly reconstructed in the case of a random access, where the I-frame is not available for use as a reference frame. Even when the I-frame is unavailable, the B-pictures can possibly be reconstructed using only backward prediction from a subsequent I or P frame.
When it is required to display a frame which does not immediately follow the GOP header, as during editing operations, synchronization between the base and enhancement layer frames may be destroyed. This can result in a discontinuity that leads to a frame freeze-up or other impairment in the resulting video image.
Accordingly, it would be advantageous to provide a rate control scheme for a stereoscopic video system such as the MPEG MVP system which adjusts the quantization level of P-pictures in the enhancement layer depending on whether the picture is being temporally-predicted or disparity-predicted. The scheme should further account for the complexity level of the encoded picture and the reference frame. The scheme should also account for data rate requirements during potential editing operations while providing a uniform image quality and avoiding frame freeze-up. The present invention provides the above and other advantages.