The present invention relates to an apparatus and method for coding stereoscopic video data. In particular, a system for estimating the optimal offset of a scene between right and left channel views at the same temporal reference point is presented. The system reduces the motion vector search range for disparity (i.e., cross-channel or cross-layer) prediction to improve coding efficiency.
Digital technology has revolutionized the delivery of video and audio services to consumers since it can deliver signals of much higher quality than analog techniques and provide additional features that were previously unavailable. Digital systems are particularly advantageous for signals that are broadcast via a cable television network or by satellite to cable television affiliates and/or directly to home satellite television receivers. In such systems, a subscriber receives the digital data stream via a receiver/descrambler that decompresses and decodes the data in order to reconstruct the original video and audio signals. The digital receiver includes a microcomputer and memory storage elements for use in this process.
The need to provide low cost receivers while still providing high quality video and audio requires that the amount of data which is processed be limited. Moreover, the available bandwidth for the transmission of the digital signal may also be limited by physical constraints, existing communication protocols, and governmental regulations. Accordingly, various intra-frame data compression schemes have been developed that take advantage of the spatial correlation among adjacent pixels in a particular video picture (e.g., frame).
Moreover, inter-frame compression schemes take advantage of temporal correlations between corresponding regions of successive frames by using motion compensation data and block-matching motion estimation algorithms. In this case, a motion vector is determined for each block in a current picture of an image by identifying a block in a previous picture which most closely resembles the current block. The entire current picture can then be reconstructed at a decoder by sending data which represents the difference between the corresponding block pairs, together with the motion vectors that are required to identify the corresponding pairs. Block matching motion estimating algorithms are particularly effective when combined with block-based spatial compression techniques such as the discrete cosine transform (DCT).
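By way of illustration only, the block-matching step described above can be sketched as an exhaustive search that minimizes the sum of absolute differences (SAD) between a block of the current picture and candidate blocks of the previous picture. The SAD criterion, the function names, the block size, the search range and the toy pictures are all assumptions chosen for the example; they are not taken from any particular coding standard or from the system described herein.

```python
def sad(cur, ref, bx, by, dx, dy, n):
    """Sum of absolute differences between the n x n block of `cur` at (bx, by)
    and the block of `ref` displaced from that position by (dx, dy)."""
    total = 0
    for y in range(n):
        for x in range(n):
            total += abs(cur[by + y][bx + x] - ref[by + dy + y][bx + dx + x])
    return total


def full_search(cur, ref, bx, by, n, search_range):
    """Return (dx, dy, cost) of the best match within +/- search_range pixels."""
    height, width = len(ref), len(ref[0])
    best = None
    for dy in range(-search_range, search_range + 1):
        for dx in range(-search_range, search_range + 1):
            # Skip candidates that fall outside the reference picture.
            if bx + dx < 0 or bx + dx + n > width:
                continue
            if by + dy < 0 or by + dy + n > height:
                continue
            cost = sad(cur, ref, bx, by, dx, dy, n)
            if best is None or cost < best[2]:
                best = (dx, dy, cost)
    return best


# Hypothetical 8x8 pictures: the current picture is the previous picture
# shifted two pixels to the right (the left edge is padded by repetition).
prev = [[(7 * x + 13 * y) % 31 for x in range(8)] for y in range(8)]
cur = [[prev[y][max(x - 2, 0)] for x in range(8)] for y in range(8)]

# Motion vector for the 4x4 block whose top-left corner is at (4, 2).
mv = full_search(cur, prev, bx=4, by=2, n=4, search_range=2)
```

In this toy case the returned vector points two pixels to the left in the previous picture with a cost of zero, so only the vector and an all-zero difference block would need to be sent for that block.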
Additionally, there has been increasing interest in proposed stereoscopic video transmission formats such as the Moving Picture Experts Group (MPEG) MPEG-2 Multi-view Profile (MVP) system, described in document ISO/IEC JTC1/SC29/WG11 N1088 (ITU-T Recommendation H.262), entitled "Proposed Draft Amendment No. 3 to 13818-2 (Multi-view Profile)," November 1995, and its amendment 3; as well as the MPEG-4 Video Verification Model (VM) Version 3.0, described in document ISO/IEC JTC1/SC29/WG11 N1277, Tampere, Finland, July 1996, both of which are incorporated herein by reference.
Stereoscopic video provides slightly offset views of the same image to produce a combined image with greater depth of field, thereby creating a three-dimensional (3-D) effect. In such a system, dual cameras may be positioned about 2.5 inches, or 65 mm, apart to record an event on two separate video signals. The spacing of the cameras approximates the distance between left and right human eyes, i.e., the inter-ocular separation. Moreover, with some stereoscopic video camcorders, the two lenses are built into one camcorder head and therefore move in synchronism, for example, when panning across an image. The two video signals can be transmitted and recombined at a receiver to produce an image with a depth of field that corresponds to normal human vision. Other special effects can also be provided.
The MPEG MVP system includes two video layers which are transmitted in a multiplexed signal. First, a base (e.g., lower) layer represents a left view of a three dimensional object. Second, an enhancement (e.g., auxiliary, or upper) layer represents a right view of the object. Since the right and left views are of the same object and are offset only slightly relative to each other, there will usually be a large degree of correlation between the video images of the base and enhancement layers. This correlation can be used to compress the enhancement layer data relative to the base layer, thereby reducing the amount of data that needs to be transmitted in the enhancement layer to maintain a given image quality. The image quality generally corresponds to the quantization level of the video data.
The MPEG MVP system includes three types of video pictures; specifically, the intra-coded picture (I-picture), predictive-coded picture (P-picture), and bi-directionally predictive-coded picture (B-picture). Furthermore, while the base layer accommodates either frame or field structure video sequences, the enhancement layer accommodates only frame structure. An I-picture completely describes a single video picture without reference to any other picture. For improved error concealment, motion vectors can be included with an I-picture. An error in an I-picture has the potential for greater impact on the displayed video since both P-pictures and B-pictures in the base layer are predicted from I-pictures. Moreover, pictures in the enhancement layer can be predicted from pictures in the base layer in a cross-layer prediction process known as disparity prediction. Prediction from one frame to another within a layer is known as temporal prediction.
In the base layer, P pictures are predicted based on previous I or P pictures. The reference is from an earlier I or P picture to a future P-picture and is known as forward prediction. B-pictures are predicted from the closest earlier I or P picture and the closest later I or P picture.
In the enhancement layer, a P-picture can be predicted from (a) the most recently decoded picture in the enhancement layer, (b) the most recent base layer picture, in display order, or (c) the next lower layer picture, in display order. Case (b) is usually used when the most recent base layer picture, in display order, is an I-picture.
Moreover, a B-picture in the enhancement layer can be predicted using (d) the most recent decoded enhancement layer picture for forward prediction, and the most recent lower layer picture, in display order, for backward prediction, (e) the most recent decoded enhancement layer picture for forward prediction, and the next lower layer picture, in display order, for backward prediction, or (f) the most recent lower layer picture, in display order, for forward prediction, and the next lower layer picture, in display order, for backward prediction. When the most recent lower layer picture, in display order, is an I-picture, only that I-picture will be used for predictive coding (i.e., there will be no forward prediction).
Note that only prediction modes (a), (b) and (d) are encompassed within the MPEG MVP system. The MVP system is a subset of MPEG temporal scalability coding, which encompasses each of modes (a)-(f).
In one optional configuration, the enhancement layer has only P and B pictures, but no I pictures. The reference to a future picture (i.e., one that has not yet been displayed) is called backward prediction. Note that no backward prediction occurs within the enhancement layer. Accordingly, enhancement layer pictures are transmitted in display order. There are situations where backward prediction is very useful in increasing the compression rate. For example, in a scene in which a door opens, the current picture may predict what is behind the door based upon a future picture in which the door is already open.
B-pictures yield the most compression but also incorporate the most error. To eliminate error propagation, B-pictures may never be predicted from other B-pictures in the base layer. P-pictures yield less error and less compression. I-pictures yield the least compression, but are able to provide random access.
For disparity prediction, e.g., where a lower layer image is used as a reference image for an enhancement layer image, either alone or in combination with an enhancement layer reference image, the enhancement layer image is motion compensated by finding a best-match image in the reference image by searching a predefined search area, then differentially encoding the pixels of the enhancement layer image using the pixels of the best-match image of the reference image. A motion vector which defines the relative displacement of the best-match image to the coded enhancement layer region is transmitted with the differentially encoded pixel data to allow reconstruction of the enhancement layer image at a decoder. Processing may occur on a macroblock by macroblock basis.
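The encode and decode round trip described above may be sketched as follows. This is a minimal illustration under stated assumptions: the function names, the sum-of-absolute-differences matching criterion, the 4x4 block size, the search range of three pixels and the synthetic left and right views are all hypothetical and are not drawn from the MVP specification.

```python
def block(img, x0, y0, n):
    """Copy the n x n block whose top-left corner is (x0, y0)."""
    return [row[x0:x0 + n] for row in img[y0:y0 + n]]


def encode_block(enh, base, bx, by, n, search_range):
    """Encoder side: find the base layer block with the smallest sum of
    absolute differences, then return its disparity vector and the residual."""
    target = block(enh, bx, by, n)
    height, width = len(base), len(base[0])
    best = None
    for dy in range(-search_range, search_range + 1):
        for dx in range(-search_range, search_range + 1):
            x0, y0 = bx + dx, by + dy
            if x0 < 0 or y0 < 0 or x0 + n > width or y0 + n > height:
                continue  # candidate lies outside the reference picture
            cand = block(base, x0, y0, n)
            cost = sum(abs(t - c)
                       for t_row, c_row in zip(target, cand)
                       for t, c in zip(t_row, c_row))
            if best is None or cost < best[0]:
                best = (cost, (dx, dy), cand)
    _, vector, match = best
    # Differential encoding: only target minus best match is kept.
    residual = [[t - c for t, c in zip(t_row, c_row)]
                for t_row, c_row in zip(target, match)]
    return vector, residual


def decode_block(base, bx, by, n, vector, residual):
    """Decoder side: fetch the block the vector points at and add the residual."""
    dx, dy = vector
    match = block(base, bx + dx, by + dy, n)
    return [[m + r for m, r in zip(m_row, r_row)]
            for m_row, r_row in zip(match, residual)]


# Hypothetical views: the right (enhancement) view is the left (base) view
# displaced horizontally by three pixels, with one pixel altered.
base = [[(5 * x + 11 * y) % 29 for x in range(12)] for y in range(8)]
enh = [[base[y][min(x + 3, 11)] for x in range(12)] for y in range(8)]
enh[3][5] += 1

vector, residual = encode_block(enh, base, bx=4, by=2, n=4, search_range=3)
rebuilt = decode_block(base, 4, 2, 4, vector, residual)
```

Here the search recovers the three-pixel horizontal displacement, the residual is zero everywhere except at the altered pixel, and the decoder reproduces the enhancement layer block exactly from the base layer picture, the vector and the residual.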
However, the processing and memory storage requirements for disparity prediction increase as the motion vector search range increases. A large search range also leads to inefficient variable length coding (e.g., Huffman coding) of the disparity vectors. The result is a more expensive and/or slower encoding and decoding apparatus. Accordingly, it would be advantageous to have a system to improve the coding efficiency of disparity predicted enhancement layer images in a stereoscopic video system. The system should account for the inter-ocular separation of a stereoscopic video camera to provide a shifted lower layer image which more closely matches the enhancement layer image. The system should be compatible with various image sizes, including rectangular as well as arbitrarily shaped images.
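The dependence of search effort on search range can be made concrete with a short calculation. The sketch below assumes an exhaustive integer-pel search, in which a window of plus or minus R pixels in each direction contains (2R + 1) squared candidate positions; the two ranges compared (32 and 8 pixels) are hypothetical figures chosen only to show the scale of the effect.

```python
def candidate_count(range_x, range_y):
    """Number of integer-pel displacements in a +/-range_x by +/-range_y window."""
    return (2 * range_x + 1) * (2 * range_y + 1)


# Hypothetical comparison: a wide search window versus a narrower window
# that might suffice once the reference image has been suitably shifted.
wide = candidate_count(32, 32)
narrow = candidate_count(8, 8)
ratio = wide / narrow
```

With these assumed figures the wide window requires 4225 block comparisons per block against 289 for the narrow window, roughly a fourteen-fold difference. A narrower window also bounds the magnitude of the vectors that must be represented by the variable length code.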
The system should further be compatible with various existing and proposed video coding standards, such as MPEG-1, MPEG-2, MPEG-4, H.261 and H.263.
The system should provide for the transmission of an offset value for use by a decoder in reconstructing a reference frame. The system should also be effective with video standards that do not allow for the transmission of an offset value by reducing the motion vector search range at an encoder. The technique should be suitable for both still images and sequences of images.
The present invention provides a system having the above and other advantages.