1. Field of the Invention
Methods and apparatuses consistent with the present invention relate to encoding and decoding a multi-view moving picture, and more particularly, to a disparity vector estimation method of quickly encoding a multi-view moving picture and improving compressibility of the multi-view moving picture, and a method and apparatus for encoding and decoding a multi-view moving picture using the disparity vector estimation method.
2. Description of the Related Art
Realism is an important factor in achieving high-quality information and telecommunication services. This realism can be achieved with video communication based on three-dimensional (3D) images. 3D image systems have many potential applications in education, entertainment, medical surgery, videoconferencing, and the like. To provide viewers with more vivid and accurate information of a remote scene, three or more cameras are placed at slightly different viewpoints to produce a multi-view sequence.
Reflecting the current interest in 3D images, a number of research groups have developed 3D-image processing and display systems. In Europe, research on 3DTV has been initiated through several projects such as DISTIMA, the objective of which is to develop a system for capturing, coding, transmitting, and displaying digital stereoscopic image sequences. These projects have led to another project, PANORAMA, with the goal of enhancing visual information in 3D telepresence communication. These projects have also led to another project, ATTEST, in which various technologies for 3D-content acquisition, 3D-compression & transmission, and 3D-display systems were researched. In the ATTEST project, Motion Picture Experts Group 2 (MPEG-2) and Digital Video Broadcasting (DVB) standards were applied to transmit 3D contents using temporal scalability. In this temporal scalability scheme, a base layer is used for the transmission of 2D contents and an enhancement layer is used for the transmission of 3D contents.
The MPEG-2 standard was amended in 1996 to define a multiview profile (MVP). The MVP defines the usage of a temporal scalability mode for multi-camera sequences and acquisition camera parameters in an MPEG-2 syntax.
A base-layer stream which represents a multiview video signal can be encoded at a reduced frame rate, and an enhancement-layer stream, which can be used to insert additional frames in between, can be defined to allow reproduction at a full frame rate when both streams are available. A very efficient way to encode the enhancement layer is to determine the optimal method of performing motion-compensated estimation on each macroblock in an enhancement layer frame based on either a base layer frame or a recently reconstructed enhancement layer frame.
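The per-macroblock choice between a base-layer reference and an enhancement-layer reference can be sketched as follows. This is an illustrative sketch only, assuming a simple sum-of-absolute-differences (SAD) cost; the function and variable names are hypothetical and not taken from any standard codec.

```python
# Illustrative sketch: choose, for one enhancement-layer macroblock, the
# predictor with the lower SAD cost. Pixel blocks are flat lists of
# luma samples; a real encoder would also weigh rate cost.

def sad(block, ref):
    """Sum of absolute differences between two equal-length pixel lists."""
    return sum(abs(p - q) for p, q in zip(block, ref))

def choose_reference(mb, base_layer_pred, enh_layer_pred):
    """Return which predictor (base-layer frame or recently reconstructed
    enhancement-layer frame) best matches the macroblock, with its cost."""
    cost_base = sad(mb, base_layer_pred)
    cost_enh = sad(mb, enh_layer_pred)
    if cost_base <= cost_enh:
        return ("base", cost_base)
    return ("enhancement", cost_enh)
```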
Stereo and multiview channel encoding of such a multiview video signal using the temporal scalability syntax is straightforward. For this purpose, a frame from a particular camera view (usually a left-eye frame) is defined as the base layer, and a frame from the other camera view is defined as the enhancement layer. For the enhancement layer, although disparity-compensated estimation may fail in occluded regions, it is still possible to maintain the quality of a reconstructed image using motion-compensated estimation within the same channel. Since the MPEG-2 MVP was mainly defined for stereo sequences, it does not support multiview sequences and is inherently difficult to extend to multiview sequences.
FIG. 1 is a block diagram illustrating an encoder and decoder of the MPEG-2 MVP.
Referring to FIG. 1, the MPEG-2 MVP (13818-2) encodes and reproduces a three-dimensional (3D) moving picture using a left view picture and a right view picture, by utilizing a scalable codec that detects the correlation between the left and right view pictures and variably encodes the difference between them according to the network status. Here, the left view picture is defined as a base layer moving picture and the right view picture is defined as an enhancement layer picture. The base layer picture can be encoded in its original form, and the enhancement layer picture is additionally encoded and transmitted in order to enhance the quality of the base layer moving picture when the network status is stable. As such, encoding using both the base layer moving picture and the enhancement layer picture is called scalable coding.
The left view picture is encoded by a first motion compensated DCT encoder 110. The disparity between the left view picture and the right view picture is estimated by a disparity estimator 122 and compensated by a disparity compensator 124, and the resulting difference is then encoded by a second motion compensated DCT encoder 126. The first motion compensated DCT encoder 110 for encoding the left view picture is referred to as a base layer picture encoder, and the disparity estimator 122, the disparity compensator 124, and the second motion compensated DCT encoder 126 for encoding the disparity between the right view picture and the left view picture constitute an enhancement layer picture encoder 120. The encoded base layer picture and enhancement layer picture are multiplexed by a system multiplexer 130 and then transferred to a decoder.
The multiplexed signal is divided into a left view picture and a right view picture by a system demultiplexer 140. The left view picture is decoded by a first motion compensated DCT decoder 150. A disparity picture is restored to the right view picture by a second motion compensated DCT decoder 164 and a disparity compensator 162, which compensates for the disparity between the left view picture and the right view picture. The first motion compensated DCT decoder 150 for decoding the left view picture is referred to as a base layer picture decoder, and the disparity compensator 162 and the second motion compensated DCT decoder 164, which compensate for the disparity between the right view picture and the left view picture and decode the right view picture, constitute an enhancement layer picture decoder 160.
FIG. 2 is a view for explaining disparity-based estimation encoding in which disparity estimation is used twice for bi-directional motion estimation.
A left view picture is encoded by a non-scalable MPEG-2 encoder, and a right view picture is encoded by an MPEG-2 temporal auxiliary view encoder on the basis of the decoded left view picture.
That is, the right view picture is encoded as a bi-directional (B) picture using estimation results obtained from two reference pictures, for example, two left view pictures. One of the two reference pictures is the left view picture to be displayed simultaneously with the right view picture, and the other is the left view picture that temporally follows it.
Also, the estimation from the two reference pictures has three estimation modes, including a forward mode, a backward mode, and an interpolated mode, similar to motion estimation/compensation. Here, the forward mode indicates a disparity estimated from the isochronal left view picture, and the backward mode indicates a disparity estimated from the left view picture that immediately follows the isochronal left view picture. Since a right view picture is estimated using disparity vectors with respect to two left view pictures, this estimation method is called disparity-based estimation encoding. Accordingly, the encoder estimates two disparity vectors for each frame of the right view moving picture, and the decoder decodes the right view moving picture from the left view moving pictures using the two disparity vectors.
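The forward, backward, and interpolated modes described above can be sketched as follows. This is an illustrative sketch only, assuming a simple SAD cost and an averaging interpolated predictor; the function names are hypothetical and not taken from the MPEG-2 MVP specification.

```python
# Illustrative sketch: select among forward, backward, and interpolated
# disparity estimation modes for one block of a right view picture.
# fwd_pred comes from the isochronal left view picture, bwd_pred from
# the immediately following left view picture.

def sad(a, b):
    """Sum of absolute differences between two equal-length pixel lists."""
    return sum(abs(x - y) for x, y in zip(a, b))

def best_mode(right_block, fwd_pred, bwd_pred):
    """Return the name of the lowest-cost estimation mode."""
    # Interpolated mode averages the two disparity-compensated predictors.
    interp_pred = [(f + b) // 2 for f, b in zip(fwd_pred, bwd_pred)]
    costs = {
        "forward": sad(right_block, fwd_pred),
        "backward": sad(right_block, bwd_pred),
        "interpolated": sad(right_block, interp_pred),
    }
    return min(costs, key=costs.get)
```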
FIG. 3 is a view for explaining estimation encoding using disparity vectors and motion vectors for interpolated estimation.
In FIG. 3, B pictures are used for interpolated estimation as illustrated in FIG. 2. Here, however, the interpolated estimation uses both a disparity estimation and a motion estimation. That is, a disparity estimation result obtained from the isochronal left view picture and a motion estimation result obtained from the temporally preceding right view picture are used.
Like disparity-based estimation encoding, estimation encoding using disparity vectors and motion vectors also includes three estimation modes: a forward mode, a backward mode, and an interpolated mode. Here, the forward mode indicates a motion estimation obtained from a decoded right view picture, and the backward mode indicates a disparity estimation obtained from a decoded left view picture.
As described above, since the MPEG-2 MVP specification was not conceived as an encoder for a multi-view moving picture, it is not well suited even to an actual stereo moving picture. Therefore, an encoder that can efficiently encode a multi-view moving picture, in order to provide a three-dimensional effect and realism to a plurality of viewers simultaneously, is needed.
A new H.264 video coding standard has been developed for high encoding efficiency compared to related art standards. The H.264 standard relies on various new features, including variable block sizes from 16×16 down to 4×4, a quadtree structure for motion compensation, an in-loop deblocking filter, multiple reference frames, intra prediction, and context-adaptive entropy coding, as well as generalized B prediction slices. Unlike in the MPEG-2 standard, the MPEG-4 Part 2 standard, etc., B slices can reference other slices while using multiple predictions obtained from the same direction (forward or backward). However, these features require a great number of bits for motion information, including the estimation mode, the motion vector, and the reference picture index.
In order to overcome this problem, a skip mode and a direct mode can be introduced into predictive (P) slices and B slices, respectively. The skip and direct modes allow the motion of an arbitrary block of a picture to be currently encoded to be estimated using previously encoded motion vector information. Accordingly, no additional motion data is encoded for such macroblocks (MBs) or blocks. Motions for these modes are obtained using the spatial (skip) or temporal (direct) correlation of the motions of adjacent MBs or pictures.
FIG. 4 is a view for explaining a direct mode of a B picture.
In the direct mode, when estimating the motion of an arbitrary block of a B picture to be currently encoded, a forward motion vector and a backward motion vector are obtained using the motion vector of the co-located block of a temporally following P picture.
In order to calculate a forward motion vector MVL0 and a backward motion vector MVL1 of a direct mode block 402, whose motion is to be estimated in a B picture 410, the motion vector MV, with respect to a reference list 0 picture 430, of the co-located block 404 (which is at the same position as the direct mode block 402) in a reference list 1 picture 420, the temporally following picture, is detected. The forward motion vector MVL0 and the backward motion vector MVL1 of the direct mode block 402 of the B picture 410 are then calculated using Equation 1 as follows:
MVL0 = (TRB / TRD) × MV, MVL1 = ((TRB − TRD) / TRD) × MV    (1)
where MV represents the motion vector of the co-located block 404 of the reference list 1 picture 420, TRD represents the distance between the reference list 0 picture 430 and the reference list 1 picture 420, and TRB represents the distance between the B picture 410 and the reference list 0 picture 430.
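Equation 1 can be transcribed directly as follows. This sketch uses simple floating-point arithmetic for clarity; the actual H.264 standard derives these vectors with fixed-point scaling, which is omitted here.

```python
# Illustrative transcription of Equation 1: scale the co-located block's
# motion vector mv = (x, y) into the forward (MVL0) and backward (MVL1)
# motion vectors of the direct-mode block. tr_b is the distance from the
# B picture to the reference list 0 picture; tr_d is the distance between
# the list 0 and list 1 pictures.

def direct_mode_mvs(mv, tr_b, tr_d):
    mvx, mvy = mv
    mvl0 = (tr_b / tr_d * mvx, tr_b / tr_d * mvy)
    mvl1 = ((tr_b - tr_d) / tr_d * mvx, (tr_b - tr_d) / tr_d * mvy)
    return mvl0, mvl1
```

Note that since TRB is smaller than TRD, the backward vector MVL1 points in the opposite direction to MV.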
FIG. 5 is a view for explaining a method of estimating a motion vector in a spatial area.
According to the H.264 standard used for encoding moving picture data, a frame is divided into blocks, each having a predetermined size, and a motion search is performed for the block most similar to each block in an adjacent, previously encoded frame (or frames). The median of the motion vectors of the left macroblock 4, the upper middle macroblock 2, and the upper right macroblock 3 of the current macroblock c is determined as the estimate of the corresponding motion vector. This motion vector estimation can be expressed by Equation 2 as follows:
pmvx = MEDIAN(mvx2, mvx3, mvx4)
pmvy = MEDIAN(mvy2, mvy3, mvy4)    (2)
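Equation 2 can be sketched as follows, with the median taken independently on each component of the neighbouring motion vectors. The function names are illustrative.

```python
# Illustrative transcription of Equation 2: predict the current
# macroblock's motion vector as the component-wise median of the motion
# vectors of its left (4), upper middle (2), and upper right (3)
# neighbouring macroblocks.

def median3(a, b, c):
    """Median of three values: the middle element after sorting."""
    return sorted((a, b, c))[1]

def predict_mv(mv2, mv3, mv4):
    """Each argument is an (x, y) motion vector of a neighbouring block."""
    pmvx = median3(mv2[0], mv3[0], mv4[0])
    pmvy = median3(mv2[1], mv3[1], mv4[1])
    return (pmvx, pmvy)
```

The component-wise median discards an outlier neighbour in each direction, which is why it is preferred over a simple average.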
As such, a method of encoding a moving picture using spatial correlation as well as temporal correlation has been proposed. However, a method of enhancing the compressibility and processing speed of a multi-view moving picture, which carries significantly more information than a general moving picture, is still required.