Field of the Invention
The present disclosure relates to a multiview video coding method and device and, more particularly, to a multiview video coding method and device used in coding non-referenced view video groups. The method and device determine a view compensation pattern and a parallelization view pattern for a first non-referenced view video group among the non-referenced view video groups based on video characteristics, such as the number of bits occurring in each frame of the first non-referenced view video group, the difference between the number of bits of each frame of the first non-referenced view video group and the number of bits of a left reference view image, and the difference between the number of bits of each frame of the first non-referenced view video group and the number of bits of a right reference view image. The view compensation pattern and the parallelization view pattern determined for the first non-referenced view video group are then used as the view compensation pattern and the parallelization view pattern of a successive non-referenced view video group that is input in succession to the first non-referenced view video group, thereby coding multiview video images at a high coding rate without deteriorating image quality.
Description
As digital video evolves from high-definition video to ultra-high-definition video, three-dimensional (3D) video services have been introduced. The three-dimensional audio video (3DAV) group established the standard for 3D multiview video through new standardization of 3D audio/video technology, which has been part of the standardization work of the Moving Picture Experts Group (MPEG) since 2001. It is expected that a variety of applications using 3D multiview video will be actively developed in the future.
3D multiview video refers to a series of 3D images obtained using a plurality of cameras, which could not be obtained by existing imaging methods used for obtaining two-dimensional (2D) images using a single view camera. The key concept of compression coding technology including multiview video coding of 3D video is to more effectively compress and encode 3D video using not only temporal and spatial redundancy but also the redundancy between camera views.
However, the most significant problem in the compression coding of 3D multiview video is that predictive coding across time, space, and views must be performed between the images obtained using the plurality of cameras, and the amount of this work grows in proportion to the number of views. This predictive coding between a plurality of images takes up 70% to 80% of the overall coding compression calculations, thereby significantly increasing the overall amount of coding compression calculations.
FIG. 1 is a functional block diagram illustrating a typical multiview video compression coding method of the related art.
Referring to FIG. 1, a plurality of images S1, S2, . . . , and Sn obtained using a plurality of cameras are coded, thereby being formed as a bit stream. In a first image coding device 10 for coding a first image among the plurality of images obtained using a first camera, a motion estimator 11 estimates the motion of a current unit macro block that is input in a unit macro block size. That is, the motion estimation of a current unit macro block searches a reference frame region for a unit macro block matching the current unit macro block. A closest matching candidate macro block is selected by comparing all of, or portions of each of, available unit macro blocks within the reference frame region with the current unit macro block. Here, the sizes of the unit macro blocks are 4×4, 4×8, 8×4, 8×8, 8×16, 16×8, and 16×16.
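The block search described above can be sketched as follows. This is a minimal illustration only, assuming a full search over a small window and the sum of absolute differences (SAD) as the matching criterion; the text does not specify either choice. The names `sad`, `full_search`, and `search_range`, and the toy frames, are hypothetical.

```python
def sad(frame_a, ax, ay, frame_b, bx, by, size):
    """Sum of absolute differences between two size x size blocks."""
    return sum(
        abs(frame_a[ay + r][ax + c] - frame_b[by + r][bx + c])
        for r in range(size)
        for c in range(size)
    )


def full_search(current, reference, x, y, size=4, search_range=2):
    """Return ((dx, dy), cost) of the closest match for the block at (x, y)."""
    height, width = len(reference), len(reference[0])
    best_vector, best_cost = (0, 0), None
    for dy in range(-search_range, search_range + 1):
        for dx in range(-search_range, search_range + 1):
            rx, ry = x + dx, y + dy
            # Skip candidates that fall outside the reference frame.
            if rx < 0 or ry < 0 or rx + size > width or ry + size > height:
                continue
            cost = sad(current, x, y, reference, rx, ry, size)
            if best_cost is None or cost < best_cost:
                best_vector, best_cost = (dx, dy), cost
    return best_vector, best_cost


# Toy 8x8 frames: the current frame is the reference shifted right by one pixel.
reference = [[(3 * r + 7 * c) % 32 for c in range(8)] for r in range(8)]
current = [[reference[r][max(c - 1, 0)] for c in range(8)] for r in range(8)]
vector, cost = full_search(current, reference, x=2, y=2)
print(vector, cost)
```

Because every candidate in the window is compared against the current block, the cost of the search grows with both the window size and the number of block sizes tried.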
A motion compensator 13 produces an error value by taking the difference between the current unit macro block and the selected candidate macro block. The motion estimation and compensation are performed on the current unit macro block for each of the unit macro block sizes, namely 4×4, 4×8, 8×4, 8×8, 8×16, 16×8, and 16×16. A prediction mode determiner 15 determines the coding prediction mode of each unit macro block, i.e. the block size at which the unit macro block is coded and compressed, based on the error values produced by performing the motion estimation and compensation on the unit macro block.
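As a rough illustration of the compensation and mode decision steps, the sketch below computes an error value as the sum of absolute residual samples and then picks the block size with the smallest error. This is an assumption made for brevity: the text does not state how the error value is measured, and practical encoders typically also weigh the bit cost of each mode. All function names and the per-size error values are hypothetical.

```python
def residual_block(current_block, candidate_block):
    """Sample-wise difference between the current block and its prediction."""
    return [
        [cur - ref for cur, ref in zip(cur_row, ref_row)]
        for cur_row, ref_row in zip(current_block, candidate_block)
    ]


def block_error(residual):
    """Sum of absolute residual samples, used here as the error value."""
    return sum(abs(value) for row in residual for value in row)


def choose_prediction_mode(errors_by_size):
    """Pick the block size whose error value is smallest."""
    return min(errors_by_size, key=errors_by_size.get)


current_block = [[10, 12], [14, 16]]
candidate_block = [[9, 12], [15, 16]]
residual = residual_block(current_block, candidate_block)
error = block_error(residual)

# Hypothetical error values for one macro block, one entry per block size tried.
errors_by_size = {"16x16": 420, "16x8": 390, "8x8": 310, "4x4": 335}
mode = choose_prediction_mode(errors_by_size)
print(residual, error, mode)
```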
An encoder 17 performs a transform, such as a discrete cosine transform (DCT) or a wavelet transform, on the motion vectors of the error block and the unit macro block produced according to the determined coding prediction mode, and quantizes the transformed data, thereby removing spatially redundant elements. The encoder 17 generates a bit stream of the first image from the motion vectors of the error block and the unit macro block through the transform and quantization. The first image coding device 10 for coding the first image obtained using the first camera also generates a reference frame to be used in later image prediction by decoding the quantized data. This will not be described further, since a detailed description thereof is provided in the H.264 standard.
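The transform and quantization step can be illustrated with a small floating-point example. This is a sketch under stated assumptions: it uses a textbook orthonormal 2D DCT-II and a single uniform quantization step, whereas H.264 itself specifies an integer transform and its own quantization tables. The names `dct_2d`, `quantize`, and `step` are hypothetical.

```python
import math


def dct_2d(block):
    """Orthonormal 2D DCT-II of a square block (direct, unoptimized form)."""
    n = len(block)

    def scale(k):
        return math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)

    coeffs = [[0.0] * n for _ in range(n)]
    for u in range(n):
        for v in range(n):
            total = 0.0
            for r in range(n):
                for c in range(n):
                    total += (
                        block[r][c]
                        * math.cos((2 * r + 1) * u * math.pi / (2 * n))
                        * math.cos((2 * c + 1) * v * math.pi / (2 * n))
                    )
            coeffs[u][v] = scale(u) * scale(v) * total
    return coeffs


def quantize(coeffs, step):
    """Uniform quantization: divide by the step size and round to an integer."""
    return [[int(round(value / step)) for value in row] for row in coeffs]


# A flat 4x4 error block: all of its energy ends up in the DC coefficient.
error_block = [[8] * 4 for _ in range(4)]
levels = quantize(dct_2d(error_block), step=4)
print(levels)
```

For a smooth error block, most quantized coefficients become zero, which is what allows the subsequent entropy coding to shorten the bit stream.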
A second image coding device 20 for coding a second image among the plurality of images obtained using a second camera is generally divided into a view predicting part and a time predicting part. The view predicting part calculates an error value of the second image based on the difference in the view between the second image and the first image. The time predicting part calculates an error value of the second image based on the difference in the time between the reference frame and the current macro block of the second image through the motion compensation.
The view predicting part for the second image will be described in detail hereinafter. First, a view estimator 21 estimates the difference in the view between the unit macro block of the first image and the unit macro block of the second image. To do so, the view estimator 21 searches for a unit macro block of the first image obtained using the first camera that matches a unit macro block of the second image obtained using the second camera. A view compensator 22 produces an error value by taking the difference between the matched unit macro block of the first image and the unit macro block of the second image.
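The view estimation and compensation steps can be sketched in the same way as motion estimation, except that the search runs between two views captured at the same time instant. The example below assumes a purely horizontal offset between the two cameras and a SAD matching criterion; neither is stated in the text. The names `block_sad`, `estimate_disparity`, and `max_disparity` are hypothetical.

```python
def block_sad(view_a, ax, view_b, bx, y, size):
    """Sum of absolute differences between two blocks on the same rows."""
    return sum(
        abs(view_a[y + r][ax + c] - view_b[y + r][bx + c])
        for r in range(size)
        for c in range(size)
    )


def estimate_disparity(second_view, first_view, x, y, size=4, max_disparity=3):
    """Return (disparity, error) for the block at (x, y) of the second view."""
    width = len(first_view[0])
    best = None
    for d in range(-max_disparity, max_disparity + 1):
        fx = x + d
        # Skip candidates that fall outside the first view.
        if fx < 0 or fx + size > width:
            continue
        error = block_sad(second_view, x, first_view, fx, y, size)
        if best is None or error < best[1]:
            best = (d, error)
    return best


# Toy views: the second view sees the same scene shifted by two pixels.
first_view = [[(5 * r + 11 * c) % 64 for c in range(12)] for r in range(4)]
second_view = [[first_view[r][min(c + 2, 11)] for c in range(12)] for r in range(4)]
disparity, view_error = estimate_disparity(second_view, first_view, x=4, y=0)
print(disparity, view_error)
```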
The MPEG three-dimensional audio video (3DAV) group standardizes view video and 3D audio/video technologies for prediction between views. Studies are being undertaken into reducing the problems of the existing approaches to view video processing by changing the structure of a group of groups of pictures (GoGOP). The GoGOP is an extension of the concept of the group of pictures (GOP), which represents a group of frames in a single view, and represents a group of GOPs according to the number of views.
The structure of the GoGOP for view video processing of the related art is an anchor structure. Since an I frame is provided in each view and coding is performed independently for each view, the predictive coding is inefficient. In order to overcome the problems of the anchor structure, a hierarchical B picture structure was proposed. Unlike the anchor structure, the GoGOP of the hierarchical B picture structure sets an I frame in the first view and allows the other views to refer to each other.
FIG. 2 illustrates an example of the GoGOP of the hierarchical B picture structure.
Describing FIG. 2 in greater detail, square blocks represent view frames in a view video source. A vertical arrow represents the sequence of the frames according to views or camera positions, and a horizontal arrow represents the time sequence of the frames. Arrows between the frames represent the directions of prediction, in which horizontal arrows mean the directions of predicted motions, and vertical arrows mean the directions of predicted views. I frames represent “intra-frames,” which are identical to the I frames in the MPEG-2/4 or H.264 standard, and P frames and B frames represent “prediction frames” and “bidirectional prediction frames” similar to those in the MPEG-2/4 or H.264 standard.
As illustrated in FIG. 2, it is apparent that the sequences of frames in each view are formed of different frames. The 0th view S0 includes I frames and B frames, the first view S1 includes B frames and b frames, and the second view S2 includes P frames and B frames.
Korean Patent No. 10-1383486 discloses a multiview video coding method intended to reduce the amount of calculations considering that a picture group in a view including B frames or b frames is not used for prediction by the other picture groups in the GoGOP of this hierarchical B picture (frame) structure, and that a user watching multiview video actually uses only two views. Hereinafter, a picture group in a view including B frames or b frames is referred to as a non-referenced view video group. The multiview video coding method disclosed in Korean Patent No. 10-1383486 can reduce the amount of calculations used for parallelization by setting the non-referenced view video group parallel to an adjacent left reference view image or an adjacent right reference view image instead of parallelizing all frames of multiview video. It is also possible to reduce the distortion of video by setting the non-referenced view video group parallel to an adjacent reference view image.
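The notion of a non-referenced view video group and its adjacent left and right reference views can be sketched as below. This is illustrative only: the inter-view reference map is an assumption chosen to resemble a three-view hierarchical B picture arrangement, not something taken from FIG. 2, and the function names are hypothetical. The sketch only identifies the two candidate reference views; it does not choose between them.

```python
def non_referenced_views(view_order, references):
    """Views that no other view uses for inter-view prediction."""
    referenced = {ref for refs in references.values() for ref in refs}
    return [view for view in view_order if view not in referenced]


def adjacent_reference_views(view, view_order, non_referenced):
    """Nearest left and right views that are used as references."""
    index = view_order.index(view)
    left = next(
        (v for v in reversed(view_order[:index]) if v not in non_referenced), None
    )
    right = next(
        (v for v in view_order[index + 1:] if v not in non_referenced), None
    )
    return left, right


# Assumed reference map: S2 is predicted from S0, and S1 from both S0 and S2.
view_order = ["S0", "S1", "S2"]
references = {"S0": [], "S1": ["S0", "S2"], "S2": ["S0"]}

groups = non_referenced_views(view_order, references)
neighbours = adjacent_reference_views("S1", view_order, groups)
print(groups, neighbours)
```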
However, in the related-art approaches as described above, there is no indication of which one of the adjacent left reference view image and the adjacent right reference view image is to be taken as a basis for parallelization or view compensation on the non-referenced view video group.
The information disclosed in the Background section is only provided for a better understanding of the background and should not be taken as an acknowledgment or any form of suggestion that this information forms prior art that would already be known to a person skilled in the art.