In recent years, as high-definition (HD) broadcast services have spread domestically and globally, a large number of users have become accustomed to high-resolution, high-quality videos, and institutions are accordingly accelerating the development of next-generation video devices. Also, with growing interest in ultra-high-definition (UHD) services having a resolution four times higher than that of HDTV, compression techniques for higher-quality videos are needed.
Video compression may use an inter prediction technique of predicting pixel values included in a current picture from pictures temporally preceding and/or following the current picture, an intra prediction technique of predicting pixel values included in a current picture using pixel information in the current picture, or an entropy encoding technique of assigning a short code to a symbol with a high appearance frequency and a long code to a symbol with a low appearance frequency.
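The entropy encoding principle described above (short codes for frequent symbols, long codes for rare ones) can be illustrated with a minimal sketch. Huffman coding is used here only as a familiar example of such a technique; the text does not state which entropy coder is employed, and the function name `huffman_code_lengths` is hypothetical.

```python
import heapq
from collections import Counter


def huffman_code_lengths(symbols):
    """Return {symbol: code length in bits} for a Huffman code built from symbol counts."""
    counts = Counter(symbols)
    if len(counts) == 1:
        # A single distinct symbol still needs one bit per occurrence.
        return {next(iter(counts)): 1}
    # Heap entries: (total count, unique tie-breaker, {symbol: depth so far}).
    heap = [(n, i, {s: 0}) for i, (s, n) in enumerate(counts.items())]
    heapq.heapify(heap)
    tie = len(heap)
    while len(heap) > 1:
        # Merge the two least frequent subtrees; every symbol in them gets one bit longer.
        n1, _, d1 = heapq.heappop(heap)
        n2, _, d2 = heapq.heappop(heap)
        merged = {s: depth + 1 for s, depth in {**d1, **d2}.items()}
        heapq.heappush(heap, (n1 + n2, tie, merged))
        tie += 1
    return heap[0][2]


# "a" appears 8 times, "b" 4 times, "c" twice, "d" once.
lengths = huffman_code_lengths("aaaaaaaabbbbccd")
print(lengths)
```

In this example the most frequent symbol receives a 1-bit code and the two rarest symbols receive 3-bit codes, so the 15 input symbols take 25 bits rather than the 30 bits a fixed 2-bit code would need.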
Video compression technology may include a technique that assumes a constant network bandwidth in a restricted hardware operating environment, without taking variable network environments into account. However, compressing video data for network environments in which the bandwidth changes frequently requires new compression techniques, for which a scalable video encoding/decoding method may be employed.
Meanwhile, a three-dimensional (3D) video provides a 3D effect to users through a stereoscopic 3D display apparatus, as if the users were seeing and experiencing the real world. In this connection, the Moving Picture Experts Group (MPEG), a working group of the International Organization for Standardization (ISO)/International Electrotechnical Commission (IEC) that sets standards for video technologies, is conducting studies on 3D video standards. The 3D video standards include standards for advanced data formats, which support representation of not only stereoscopic images but also auto-stereoscopic images using real images and depth maps thereof, and for relevant technologies.
FIG. 1 illustrates a basic structure of a 3D video system, which is currently considered in 3D video standards.
As shown in FIG. 1, a transmitter side generating content (3D content producer) acquires N-view (N≥2) picture contents using a stereo camera, a depth camera, a multi-camera setup, and two-dimensional (2D)-to-3D conversion, which converts a 2D picture into a 3D picture.
The acquired picture contents may include N-view video information (N×Video), depth map information thereof, and camera-related side information.
The N-view picture contents are compressed using a multi-view video encoding method, and a compressed bit stream is transmitted to a terminal through a network, for example, digital video broadcasting (DVB).
A receiver side decodes the transmitted bit stream using a multi-view video decoding method to reconstruct the N-view pictures.
Virtual-view pictures for N or more views are then generated from the reconstructed N-view pictures by depth-image-based rendering (DIBR).
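As an illustration of how DIBR may generate a virtual view, the following minimal sketch forward-warps one picture row using its depth values. It rests on assumptions that are not taken from the text: rectified, horizontally displaced cameras, a disparity of focal length × baseline / depth, and a virtual camera placed so that content shifts to the left. The function name `warp_row` and its parameters are hypothetical.

```python
def warp_row(texture_row, depth_row, focal_px, baseline):
    """Forward-warp one picture row to a horizontally shifted virtual camera.

    depth_row holds the depth Z of each pixel; the assumed disparity is
    focal_px * baseline / Z. Returns the virtual row, with None marking
    holes that no source pixel maps to.
    """
    width = len(texture_row)
    virtual = [None] * width
    nearest = [float("inf")] * width  # depth buffer: the closest surface wins
    for x, (value, z) in enumerate(zip(texture_row, depth_row)):
        disparity = focal_px * baseline / z
        target = x - int(round(disparity))
        if 0 <= target < width and z < nearest[target]:
            virtual[target] = value
            nearest[target] = z
    return virtual


# Pixels at depth 50 move by 2 positions, pixels at depth 100 by 1 position.
row = warp_row([10, 20, 30, 40, 50], [100, 100, 50, 50, 100],
               focal_px=100, baseline=1)
print(row)
```

Nearer pixels shift further and overwrite farther ones that land on the same position, while positions that receive no pixel remain as holes. A practical renderer would fill such holes, a step this sketch omits.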
The virtual-view pictures for N or more views are reproduced in a manner suited to various stereoscopic display apparatuses, for instance, a 2D display, an M-view 3D display, and a head-tracked stereo display, to provide stereoscopic pictures to users.
A depth map used to generate a virtual-view picture represents the real-world distance between a camera and an object, expressed with a certain number of bits. The depth map holds a depth value for each pixel at the same resolution as that of the real picture.
FIG. 2 illustrates a depth map of picture “balloons” being used in MPEG standards for 3D video coding.
In FIG. 2, (a) is a real picture of picture “balloons,” and (b) is a depth map of picture “balloons.” In (b), a depth is expressed as 8 bits per pixel.
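How a real-world distance may be mapped to an 8-bit depth value can be sketched as follows. The text only states that a depth is expressed as 8 bits per pixel; the inverse-depth mapping between a nearest plane `z_near` and a farthest plane `z_far` used here is one common convention and is an assumption, as are the function names.

```python
def depth_to_8bit(z, z_near, z_far):
    """Quantize a distance z (z_near <= z <= z_far) to an 8-bit value.

    Assumed convention: 255 is the nearest plane and 0 the farthest,
    with levels spaced uniformly in 1/z so near objects get finer steps.
    """
    scale = (1.0 / z - 1.0 / z_far) / (1.0 / z_near - 1.0 / z_far)
    return max(0, min(255, int(round(255 * scale))))


def depth_from_8bit(v, z_near, z_far):
    """Recover an approximate distance from an 8-bit depth value."""
    return 1.0 / (v / 255.0 * (1.0 / z_near - 1.0 / z_far) + 1.0 / z_far)


nearest_level = depth_to_8bit(1.0, z_near=1.0, z_far=100.0)
farthest_level = depth_to_8bit(100.0, z_near=1.0, z_far=100.0)
print(nearest_level, farthest_level)
```

With only 256 levels the mapping is lossy, so converting a level back to a distance returns an approximation of the original value.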
As an example, H.264/AVC (MPEG-4 Part 10 Advanced Video Coding) may be used to code the real picture and the depth map thereof. Alternatively, High Efficiency Video Coding (HEVC), an international video compression standard jointly developed by the MPEG and the Video Coding Experts Group (VCEG), may be employed.
FIG. 3 illustrates an inter-view prediction structure in a 3D video codec.
A real picture and a depth map thereof may be images obtained not only by a single camera but also by a plurality of cameras. Pictures obtained by a plurality of cameras may be encoded independently, in which case a general 2D video codec may be used.
Further, the pictures obtained by the plurality of cameras are correlated across views and accordingly may be encoded using inter-view prediction between different views so as to enhance encoding efficiency.
As shown in FIG. 3, viewpoint 1 (view 1) is a picture captured by a camera to the left of viewpoint 0 (view 0), while viewpoint 2 (view 2) is a picture captured by a camera to the right of view 0.
View 1 and view 2 may be inter-view predicted using view 0 as a reference picture, in which case view 0 needs encoding prior to view 1 and view 2. Here, view 0 may be encoded independently of other views and thus be referred to as an independent view.
In contrast, view 1 and view 2 use view 0 as a reference picture and thus may be referred to as dependent views. An independent view picture may be encoded using a general 2D video codec, whereas a dependent view picture needs to be inter-view predicted and thus may be encoded using a 3D video codec including an inter-view prediction process.
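The ordering constraint described for FIG. 3 (a reference view must be encoded before any view that refers to it) can be sketched as a dependency ordering. The dictionary below simply restates the references named in the text; the variable names are hypothetical, and the sketch assumes Python 3.9 or later for the standard `graphlib` module.

```python
from graphlib import TopologicalSorter

# Each view maps to the set of views it uses as inter-view references.
references = {
    "view0": set(),       # refers to no other view
    "view1": {"view0"},
    "view2": {"view0"},
}

# A valid coding order places every reference view before the views that use it.
coding_order = list(TopologicalSorter(references).static_order())

# Views without inter-view references are independent; the rest are dependent.
independent_views = sorted(v for v, refs in references.items() if not refs)
dependent_views = sorted(v for v, refs in references.items() if refs)

print(coding_order)
print(independent_views, dependent_views)
```

View 0 always comes first in the resulting order, whereas the relative order of view 1 and view 2 is unconstrained because neither refers to the other.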
Further, view 1 and view 2 may be encoded using a depth map so as to increase encoding efficiency.
FIG. 4 is a block diagram schematically illustrating a video encoder and a video decoder which encode and decode a texture and a depth.
As shown in FIG. 4, the video encoder 410 includes a texture encoder 415 and a depth encoder 417, and the video decoder 420 includes a texture decoder 425 and a depth decoder 427.
The texture encoder 415 receives an input of a texture corresponding to a real picture and encodes the texture into a bit stream, and the texture decoder 425 receives the bit stream encoded by the texture encoder 415 and decodes the bit stream to output the decoded texture.
The depth encoder 417 encodes a depth, that is, a depth map, and the depth decoder 427 decodes the depth map.
When a real picture and a depth map thereof are encoded, the real picture and the depth map thereof may be encoded/decoded separately.
Further, when the picture and the depth map are encoded as in FIG. 4, the picture and the depth map may be encoded/decoded by referring to each other, that is, dependently. A real picture may be encoded/decoded using an already encoded/decoded depth map, and a depth map may be encoded/decoded likewise using an already encoded/decoded real picture.
FIG. 5 illustrates a prediction structure of 3D picture coding. Specifically, FIG. 5 illustrates an encoding prediction structure for encoding real pictures captured by three cameras and depth maps thereof.
In FIG. 5, three real pictures are represented by T0, T1 and T2 depending on viewpoints, and three depth maps at the same positions as those of the real pictures are represented by D0, D1 and D2.
Here, T0 and D0 are pictures obtained from view 0, T1 and D1 are pictures obtained from view 1, and T2 and D2 are pictures obtained from view 2. Each picture may be encoded as an intra picture (I), a uni-prediction picture (P), or a bi-prediction picture (B).
Prediction methods for deriving motion information on a current block from a real picture may be broadly divided into temporal prediction and inter-view prediction. Temporal prediction is a prediction method using a temporal correlation within the same view, while inter-view prediction is a prediction method using a correlation between views. Temporal prediction and inter-view prediction may be used in combination for predicting a single picture. The motion information may include at least one of a motion vector, a reference picture number, prediction direction information indicating whether unidirectional prediction or bidirectional prediction is used, and information indicating whether inter-view prediction, temporal prediction, or another prediction is used.
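The items of motion information listed above can be pictured as a simple record. The following is only an illustrative sketch: the class names, field names, and example values are hypothetical and do not correspond to the syntax of any particular codec.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional, Tuple


class PredictionType(Enum):
    TEMPORAL = "temporal"      # uses a temporal correlation within the same view
    INTER_VIEW = "inter_view"  # uses a correlation between views
    OTHER = "other"


class PredictionDirection(Enum):
    UNIDIRECTIONAL = "uni"
    BIDIRECTIONAL = "bi"


@dataclass(frozen=True)
class MotionInfo:
    # Every field is optional because the text says "at least one of".
    motion_vector: Optional[Tuple[int, int]] = None
    reference_picture_number: Optional[int] = None
    direction: Optional[PredictionDirection] = None
    prediction_type: Optional[PredictionType] = None


# Hypothetical example: a block predicted from a picture of another view.
info = MotionInfo(
    motion_vector=(-3, 0),
    reference_picture_number=0,
    direction=PredictionDirection.UNIDIRECTIONAL,
    prediction_type=PredictionType.INTER_VIEW,
)
print(info)
```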
In FIG. 5, an arrow represents a prediction direction, and the real pictures and the depth maps thereof may be encoded/decoded dependently on each other. That is, the depth maps may be referenced for predicting the real pictures, and the real pictures may be referenced for predicting the depth maps.
However, decoding a 3D picture increases both the implementation complexity of hardware and software and the computational complexity.