Field of the Invention
Embodiments of the present invention generally relate to perceptual three-dimensional (3D) video coding based on depth information.
Description of the Related Art
The demand for digital video products continues to increase. Some examples of applications for digital video include video communication, security and surveillance, industrial automation, and entertainment (e.g., DV, HDTV, satellite TV, set-top boxes, Internet video streaming, digital cameras, cellular telephones, video jukeboxes, high-end displays and personal video recorders). Further, video applications are becoming increasingly mobile as a result of higher computation power in handsets, advances in battery technology, and high-speed wireless connectivity.
Video compression is an essential enabler for digital video products. Compression-decompression (CODEC) algorithms enable storage and transmission of digital video. In general, the encoding process of video compression generates coded representations of frames or subsets of frames. The encoded video bit stream, i.e., encoded video sequence, may include three types of frames: intra-coded frames (I-frames), predictive coded frames (P-frames), and bi-directionally coded frames (B-frames). I-frames are coded without reference to other frames. P-frames are coded using motion compensated prediction from I-frames or P-frames. B-frames are coded using motion compensated prediction from both past and future reference frames. For encoding, frames maybe divided into smaller blocks, e.g., macroblocks of 16×16 pixels in the luminance space and 8×8 pixels in the chrominance space for the simplest sub-sampling format of H.264/AVC or the quadtree derived coding units of the emerging High Efficiency Video Coding (HEVC) standard.
Video coding standards (e.g., MPEG, H.264, HEVC, etc.) are based on the hybrid video coding technique of block motion compensation and transform coding. Block motion compensation is used to remove temporal redundancy between blocks of a frame and transform coding is used to remove spatial redundancy in the video sequence. Traditional block motion compensation schemes basically assume that objects in a scene undergo a displacement in the x- and y-directions from one frame to the next. Motion vectors are signaled from the encoder to a decoder to describe this motion. As part of forming the coded signal, a block transform is performed and the resulting transform coefficients are quantized to reduce the size of the signal to be transmitted and/or stored.
In some video coding standards, a quantization parameter (QP) is used to modulate the step size of the quantization for each block. For example, in H.264/AVC and HEVC, quantization of a transform coefficient involves dividing the coefficient by a quantization step size. The quantization step size, which may also be referred to as the quantization scale, is define by the standard based on the QP value, which may be an integer from 0 to 51. A step size for a QP value may be determined, for example, using a table lookup and/or by computational derivation. The quality and bit rate of the coded bit stream is determined by the QP value selected by the encoder for quantizing each block. The use of coarser quantization encodes a frame using fewer bits but reduces image quality while the use of finer quantization encodes a frame using more bits but increases image quality. Further, in some standards, the QP values may be modified within a frame. For example, in various versions of the MPEG standard and in H.263 and H.264/AVC, a different QP can be selected for each 16×16 block in a frame. In HEVC, a different QP can be selected for each coding unit.
The block-based coding and use of quantization may cause coding artifacts in the decoded video. For two-dimensional (2D) video, perceptually-based quantization techniques have been used to make these coding artifacts less visible to the human eye. Such techniques vary the QP value for blocks in a frame to distribute the noise and artifacts according to masking properties of the human visual system (HVS). The goal is to maximize the visual quality of an encoded video sequence while keeping the bit rate low. For example, according to HVS theory, the human visual system performs texture masking (also called detail dependence, spatial masking or activity masking). That is, the discrimination threshold of the human eye increases with increasing picture detail, making the human eye less sensitive to quantization noise and coding artifacts in busy or highly textured portions of frames and more sensitive in flat or low-textured portions. During video encoding, this texture masking property of the HVS can be exploited by shaping the quantization noise in the video frame based on the texture content in the different parts of the video frame. More specifically, the quantization step size can be increased in highly textured portions, resulting in coarser quantization and a lower bit rate requirement, and can be decreased in low-textured or flat portions to maintain or improve video quality, resulting in finer quantization but a higher bit rate requirement. The human eye will perceive a “noise-shaped” video frame as having better subjective quality than a video frame which has the same amount of noise evenly distributed throughout the video frame.