H.264, also referred to as Moving Picture Experts Group-4 Advanced Video Coding (MPEG-4 AVC), is the state of the art video coding standard. It is a hybrid codec which takes advantages of eliminating redundancy between frames and within one frame. The output of the encoding process is Video Coding Layer (VCL) data which is further encapsulated into Network Abstraction Layer (NAL) units prior to transmission or storage. It is a hybrid video coding standard that uses a number of compression technologies that give good compression efficiency.
H.264 is block-based, i.e. a video frame is processed in units of MacroBlocks (MB) which is a 16×16 block of pixels that may be further divided into sub-macroblocks. In order to minimize the amount of data to be coded, a technology called Motion Compensation (MC) is applied on each non-intra block which uses previously reconstructed pixel values in neighboring frames to predict the pixel values of the current block at its best effort. To get a prediction for the current block, reference to an area that is similar to current block in the reference frame is signaled in the bitstream. Final reconstruction can be made by adding the predicted pixel value together with a residual pixel value. In order to find a best match of current coding block in the reference frame, motion search is usually done at the encoder side. It tries to find lowest Sum of Squared Differences (SSD) of Sum of Absolute Differences (SAD) between the current block and possible reference blocks. The outcome of the motion search is a reference frame index signaling which reference frame it refers to and an offset vector called Motion Vector (MV) pointing to the reference area.
There are three types of slices in H.264: I, P and B slices. An I slice contains only data that is coded on its own without referencing any other frames. A P slice contains uni-directional predicted MBs that are referencing respective single areas in a respective other frame. A B slice may contain blocks that refer to reconstructed pixels in I or P slices, or other B slices. Besides that, a B slice may also contain bi-directional predicted MBs where the prediction consists of multiple components that are obtained from different reference areas. Typically the prediction is made by averaging a forward reference and a backward reference. Weighted prediction is a special type of bi prediction where the reference components do not have equal weights. It can provide significant benefits in special cases, such as fade-in scene.
In today's 3D video representations, one of the commonly used formats is “texture+depth”. The texture video represents the actual video texture while the depth map contains all the depth information related to the texture representation. Using view synthesis algorithms, arbitrary number of views can be synthesized from a texture+depth format which can be used in either stereo or autostereoscopic applications. A depth map is usually a grey scale image where the luminance values indicate the distances between the camera and the objects. It can be used together with texture video to create another view. One commonly used type of depth map has the property that the closer the object to the camera, the higher the luminance value is.
Restricted by the bit depth, a depth map only has limited value range. For a bit depth of 8 bits, there can be maximum 256 steps of luminance values. These are far less than enough to represent all the range of real scenes since they can range from clouds at nearly infinity or an ant in front of camera lens. If one considers luminance value 0 as infinity and luminance value 255 as the closest scene the camera can capture, the quantization error will be too big whereas the precision is lost. Fortunately, in real scenarios, a video does not usually focus on both a book close by and a mountain far away. Therefore one can properly assign the limited 256 steps to a local range of interest according to the properties of the scene. To such depth ranges, two parameters are defined. Znear indicates the closest object that can be resolved by a depth value. It typically has luminance value of 255. All the scenes that have a distance between 0 and Znear from the camera are treated as having the depth Znear thus have 255 as their luminance number. Similarly, Zfar indicates the farthest object that can be resolved by depth value. It has luminance value 0. All the scenes that have a distance between Zfar and infinity from the camera are treated as having the depth value Zfar thus have 0 as luminance value. The depths z in-between Znear and Zfar are given by equation 1 below wherein d represents luminance:
                    z        =                  1                                                    d                255                            ⁢                              (                                                      1                                          z                      near                                                        -                                      1                                          z                      far                                                                      )                                      +                          1                              z                far                                                                        (        1        )            
Depth maps are required at the 3D client side to render 3D effects so they are transmitted in the 3D bitstream. To save transmission bandwidth, it is desirable to compress it as much as possible. As of today, there is no dedicated video codec for coding depth map. Normal “texture” video codecs like H.264 are typically used.
Depth dipping plane parameters, i.e. Znear and Zfar, are also transmitted together with depth map as key parameters to define the luminance-depth transform formula (equation 1) presented above. The depth clipping planes can be changing frame by frame based on the scene the camera is shooting. For instance, when a camera is zooming in, it is likely both Znear and Zfar are decreased to suit better the content. Sometimes even for a static scene, Znear and Zfar are modified in order to make special effect the content producer wants to create.
Encoding and decoding of depth maps and in particular predictive encoding and decoding of depth maps can run into problems especially when there are significant changes in the depth clipping plane parameter values between a current frame to be encoded or decoded and reference frames.