This section is intended to provide a background or context to the invention that is recited in the claims. The description herein may include concepts that could be pursued, but are not necessarily ones that have been previously conceived or pursued. Therefore, unless otherwise indicated herein, what is described in this section is not prior art to the description and claims in this application and is not admitted to be prior art by inclusion in this section.
Video coding standards include ITU-T H.261, ISO/IEC Moving Picture Experts Group (MPEG)-1 Visual, ITU-T H.262 or ISO/IEC MPEG-2 Video, ITU-T H.263, ISO/IEC MPEG-4 Visual and ITU-T H.264 (also known as ISO/IEC MPEG-4 Advanced Video Coding (AVC)). In addition, there have been efforts to develop new video coding standards. One such standard is the scalable video coding (SVC) standard, which is the scalable extension to H.264/AVC. Another such standard, which has just been finalized, is the multiview video coding (MVC) standard, which is a further extension to H.264/AVC.
In multiview video coding, video sequences output from different cameras, each corresponding to different views, are encoded into one bit-stream. After decoding, to display a certain view, the decoded pictures belonging to that view are reconstructed and displayed. It is also possible for more than one view to be reconstructed and displayed.
Multiview video coding has a wide variety of applications, including free-viewpoint video/television, 3D TV, and surveillance applications. Currently, the Joint Video Team (JVT) of the ISO/IEC Moving Picture Experts Group (MPEG) and the ITU-T Video Coding Experts Group is working to develop an MVC standard, which is becoming an extension of H.264/AVC. These standards are referred to herein as MVC and AVC, respectively. The latest working draft of MVC is described in JVT-AB204, “Joint Draft Multi-view Video Coding,” 28th JVT meeting, Hannover, Germany, July 2008, available at ftp3.itu.ch/av-arch/jvt-site/2008_07_Hannover/JVT-AB204.zip.
Besides the features defined in the working draft of MVC, other potential features, particularly those focusing on coding tools, are described in the Joint Multiview Video Model (JMVM). The latest version of JMVM is described in JVT-AA207, “Joint Multiview Video Model (JMVM) 8.0,” 24th JVT meeting, Geneva, Switzerland, April 2008, available at ftp3.itu.ch/av-arch/jvt-site/2008_04_Geneva/JVT-AA207.zip.
FIG. 1 is a representation showing a conventional MVC decoding order (i.e., bitstream order). The decoding order arrangement is referred to as time-first coding. Each access unit is defined to contain the coded pictures of all the views (e.g., S0, S1, S2, . . . ) for one output time instance (e.g., T0, T1, T2, . . . ). It should be noted that the decoding order of access units may not be identical to the output or display order. A conventional MVC prediction structure for multiview video coding (including both inter-picture prediction within each view and inter-view prediction) is shown in FIG. 2. In FIG. 2, predictions are indicated by arrows, with each pointed-to object using the respective pointed-from object as its prediction reference.
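The time-first arrangement described above can be illustrated with a minimal Python sketch. The function name `time_first_order` is hypothetical, and the sketch assumes, for simplicity, that access units appear in output-time order and that views appear in index order within each access unit; as noted above, the actual decoding order of access units may differ from the output order.

```python
def time_first_order(num_views, num_times):
    """Return (time, view) pairs in time-first order.

    All views of one time instance (one access unit) are listed
    before any view of the next time instance.
    Assumption: access units are in output-time order and views are
    in index order within each access unit.
    """
    return [(t, s) for t in range(num_times) for s in range(num_views)]


order = time_first_order(num_views=3, num_times=2)
labels = [f"T{t}/S{s}" for t, s in order]
print(labels)
# ['T0/S0', 'T0/S1', 'T0/S2', 'T1/S0', 'T1/S1', 'T1/S2']
```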
An anchor picture is a coded picture in which all slices reference only slices with the same temporal index, i.e., only slices in other views and not slices in earlier pictures of the current view. An anchor picture is signaled by setting the anchor_pic_flag to 1. After decoding the anchor picture, all following coded pictures in display order can be decoded without inter-prediction from any picture decoded prior to the anchor picture. If a picture in one view is an anchor picture, then all pictures with the same temporal index in other views shall also be anchor pictures. Consequently, the decoding of any view can be started from a temporal index that corresponds to anchor pictures. Pictures with an anchor_pic_flag equal to 0 are referred to as non-anchor pictures.
In the Joint Draft of MVC, view dependencies are specified in the sequence parameter set (SPS) MVC extension. The dependencies for anchor pictures and non-anchor pictures are independently specified. Therefore, anchor pictures and non-anchor pictures can have different view dependencies. However, for the set of pictures that refer to the same SPS, all of the anchor pictures must have the same view dependency, and all of the non-anchor pictures must have the same view dependency. In the SPS MVC extension, dependent views can be signaled separately for the views used as reference pictures in RefPicList0 and RefPicList1. Within one access unit, when view component A directly depends on view component B, this means that view component A uses view component B for inter-view prediction. If view component B directly depends on view component C, and if view component A does not directly depend on view component C, then view component A indirectly depends on view component C.
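The distinction between direct and indirect dependency described above can be sketched as a graph traversal. The following Python sketch is illustrative only: the dictionary `direct` and the function names are hypothetical and are not part of the SPS MVC extension syntax.

```python
def all_dependencies(direct, view):
    """Return every view component that `view` depends on, directly or indirectly."""
    seen = set()
    stack = list(direct.get(view, ()))
    while stack:
        v = stack.pop()
        if v not in seen:
            seen.add(v)
            stack.extend(direct.get(v, ()))
    return seen


def indirect_dependencies(direct, view):
    """Return the view components that `view` depends on only indirectly."""
    return all_dependencies(direct, view) - set(direct.get(view, ()))


# Hypothetical example: A uses B for inter-view prediction, and B uses C.
direct = {"A": {"B"}, "B": {"C"}, "C": set()}
print(indirect_dependencies(direct, "A"))  # {'C'}
```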
In the Joint Draft of MVC, there is also an inter_view_flag in the network abstraction layer (NAL) unit header which indicates whether the current picture is used for inter-view prediction for the pictures in other views. In this draft, inter-view prediction is supported by only texture prediction, i.e., only the reconstructed sample values may be used for inter-view prediction, and only the reconstructed pictures of the same output time instance as the current picture are used for inter-view prediction. After the first byte of the NAL unit (NALU), an NAL unit header extension (3 bytes) follows. The NAL unit header extension includes the syntax elements that describe the properties of the NAL unit in the context of MVC.
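As an illustration of how the 3-byte NAL unit header extension might be unpacked, the following Python sketch assumes the field layout of the MVC header extension as published in H.264 Annex H (a 1-bit svc_extension_flag followed by non_idr_flag, priority_id, view_id, temporal_id, anchor_pic_flag, inter_view_flag and one reserved bit). The draft cited above may differ in detail, and the class and function names are hypothetical.

```python
from typing import NamedTuple


class MvcHeaderExt(NamedTuple):
    svc_extension_flag: int  # 1 bit
    non_idr_flag: int        # 1 bit
    priority_id: int         # 6 bits
    view_id: int             # 10 bits
    temporal_id: int         # 3 bits
    anchor_pic_flag: int     # 1 bit
    inter_view_flag: int     # 1 bit
    reserved_one_bit: int    # 1 bit


def parse_mvc_header_ext(ext: bytes) -> MvcHeaderExt:
    """Unpack the 3 bytes that follow the first byte of the NAL unit.

    Assumed layout (most significant bit first), 24 bits in total.
    """
    if len(ext) != 3:
        raise ValueError("the header extension is expected to be 3 bytes")
    bits = int.from_bytes(ext, "big")
    return MvcHeaderExt(
        svc_extension_flag=(bits >> 23) & 0x1,
        non_idr_flag=(bits >> 22) & 0x1,
        priority_id=(bits >> 16) & 0x3F,
        view_id=(bits >> 6) & 0x3FF,
        temporal_id=(bits >> 3) & 0x7,
        anchor_pic_flag=(bits >> 2) & 0x1,
        inter_view_flag=(bits >> 1) & 0x1,
        reserved_one_bit=bits & 0x1,
    )


# Hypothetical extension bytes: view_id 5, temporal_id 2, anchor picture,
# used for inter-view prediction.
hdr = parse_mvc_header_ext(bytes([0x40, 0x01, 0x57]))
print(hdr.view_id, hdr.anchor_pic_flag, hdr.inter_view_flag)  # 5 1 1
```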
As a coding tool in JMVM, motion skip predicts macroblock (MB) modes and motion vectors from the inter-view reference pictures, and it applies to non-anchor pictures only. During encoding, a global disparity motion vector (GDMV) is estimated when encoding an anchor picture, and the GDMV for each non-anchor picture is then derived as a weighted average of the GDMVs of the two neighboring anchor pictures. A GDMV has 16-pel precision, i.e., for any MB in the current picture (i.e., the picture being encoded or decoded), the corresponding region shifted in an inter-view reference picture according to the GDMV covers exactly one MB in the inter-view reference picture.
In this manner, the GDMV for each non-anchor picture is scaled from the anchor-picture GDMVs. For each MB that utilizes motion skip, a local offset of the disparity motion vector is signaled. At the decoder, if motion skip mode is used, the final disparity motion vector (the GDMV adjusted by the local offset) is used to locate the motion vectors in the inter-view pictures, and those motion vectors are copied from the inter-view pictures.
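The GDMV derivation and the use of the local offset can be sketched as follows. This Python sketch is not the JMVM reference procedure: it assumes that the weights are proportional to temporal distance, that all vectors are expressed in MB (16-pel) units, that the local offset is expressed in the same units, and that rounding is to the nearest MB. All function names are hypothetical.

```python
def interpolate_gdmv(gdmv_prev, gdmv_next, t_prev, t_next, t_cur):
    """Distance-weighted average of two anchor-picture GDMVs, in MB units.

    Assumption: the nearer anchor picture receives the larger weight,
    and the result is rounded to the nearest whole MB (halves round up).
    """
    if not t_prev < t_cur < t_next:
        raise ValueError("t_cur must lie strictly between the two anchor pictures")
    span = t_next - t_prev
    w_prev = t_next - t_cur  # weight of the preceding anchor picture
    w_next = t_cur - t_prev  # weight of the following anchor picture
    return tuple(
        (a * w_prev + b * w_next + span // 2) // span
        for a, b in zip(gdmv_prev, gdmv_next)
    )


def final_disparity(gdmv, local_offset):
    """Add the per-MB local offset to the picture-level GDMV (MB units)."""
    return (gdmv[0] + local_offset[0], gdmv[1] + local_offset[1])


def corresponding_mb(mb_x, mb_y, disparity):
    """Position of the MB in the inter-view picture whose motion is copied."""
    return (mb_x + disparity[0], mb_y + disparity[1])


gdmv = interpolate_gdmv((2, 0), (6, 1), t_prev=0, t_next=8, t_cur=2)
disp = final_disparity(gdmv, local_offset=(-1, 0))
print(gdmv, disp, corresponding_mb(10, 4, disp))  # (3, 0) (2, 0) (12, 4)
```

Because every quantity is a whole number of MBs, the shifted region always coincides with exactly one MB of the inter-view reference picture, which is consistent with the 16-pel precision described above.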
3D video has recently garnered significant interest. Furthermore, with advances in acquisition and display technologies, 3D video is becoming a reality within the consumer domain through a range of applications. Given a certain maturity of capture and display technologies, and with the help of MVC techniques, a number of different envisioned 3D video applications are becoming more feasible. 3D video applications can be generally grouped into three categories: free-viewpoint video; 3D TV (video); and immersive teleconferencing. The requirements of these applications can be quite different, and realizing each type of 3D video application has its own challenges.
When transmitting 3D content based on 2D images, the bandwidth constraint becomes an issue, and thus a powerful compressor is required to code the 3D content with only a reasonable number of views. However, at a client device, for example, a user may require the experience of viewing the 3D content at any angle, e.g., with view navigation or auto-stereoscopic video. Therefore, it is desirable for a decoder to render as many views as possible and to do so as continuously as possible. View synthesis can address this bandwidth constraint by transmitting a reasonable number of views while interpolating other views at the renderer. Within the MPEG video subgroup, Exploration Experiments in 3D Video Coding (3DV EE) are being performed to study a similar application scenario. It is also claimed that having depth map videos for each view is potentially helpful for view synthesis.
Furthermore, MPEG has also specified a format for attaching a depth map to a regular video stream in MPEG-C part 3. This specification is described within “Text of ISO/IEC FDIS 23002-3 Representation of Auxiliary Video and Supplemental Information,” N8768 of ISO/IEC JTC 1/SC 29/WG 11, Marrakech, Morocco, January 2007.
In MPEG-C part 3, a so-called auxiliary video can be either a depth map or a parallax map. A texture video typically consists of three components, namely one luma component Y, and two chroma components U and V, whereas a depth map only has one component representing the distance between an object pixel and the camera. Generally, a texture video is represented in YUV 4:2:0, 4:2:2 or 4:4:4 format, where one chroma sample (U or V) is coded for every 4, 2, or 1 luma samples, respectively. A depth map is regarded as luma-only video in YUV 4:0:0 format. Depth maps can be inter-coded similarly to inter-coded luma-only texture pictures, and hence coded depth maps can have motion vectors. MPEG-C part 3 provides flexibility in how a depth map is represented, including the number of bits used to represent each depth value. The resolution of the depth map can also differ from that of the associated image; for example, it can be ¼ the width and ½ the height of the associated image.
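The sample counts implied by these formats can be worked through with a short Python sketch. The function name, the table `CHROMA_DIVISORS`, and the 16x16 picture size are hypothetical and chosen only for illustration.

```python
# Horizontal and vertical subsampling factors of each chroma plane.
CHROMA_DIVISORS = {"4:2:0": (2, 2), "4:2:2": (2, 1), "4:4:4": (1, 1)}


def samples_per_picture(width, height, fmt):
    """Return (luma_samples, chroma_samples) for one picture.

    chroma_samples counts the U and V planes together; a 4:0:0
    (luma-only) picture, such as a depth map, has no chroma samples.
    """
    luma = width * height
    if fmt == "4:0:0":
        return luma, 0
    dx, dy = CHROMA_DIVISORS[fmt]
    chroma_plane = (width // dx) * (height // dy)
    return luma, 2 * chroma_plane


for fmt in ("4:2:0", "4:2:2", "4:4:4", "4:0:0"):
    print(fmt, samples_per_picture(16, 16, fmt))
# 4:2:0 (256, 128)
# 4:2:2 (256, 256)
# 4:4:4 (256, 512)
# 4:0:0 (256, 0)
```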
It should be noted that the choice of video codec is an application issue, provided that the depth map video can be coded, e.g., as a monochromatic (4:0:0) video. For example, the depth map can be coded as an H.264/AVC bitstream with only a luminance component. Alternatively, the depth map can be coded as an auxiliary video defined in H.264/AVC. In H.264/AVC, auxiliary pictures are coded independently of the primary pictures, and hence there is no prediction between the primary coded pictures for sample values and the auxiliary coded pictures for depth values.
View synthesis for 3D video rendering is improved when depth information for each picture of a view (i.e., depth map video) is provided. Because a depth map video can consume a large part of the bandwidth of an entire bitstream (especially when each view is associated with a depth map), depth map video should be coded efficiently in order to conserve bandwidth.
Generally, and as described above, depth map videos, if present, are coded independently. However, correlations can exist between a texture video and its associated depth map. For example, the motion vectors in a coded depth map and those in a coded texture video could be similar. It can be foreseen that sample prediction between depth map and texture is inefficient and of little use, whereas motion prediction between depth map images and texture images is beneficial.
For multiview video content, MVC is the “state-of-the-art” coding standard. Based on the MVC standard, it is not possible to code depth map videos and texture videos in one MVC bitstream while at the same time enabling motion prediction between depth map images and texture images.