This section is intended to provide a background or context to the invention that is recited in the claims. The description herein may include concepts that could be pursued, but are not necessarily ones that have been previously conceived or pursued. Therefore, unless otherwise indicated herein, what is described in this section is not prior art to the description and claims in this application and is not admitted to be prior art by inclusion in this section.
In multi-view video coding, video sequences output from different cameras, each corresponding to different views of a scene, are encoded into one bitstream. After decoding, to display a certain view, the decoded pictures belonging to that view are reconstructed and displayed. It is also possible for more than one view to be reconstructed and displayed.
Multiview video coding has a wide variety of applications, including free-viewpoint video/television, three-dimensional (3D) TV and surveillance applications. Currently, the Joint Video Team (JVT) of the International Organization for Standardization (ISO)/International Electrotechnical Commission (IEC) Moving Picture Experts Group (MPEG) and the International Telecommunication Union (ITU)-T Video Coding Experts Group (VCEG) is working to develop a multiview video coding (MVC) standard, which is becoming an extension of the ITU-T H.264 standard, also known as ISO/IEC MPEG-4 Part 10 Advanced Video Coding (AVC). These draft standards are referred to herein as MVC and AVC, respectively. The latest draft of the MVC standard is described in JVT-T208, “Joint Multiview Video Model (JMVM) 1.0”, 20th JVT meeting, Klagenfurt, Austria, July 2006, which can be found at ftp3.itu.ch/av-arch/jvt-site/2006_07_Klagenfurt/JVT-T208.zip and is incorporated herein by reference in its entirety.
In JMVM 1.0, for each group of pictures (GOP), pictures of any view are contiguous in decoding order. This is depicted in FIG. 1, where the horizontal direction denotes time (with each time instant being represented by Tm) and the vertical direction denotes view (with each view being represented by Sn). Pictures of each view are grouped into GOPs, e.g. pictures T1 to T8 in FIG. 1 for each view form a GOP. This decoding order arrangement is referred to as view-first coding. It should be noted that, although the pictures in one view and in one GOP are contiguous in decoding order, with no other pictures inserted between any two of them, their decoding order relative to one another may vary.
It is also possible to have a different decoding order than that discussed for view-first coding. For example, pictures can be arranged such that pictures of any temporal location are contiguous in decoding order. This arrangement is shown in FIG. 2 and is referred to as time-first coding. It should also be noted that the decoding order of columns (T0, T1, etc.) may not be identical to the temporal order.
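The difference between the two arrangements can be illustrated with a minimal sketch. The function names and the (view, time) tuple representation are hypothetical and are not taken from JMVM 1.0. For simplicity, the sketch assumes that views and time instants are visited in their natural index order, whereas, as noted above, the actual decoding order within a view or across columns may differ.

```python
from typing import List, Tuple

# A picture is identified here by a (view index n of Sn, time index m of Tm) pair.
Picture = Tuple[int, int]


def view_first_order(num_views: int, gop_length: int) -> List[Picture]:
    """All pictures of one view in the GOP are contiguous (outer loop over views)."""
    return [(s, t) for s in range(num_views) for t in range(1, gop_length + 1)]


def time_first_order(num_views: int, gop_length: int) -> List[Picture]:
    """All pictures of one time instant are contiguous (outer loop over time)."""
    return [(s, t) for t in range(1, gop_length + 1) for s in range(num_views)]


if __name__ == "__main__":
    print(view_first_order(2, 3))  # [(0, 1), (0, 2), (0, 3), (1, 1), (1, 2), (1, 3)]
    print(time_first_order(2, 3))  # [(0, 1), (1, 1), (0, 2), (1, 2), (0, 3), (1, 3)]
```

Both functions enumerate the same set of pictures; only the nesting of the two loops, and hence the grouping in decoding order, differs.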
A typical prediction structure (including both inter-picture prediction within each view and inter-view prediction) for multi-view video coding is shown in FIG. 3, where predictions are indicated by arrows, and the pointed-to object uses the pointed-from object for prediction reference. For views that share the same sequence parameter set (SPS), JMVM 1.0 provides the dependencies among views in an MVC SPS extension.
According to JMVM 1.0, given an MVC bitstream, for any view to be displayed, the pictures of that view and of all other views on which it directly or indirectly relies must be fully decoded and reconstructed. In this situation, “View A directly depends on view B” means that at least one picture in view B is used by a picture in view A for inter-view prediction. If “View A indirectly depends on view C,” this means that no picture in view C is used by any picture in view A for inter-view prediction, but View A cannot be correctly decoded without View C. For example, if view A directly depends on view B and view B directly depends on view C, then view A indirectly depends on view C. These relationships result in significant decoding processing capability requirements, which in turn result in high decoder implementation complexity and power consumption.
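The set of views that must be fully decoded in order to display a given view is the transitive closure of the direct dependency relation. The following is a minimal sketch of that computation. The function name and the dictionary layout are hypothetical, and the dependencies are supplied by hand rather than parsed from an MVC SPS extension.

```python
from typing import Dict, Set


def views_to_decode(target: str, direct_deps: Dict[str, Set[str]]) -> Set[str]:
    """Return the target view plus every view it directly or indirectly depends on."""
    needed: Set[str] = set()
    stack = [target]
    while stack:
        view = stack.pop()
        if view in needed:
            continue  # already visited; also guards against revisiting shared references
        needed.add(view)
        # Follow the direct dependencies of this view; reaching a view only
        # through another view corresponds to an indirect dependency.
        stack.extend(direct_deps.get(view, ()))
    return needed


if __name__ == "__main__":
    # View A directly depends on B, and B directly depends on C,
    # so A indirectly depends on C.
    deps = {"A": {"B"}, "B": {"C"}, "C": set()}
    print(sorted(views_to_decode("A", deps)))  # ['A', 'B', 'C']
```

In this example, displaying view A alone still requires decoding all three views, which illustrates why the dependency chain drives the processing requirements.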
In addition to the above, when the number of views is large, both for time-first and view-first coding, the buffer size required for storing pictures used for inter-view prediction or temporal prediction becomes quite large. For example, when a hierarchical B GOP structure (the coding structure used in the time dimension in FIG. 3) is used in both the time dimension and the view dimension, for view-first coding, the required buffer size is equal to

number_of_views + GOP_length * (1 + log2(number_of_views)) + log2(GOP_length)

In the above equation, “GOP_length” is the length of the GOP in number of pictures. When “GOP_length” is equal to 16 and “number_of_views” is equal to 17, the required buffer size is 101, in units of decoded frames.
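The buffer-size expression above can be checked numerically. The sketch below assumes that each base-2 logarithm is rounded down to an integer. The text does not state this rounding, but it is the reading under which the quoted figure of 101 is reproduced, since log2(17) is then taken as 4. The function name is hypothetical.

```python
def view_first_buffer_size(number_of_views: int, gop_length: int) -> int:
    """Evaluate the view-first buffer-size expression, in units of decoded frames.

    Assumption: log2 terms are rounded down (floor), which is not stated in the text.
    """
    # For a positive integer n, n.bit_length() - 1 equals floor(log2(n)) exactly.
    log2_views = number_of_views.bit_length() - 1
    log2_gop = gop_length.bit_length() - 1
    return number_of_views + gop_length * (1 + log2_views) + log2_gop


if __name__ == "__main__":
    # 17 + 16 * (1 + 4) + 4 = 101
    print(view_first_buffer_size(17, 16))  # 101
```

The dominant term is GOP_length * (1 + log2(number_of_views)), so under this expression the requirement grows with both the GOP length and the number of views.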
As factors such as complexity, power consumption, and buffer size increase, the cost of devices capable of supporting multi-view decoding rises accordingly. These costs become especially prohibitive for mobile devices, where space constraints inevitably result in still higher component costs. It would therefore be desirable to provide an arrangement in which these complexities can be reduced efficiently.