The term multiview video coding is used to describe processes that encode video captured by multiple cameras from different viewpoints. The basic approach of most multiview coding schemes is to exploit not only the redundancies that exist temporally between the frames within a given view, but also the similarities between frames of neighboring views. By doing so, a reduction in bit rate relative to independent coding of the views can be achieved without sacrificing the reconstructed video quality. The primary usage scenario for multiview video is to support 3D video applications, where 3D depth perception of a visual scene is provided by a 3D display system. There are many types of 3D display systems, ranging from classic stereo systems that require special-purpose glasses to more sophisticated multiview auto-stereoscopic displays that do not utilize glasses. The stereo systems utilize two views, where a left-eye view is presented to the viewer's left eye, and a right-eye view is presented to the viewer's right eye.
Another application of multiview video is to enable free-viewpoint video. In this scenario, the viewpoint and view direction can be interactively changed. Each output view can either be one of the input views or a virtual view that was generated from a smaller set of multiview inputs and other data that assists in the view generation process. With such a system, viewers can freely navigate through the different viewpoints of the scene.
Multiview video contains a large amount of inter-view statistical dependencies, since all cameras capture the same scene from different viewpoints. Therefore, combined temporal and inter-view predictions can be utilized to more efficiently encode multiview video. Stated another way, a frame from a certain camera can be predicted not only from temporally related frames from video captured by the same camera, but also from frames of video captured at the same time by neighboring cameras. A sample prediction structure is shown in FIG. 1. Frames are not only predicted from temporal references, but also from inter-view references. The prediction is adaptive, so the best predictor among temporal and inter-view references can be selected on a block basis in terms of rate-distortion cost, or a combination of both temporal and inter-view reference can be used for different portions of the video frame.
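The block-level selection described above can be illustrated with a minimal sketch. It assumes the common Lagrangian form of rate-distortion cost, J = D + lambda * R, which the text does not specify; the names Candidate, rd_cost, and select_predictor, as well as the numeric values, are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    """One candidate predictor for a block (hypothetical structure)."""
    ref_kind: str      # "temporal" or "inter_view"
    distortion: float  # e.g. sum of squared differences for the block
    rate_bits: float   # bits needed to code the vector and the residual


def rd_cost(candidate, lam):
    # Assumed Lagrangian cost: J = D + lambda * R.
    return candidate.distortion + lam * candidate.rate_bits


def select_predictor(candidates, lam):
    # Pick the candidate with the lowest rate-distortion cost.
    return min(candidates, key=lambda c: rd_cost(c, lam))


candidates = [
    Candidate("temporal", distortion=1200.0, rate_bits=40.0),
    Candidate("inter_view", distortion=900.0, rate_bits=65.0),
]
best = select_predictor(candidates, lam=10.0)
print(best.ref_kind)
```

With these illustrative numbers the inter-view candidate wins at lambda = 10 (cost 1550 against 1600), while a larger lambda of 20 penalizes its higher rate and favors the temporal candidate (cost 2000 against 2200).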
Multiview Video Coding (MVC, ISO/IEC 14496-10:2008 Amendment 1) is an extension of the H.264/MPEG-4 Advanced Video Coding (AVC) standard that provides efficient coding of multiview video. The basic H.264/MPEG-4 AVC standard covers a Video Coding Layer (VCL) and a Network Abstraction Layer (NAL). While the VCL creates a coded representation of the source content, the NAL formats these data and provides header information in a way that enables simple and effective customization of the use of VCL data for a broad variety of systems.
A coded H.264/MPEG-4 AVC video data stream is organized into NAL units, which are packets that each contain an integer number of bytes. A NAL unit starts with a one-byte indicator of the type of data in the NAL unit. The remaining bytes represent payload data. NAL units are classified into video coding layer (VCL) NAL units, which contain coded data for areas of the frame content (coded slices or slice data partitions), and non-VCL NAL units, which contain associated additional information. The set of consecutive NAL units associated with a single coded frame is referred to as an access unit. A set of consecutive access units with certain properties is referred to as an encoded video sequence. An encoded video sequence (together with the associated parameter sets) represents an independently decodable part of a video bitstream. An encoded video sequence always starts with an instantaneous decoding refresh (IDR) access unit, which signals that the IDR access unit and all access units that follow it in the bitstream can be decoded without decoding any of the frames that preceded it.
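As an illustration of the one-byte type indicator, the following sketch splits the first byte of a NAL unit into its fields. The field layout (a 1-bit forbidden_zero_bit, a 2-bit nal_ref_idc, and a 5-bit nal_unit_type) and the type values used (1 through 5 for coded slice data, 5 for an IDR slice) reflect the H.264/MPEG-4 AVC syntax as understood here and are not stated in the text above; the function names are hypothetical.

```python
def parse_nal_header(first_byte):
    """Split the one-byte NAL unit header into its three fields."""
    forbidden_zero_bit = (first_byte >> 7) & 0x1
    nal_ref_idc = (first_byte >> 5) & 0x3
    nal_unit_type = first_byte & 0x1F
    return forbidden_zero_bit, nal_ref_idc, nal_unit_type


def is_vcl(nal_unit_type):
    # Assumption: types 1-5 carry coded slices or slice data partitions.
    return 1 <= nal_unit_type <= 5


def is_idr(nal_unit_type):
    # Assumption: type 5 is a coded slice of an IDR picture.
    return nal_unit_type == 5


for byte in (0x65, 0x41, 0x67):
    _, ref_idc, nal_type = parse_nal_header(byte)
    print(hex(byte), ref_idc, nal_type, is_vcl(nal_type), is_idr(nal_type))
```

In this sketch, 0x65 decodes to type 5 (an IDR slice), 0x41 to type 1 (a non-IDR slice), and 0x67 to type 7, which falls outside the VCL range and is therefore treated as a non-VCL NAL unit.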
The VCL of H.264/MPEG-4 AVC follows the so-called block-based hybrid video coding approach. Frames are partitioned into slices, which are in turn subdivided into macroblocks. Each slice can be parsed independently of the other slices in the frame. Each macroblock covers a rectangular area of 16×16 luma samples and, in the case of video in 4:2:0 chroma sampling format, 8×8 sample areas of each of the two chroma components. The samples of a macroblock are either spatially or temporally predicted, and the resulting prediction residual signal is represented using transform coding. Depending on the degree of freedom for generating the prediction signal, H.264/MPEG-4 AVC supports three basic slice coding types that specify the types of coding supported for the macroblocks within the slice. An I slice uses intra-frame coding involving spatial prediction from neighboring regions within a frame. A P slice supports both intra-frame coding and inter-frame predictive coding using one signal for each prediction region (i.e. a P slice references one other frame of video). A B slice supports intra-frame coding, inter-frame predictive coding, and also inter-frame bi-predictive coding using two prediction signals that are combined with a weighted average to form the region prediction (i.e. a B slice references two other frames of video). In referencing different types of predictive coding, both inter-frame predictive coding and inter-frame bi-predictive coding can be considered to be forms of inter-frame prediction.
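Two of the quantities mentioned above lend themselves to a short worked sketch: the number of 16×16 macroblocks needed to cover a frame, and the weighted average that combines two prediction signals in bi-predictive coding. The function names, the integer rounding rule, and the sample values are assumptions made for illustration; the sketch also assumes that frame dimensions that are not multiples of 16 are padded up to the next macroblock boundary.

```python
MB_SIZE = 16  # luma samples per macroblock side


def macroblock_grid(width, height):
    """Macroblock columns and rows needed to cover a frame (ceiling division)."""
    mbs_wide = -(-width // MB_SIZE)
    mbs_high = -(-height // MB_SIZE)
    return mbs_wide, mbs_high


def bi_predict(pred0, pred1, w0=1, w1=1):
    """Combine two prediction signals with integer weights, rounding to nearest."""
    total = w0 + w1
    return [(w0 * a + w1 * b + total // 2) // total for a, b in zip(pred0, pred1)]


cols, rows = macroblock_grid(1920, 1080)
print(cols, rows, cols * rows)

print(bi_predict([100, 50], [104, 51]))              # equal weights
print(bi_predict([100, 50], [104, 51], w0=3, w1=1))  # favors the first signal
```

For a 1920×1080 frame this gives a grid of 120 by 68 macroblocks, since 1080 is not a multiple of 16 and the last row is only partly filled with picture content.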
In H.264/MPEG-4 AVC, the coding and display order of frames is completely decoupled. Furthermore, any frame can be used as a reference frame for motion-compensated prediction of subsequent frames, independent of its slice coding type. The behavior of the decoded picture buffer (DPB), which can hold up to 16 frames (depending on the supported conformance point and the decoded frame size), can be adaptively controlled by memory management control operation (MMCO) commands, and the reference frame lists that are used for coding of P or B slices can be arbitrarily constructed from the frames available in the DPB via reference picture list modification (RPLM) commands.
A key aspect of the MVC design extension to the H.264/MPEG-4 AVC standard is that it is mandatory for the compressed multiview stream to include a base view bitstream, which is coded independently from all other views. The video data associated with the base view is encapsulated in NAL units that have previously been defined for 2D video, while the video associated with the additional views is encapsulated in an extension NAL unit type that is used for both scalable video coding (SVC) and multiview video. A flag is specified to distinguish whether the NAL unit is associated with an SVC or MVC bitstream.
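The distinction between base-view NAL units and extension NAL units can be sketched as follows. The concrete values are assumptions drawn from the H.264/MPEG-4 AVC syntax as understood here rather than from the text above: the extension NAL unit type is taken to be 20, and the flag is taken to be the most significant bit of the byte following the one-byte header (1 for SVC, 0 for MVC). The function name classify_nal is hypothetical.

```python
EXTENSION_NAL_TYPE = 20  # assumed value of the shared SVC/MVC extension type


def classify_nal(nal):
    """Classify a NAL unit (bytes, start code removed) by how it is encapsulated."""
    nal_unit_type = nal[0] & 0x1F
    if 1 <= nal_unit_type <= 5:
        # NAL unit types already defined for 2D video carry the base view.
        return "base view"
    if nal_unit_type == EXTENSION_NAL_TYPE:
        # Assumed position of the flag: first bit after the header byte.
        svc_extension_flag = (nal[1] >> 7) & 0x1
        return "SVC extension" if svc_extension_flag else "MVC extension"
    return "other"


print(classify_nal(bytes([0x65, 0x88])))        # type 5
print(classify_nal(bytes([0x74, 0x00, 0x00])))  # type 20, flag 0
print(classify_nal(bytes([0x74, 0x80, 0x00])))  # type 20, flag 1
print(classify_nal(bytes([0x67, 0x42])))        # type 7
```

A decoder that only understands 2D video could use a check of this kind to discard the extension NAL units and decode the base view alone, which is what the mandatory, independently coded base view makes possible.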
Inter-view prediction is a key feature of the MVC design, and it is enabled in a way that makes use of the flexible reference frame management capabilities that are part of H.264/MPEG-4 AVC, by making the decoded frames from other views available in the reference frame lists for use in inter-frame prediction. Specifically, the reference frame lists are maintained for each frame to be decoded in a given view. Each such list is initialized as usual for single-view video, which would include the temporal reference frames that may be used to predict the current frame. Additionally, inter-view reference frames are included in the list and are thereby also made available for prediction of the current frame.
In MVC, inter-view reference frames are contained within the same access unit as the current frame, where an access unit contains all the NAL units pertaining to a certain capture or display time instant (see for example the access units shown in FIG. 1). The MVC design does not allow the prediction of a frame in one view at a given time using a frame from another view at a different time, since doing so would involve inter-view prediction across different access units.
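The reference frame list construction and the same-access-unit constraint can be sketched together. This is a simplified illustration, not the list initialization procedure of the standard: the Frame structure, the ordering of temporal references by temporal distance, and the placement of inter-view references after the temporal ones are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Frame:
    view_id: int
    time: int  # index of the access unit the frame belongs to


def build_reference_list(current, dpb, dependency_views):
    """Temporal references from the same view, then inter-view references."""
    # Temporal references: same view, different access unit.
    temporal = [f for f in dpb
                if f.view_id == current.view_id and f.time != current.time]
    temporal.sort(key=lambda f: abs(f.time - current.time))
    # Inter-view references: another view, but only the SAME access unit.
    inter_view = [f for f in dpb
                  if f.view_id in dependency_views and f.time == current.time]
    return temporal + inter_view


dpb = [Frame(0, 0), Frame(0, 1), Frame(1, 0)]
current = Frame(view_id=1, time=1)
print(build_reference_list(current, dpb, dependency_views=[0]))
```

Here the frame of view 0 at time 0 is left out of the list even though view 1 depends on view 0, because it belongs to a different access unit than the current frame.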
With respect to the encoding of individual slices and macroblocks, the core macroblock-level and lower-level decoding modules of an MVC decoder are the same, regardless of whether a reference frame is a temporal reference or an inter-view reference. This distinction is managed at a higher level of the decoding process.
To achieve access to a particular frame in a given view, the decoder should first determine an appropriate access point. In H.264/MPEG-4 AVC, each IDR frame provides a clean random access point. In the context of MVC, an IDR frame in a given view prohibits the use of temporal prediction for any of the views on which a particular view depends at that particular instant of time; however, inter-view prediction may be used for encoding the non-base views of an IDR frame. This ability to use inter-view prediction for encoding an IDR frame reduces the bit rate needed to encode the non-base views, while still enabling random access at that temporal location in the bitstream. MVC also introduces an additional frame type, referred to as an anchor frame for a view. Anchor frames are similar to IDR frames in that they do not use temporal prediction for the encoding of any view on which a given view depends, although they do allow inter-view prediction from other views within the same access unit (see for example FIG. 1). Moreover, it is prohibited for any frame that follows the anchor frame in both bitstream order and display order to use any frame that precedes the anchor frame in bitstream order as a reference for inter-frame prediction, and for any frame that precedes the anchor frame in decoding order to follow it in display order. This provides a clean random access point for access to a given view.
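Determining an appropriate access point can be sketched as a search for the latest IDR or anchor access unit at or before the target frame. The function name, the use of access unit indices as time values, and the assumption that the list of access points is already sorted are hypothetical simplifications.

```python
import bisect


def find_access_point(access_point_times, target_time):
    """Return the latest IDR/anchor time at or before target_time, or None."""
    # access_point_times is assumed to be sorted in increasing order.
    i = bisect.bisect_right(access_point_times, target_time)
    if i == 0:
        return None  # no random access point precedes the target
    return access_point_times[i - 1]


anchors = [0, 8, 16, 24]  # illustrative access unit indices
print(find_access_point(anchors, 13))
print(find_access_point(anchors, 16))
```

In this sketch, a request for the frame at time 13 starts decoding at the access point at time 8, and a request for time 16 starts at 16 itself. Which frames between the access point and the target must actually be decoded depends on the prediction structure and is not modeled here.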
Many cameras, including cameras in mobile phone handsets, support geotagging of captured still and video images using geographic information obtained from a Global Positioning System (GPS) receiver and other sensors such as accelerometers and magnetometers. Geotagging is the process of adding geographical identification metadata to media. The geotag metadata usually includes latitude and longitude coordinates, though a geotag can also include altitude, bearing, distance, tilt, accuracy data, and place names. Geotags can be associated with a video sequence and/or with individual frames within the video sequence.
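One possible in-memory representation of such metadata is sketched below. The Geotag class, its field names and units, and the coordinate values are hypothetical and serve only to show required coordinates, optional fields, and the association of geotags with either a whole sequence or individual frames.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Geotag:
    latitude: float                      # degrees, -90 to 90
    longitude: float                     # degrees, -180 to 180
    altitude_m: Optional[float] = None   # optional fields default to None
    bearing_deg: Optional[float] = None
    place_name: Optional[str] = None

    def __post_init__(self):
        if not -90.0 <= self.latitude <= 90.0:
            raise ValueError("latitude out of range")
        if not -180.0 <= self.longitude <= 180.0:
            raise ValueError("longitude out of range")


# A geotag for the whole sequence, and geotags keyed by frame index.
sequence_tag = Geotag(47.6062, -122.3321, place_name="Seattle")
frame_tags = {
    0: Geotag(47.6062, -122.3321, bearing_deg=90.0),
    30: Geotag(47.6063, -122.3320, bearing_deg=92.5),
}
print(sequence_tag.place_name, frame_tags[30].bearing_deg)
```

Keying per-frame geotags by frame index keeps the sequence-level tag and the frame-level tags independent, so a sequence can carry either kind or both.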