The disclosure relates generally to an apparatus and a method for multiview video coding.
Multiview video coding (MVC) is a technique for efficiently compressing view sequences captured simultaneously from multiple cameras, referred to as the multiview (MV) source, using a single video stream. The images of the views captured at a single temporal location may be referred to as an MV image. The MV image may be defined as a group including a base view and one or more spatially referenced views. Since the views in the sequences include portions of the same scene, the view sequences include many temporal and spatial statistical dependencies. The MV sequences thus may include many temporally and spatially referenced views. Consequently, an MV image may be efficiently encoded by determining the relative motion between a reference image, e.g., the base view, and the temporally and spatially referenced views. The relative motion may be expressed as a motion shift and represented as a motion vector, for example. Multiview video coding is also applicable to coding free viewpoint video streams.
FIG. 1 is a schematic representation of a multiview source including three cameras configured to capture views of a scene including objects A and B. Each camera comprises an image plane (100, 110, 120), an optical axis (102, 112, 122) perpendicular to the respective image plane, and a view angle (104, 114, 124). At the intersection of the image plane and the optical axis is the optical center of the camera. The optical centers are at different positions (location and/or orientation) in space relative to an orthogonal coordinate system having coordinates X, Y, and Z.
FIG. 2 is a graph illustrating the relationship between the images of a multiview video stream in space and time. Three points in time (1, 2, 3) are depicted on the horizontal axis of the graph. Three layers are depicted on the vertical axis to represent the views captured by the cameras discussed with reference to FIG. 1. The base layer includes images named base 1, base 2, and base 3. Spatially dependent layer 1 includes images named, respectively, spatially dependent images 1,1, 1,2, and 1,3. Spatially dependent layer 2 includes images named, respectively, spatially dependent images 2,1, 2,2, and 2,3. Exhaustive macroblock (MB) searches can be conducted to determine the shifts needed to encode the three views. The arrows connecting the images represent comparisons performed in the exhaustive search. For example, line 202 represents a comparison between base image 3 and spatially dependent image 1,3 to determine an inter-view motion shift between them. The shift is referred to as an “inter-view motion shift” because it is measured between images that are related to the same temporal location or point in time; an intra-view motion shift, by contrast, is the motion shift between images that are not all from the same temporal location or point in time. Line 204 represents a comparison between spatially dependent image 1,2 and spatially dependent image 1,3 to determine a temporal shift between them (i.e., an intra-view motion shift). The purpose of the exhaustive MB search is to match macroblocks in the reference image and the dependent image, where the best match is the one exhibiting the least distortion. The exhaustive MB search uses a number of search modes, reference frames, variable MB sizes, directions of prediction, and search ranges to find the best match.
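By way of illustration only, the exhaustive search for a single macroblock may be sketched as follows. The sketch assumes grayscale images stored as lists of rows, a single fixed macroblock size, one reference image, and the sum of absolute differences (SAD) as the distortion measure; the function names `sad` and `full_search` and all parameter values are hypothetical and are not taken from the disclosure. The same routine would apply to an inter-view comparison (such as line 202) or an intra-view comparison (such as line 204), depending only on which image is supplied as the reference.

```python
def sad(ref, cur, rx, ry, cx, cy, size):
    """Sum of absolute differences between the block of `ref` whose top-left
    corner is (rx, ry) and the block of `cur` whose top-left corner is (cx, cy)."""
    total = 0
    for dy in range(size):
        for dx in range(size):
            total += abs(ref[ry + dy][rx + dx] - cur[cy + dy][cx + dx])
    return total


def full_search(ref, cur, cx, cy, size=16, search_range=8):
    """Test every candidate shift within +/- search_range pixels and return
    (best_shift, best_distortion) for the macroblock of `cur` at (cx, cy)."""
    height, width = len(ref), len(ref[0])
    best_shift, best_cost = None, None
    for sy in range(-search_range, search_range + 1):
        for sx in range(-search_range, search_range + 1):
            rx, ry = cx + sx, cy + sy
            # Skip candidates whose block would fall outside the reference image.
            if rx < 0 or ry < 0 or rx + size > width or ry + size > height:
                continue
            cost = sad(ref, cur, rx, ry, cx, cy, size)
            # Keep the candidate exhibiting the least distortion so far.
            if best_cost is None or cost < best_cost:
                best_shift, best_cost = (sx, sy), cost
    return best_shift, best_cost


def pixel(x, y):
    # Synthetic test pattern (an assumption made only for this example).
    return (7 * x + 13 * y) % 251


# Reference image, and a dependent image whose content is displaced by (3, 2).
reference = [[pixel(x, y) for x in range(32)] for y in range(32)]
dependent = [[pixel(x + 3, y + 2) for x in range(32)] for y in range(32)]

shift, distortion = full_search(reference, dependent, cx=8, cy=8,
                                size=8, search_range=4)
print(shift, distortion)
```

Every candidate position in the search window is evaluated, which is what makes the search exhaustive. A practical encoder would repeat this for each search mode, reference frame, macroblock size, and prediction direction.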
Thus, for an encoder that supports two layers, the exhaustive search will be performed at least twice (at least once for an inter-view motion (or spatial) shift and at least once for an intra-view motion (or temporal) shift) for each macroblock in the dependent views. Because multiview video produces a large amount of data, the exhaustive MB search is very complex and expensive in terms of processor cycles and time, and therefore power. This expense limits the ability to perform multiview encoding in real time, particularly on battery-powered devices.
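The scale of this cost may be appreciated from a rough count of the operations involved. The following sketch is an estimate under stated assumptions only: a 1280x720 dependent view, 16x16 macroblocks, a search range of +/-16 pixels, one reference image per search, and two searches per macroblock (one inter-view and one intra-view). None of these values is taken from the disclosure, and the function name `exhaustive_search_cost` is hypothetical.

```python
def exhaustive_search_cost(width, height, mb_size=16, search_range=16,
                           searches_per_mb=2):
    """Rough operation count for exhaustively searching one dependent image.

    Returns (macroblocks, candidates_per_search, block_comparisons,
    pixel_differences). Multiple reference frames, variable macroblock
    sizes, and additional search modes are ignored, so the real cost of
    an exhaustive search would be higher.
    """
    macroblocks = (width // mb_size) * (height // mb_size)
    candidates = (2 * search_range + 1) ** 2
    block_comparisons = macroblocks * candidates * searches_per_mb
    pixel_differences = block_comparisons * mb_size * mb_size
    return macroblocks, candidates, block_comparisons, pixel_differences


mbs, cands, comparisons, pixel_ops = exhaustive_search_cost(1280, 720)
print(mbs, cands, comparisons, pixel_ops)
# prints: 3600 1089 7840800 2007244800
```

Under these assumptions, a single dependent image requires roughly two billion pixel differences, and that figure is multiplied by the frame rate and by the number of dependent views.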
A need exists to reduce multiview video encoding costs, including time, processing and power, and to do so while retaining good visual quality of the resulting multiview video.