Multiview video encoding and decoding is essential for applications such as three dimensional television (3DTV), free viewpoint television (FTV), and multi-camera surveillance. Multiview video encoding and decoding is also known as dynamic light field compression.
FIG. 1 shows a prior art ‘simulcast’ system 100 for multiview video encoding. Cameras 1-4 acquire sequences of frames or videos 101-104 of a scene 5. Each camera has a different view of the scene. Each video is encoded 111-114 independently to corresponding encoded videos 121-124. That system uses conventional 2D video encoding techniques. Therefore, that system does not exploit correlations between the different videos acquired by the cameras from the different viewpoints while predicting frames of the encoded video. Independent encoding decreases compression efficiency, and thus the required network bandwidth and storage are increased.
FIG. 2 shows a prior art disparity compensated prediction system 200 that does use inter-view correlations. Videos 201-204 are encoded 211-214 to encoded videos 231-234. The videos 201 and 204 are encoded independently using a standard video encoder such as MPEG-2 or H.264, also known as MPEG-4 Part 10. These independently encoded videos are ‘reference’ videos. The remaining videos 202 and 203 are encoded using temporal prediction and inter-view predictions based on reconstructed reference videos 251 and 252 obtained from decoders 221 and 222. Typically, the prediction is determined adaptively on a per block basis, S. C. Chan et al., “The data compression of simplified dynamic light fields,” Proc. IEEE Int. Acoustics, Speech, and Signal Processing Conf., April, 2003.
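The per-block adaptive choice between temporal and inter-view prediction can be illustrated with a minimal sketch. The selection criterion used here, the sum of absolute differences (SAD), and all function names are assumptions made for illustration; the cited reference may use a different cost measure.

```python
def sad(block, reference):
    """Sum of absolute differences between two equally sized blocks."""
    return sum(abs(x - y) for x, y in zip(block, reference))


def choose_prediction(block, temporal_ref, interview_ref):
    """Pick, for one block, the reference with the smaller SAD.

    Blocks are flat lists of pixel values. SAD is an assumed criterion,
    not one stated in the text.
    """
    candidates = {"temporal": temporal_ref, "inter-view": interview_ref}
    mode = min(candidates, key=lambda m: sad(block, candidates[m]))
    # The residual is what would be transformed and entropy encoded.
    residual = [x - y for x, y in zip(block, candidates[mode])]
    return mode, residual
```

Only the chosen mode and the residual would need to be signalled for each block in such a scheme.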
FIG. 3 shows prior art ‘lifting-based’ wavelet decomposition, see W. Sweldens, “The lifting scheme: A custom-design construction of biorthogonal wavelets,” J. Appl. Comp. Harm. Anal., vol. 3, no. 2, pp. 186-200, 1996. Wavelet decomposition is an effective technique for static light field compression. Input samples 301 are split 310 into odd samples 302 and even samples 303. The odd samples are predicted 320 from the even samples. A prediction error forms high band samples 304. The high band samples are used to update 330 the even samples and to form low band samples 305. That decomposition is invertible so that linear or non-linear operations can be incorporated into the prediction and update steps.
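The split, predict, and update steps, and the reason the decomposition remains invertible for arbitrary operators, can be sketched as follows. This is a minimal illustration assuming a Haar-style kernel as the default; the function names and the 0-based convention that positions 0, 2, 4, ... are the ‘even’ samples are assumptions, not part of the cited work.

```python
def haar_predict(even):
    """Predict each odd sample from its even neighbour (Haar-style)."""
    return list(even)


def haar_update(high):
    """Update term added to the even samples (Haar-style)."""
    return [h / 2 for h in high]


def lifting_forward(samples, predict=haar_predict, update=haar_update):
    """Split, predict, update. Returns (low band, high band)."""
    if len(samples) % 2:
        raise ValueError("this sketch expects an even number of samples")
    # Split: positions 0, 2, 4, ... are treated as 'even' (assumption).
    even, odd = samples[0::2], samples[1::2]
    # Predict: the prediction error forms the high band.
    high = [o - p for o, p in zip(odd, predict(even))]
    # Update: the high band adjusts the even samples to form the low band.
    low = [e + u for e, u in zip(even, update(high))]
    return low, high


def lifting_inverse(low, high, predict=haar_predict, update=haar_update):
    """Undo the update, then the prediction, then merge the samples."""
    even = [l - u for l, u in zip(low, update(high))]
    odd = [h + p for h, p in zip(high, predict(even))]
    samples = []
    for e, o in zip(even, odd):
        samples.extend([e, o])
    return samples
```

Because the inverse subtracts exactly what the forward pass added, reconstruction does not depend on the operators being linear. For example, passing `update=lambda high: [h // 2 for h in high]` to both functions gives an integer-to-integer variant that still reconstructs the input exactly.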
The lifting scheme enables a motion-compensated temporal transform, i.e., motion compensated temporal filtering (MCTF) which, for videos, essentially filters along a temporal motion trajectory. A review of MCTF for video coding is described by Ohm et al., “Interframe wavelet coding—motion picture representation for universal scalability,” Signal Processing: Image Communication, vol. 19, no. 9, pp. 877-908, October 2004. The lifting scheme can be based on any wavelet kernel such as Haar or 5/3 Daubechies, and any motion model such as block-based translation or affine global motion, without affecting the reconstruction.
For encoding, the MCTF decomposes the video into high band frames and low band frames. Then, the frames are subjected to spatial transforms to reduce any remaining spatial correlations. The transformed low and high band frames, along with associated motion information, are entropy encoded to form an encoded bitstream. MCTF can be implemented using the lifting scheme shown in FIG. 3 with the temporally adjacent videos as input. In addition, MCTF can be applied recursively to the output low band frames.
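The recursive application of the temporal decomposition to the low band frames can be sketched as below. This is a simplified illustration only: it assumes a Haar-style kernel and zero motion, so no motion compensation is performed, whereas an actual MCTF would filter along the motion trajectory. The spatial transform and entropy coding stages are omitted, and all function names are hypothetical.

```python
def temporal_lift(frames):
    """One temporal lifting level over pairs of adjacent frames.

    Frames are flat lists of pixel values. Zero motion is assumed, so the
    first frame of each pair directly predicts the second.
    """
    low_band, high_band = [], []
    for first, second in zip(frames[0::2], frames[1::2]):
        high = [s - f for f, s in zip(first, second)]    # predict step
        low = [f + h / 2 for f, h in zip(first, high)]   # update step
        high_band.append(high)
        low_band.append(low)
    return low_band, high_band


def temporal_decompose(frames, levels):
    """Apply temporal_lift recursively to the low band frames."""
    low = [list(f) for f in frames]
    high_bands = []
    for _ in range(levels):
        if len(low) < 2 or len(low) % 2:
            break  # this sketch only handles an even number of frames
        low, high = temporal_lift(low)
        high_bands.append(high)
    return low, high_bands


def temporal_reconstruct(low, high_bands):
    """Invert temporal_decompose, starting from the coarsest level."""
    for high in reversed(high_bands):
        frames = []
        for l, h in zip(low, high):
            first = [a - b / 2 for a, b in zip(l, h)]
            second = [a + b for a, b in zip(first, h)]
            frames.extend([first, second])
        low = frames
    return low
```

With four input frames and two levels, the result is a single low band frame plus two high band frames at the first level and one at the second, which also illustrates the temporal scalability of the representation.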
MCTF-based videos have a compression efficiency comparable to that of video compression standards such as H.264/AVC. In addition, the videos have inherent temporal scalability. However, that method cannot be used for directly encoding multiview videos in which there is a correlation between videos acquired from multiple views because there is no efficient method for predicting views that accounts for correlation in time.
The lifting scheme has also been used to encode static light fields, i.e., single multiview images. Rather than performing a motion-compensated temporal filtering, the encoder performs a disparity compensated inter-view filtering (DCVF) across the static views in the spatial domain, see Chang et al., “Inter-view wavelet compression of light fields with disparity compensated lifting,” SPIE Conf. on Visual Communications and Image Processing, 2003. For encoding, DCVF decomposes the static light field into high and low band images, which are then subject to spatial transforms to reduce any remaining spatial correlations. The transformed images, along with the associated disparity information, are entropy encoded to form the encoded bitstream. DCVF is typically implemented using the lifting-based wavelet transform scheme as shown in FIG. 3 with the images acquired from spatially adjacent camera views as input. In addition, DCVF can be applied recursively to the output low band images. DCVF-based static light field compression provides a better compression efficiency than independently coding the multiple frames. However, that method also cannot encode multiview videos in which both temporal correlation and spatial correlation between views are used because there is no efficient method for predicting views that accounts for correlation in time.
In certain applications, the depth signal can be part of the input of the system as shown in FIG. 25. In a ‘simulcast’ system 2500, the depth can either be acquired at the same time as the color video is shot, e.g., using depth cameras 251A, 252A, 253A, and 254A, or be estimated by an offline procedure. Note that the depth is present as an input of the system, and the depth is encoded 2500, 2511, 2511A, 2512, 2512A, 2513, 2513A, 2514, and 2514A, and transmitted as part of bitstreams 2521, 2521A, 2522, 2522A, 2523, 2523A, 2524, and 2524A. The depth encoder may or may not be the same as the color encoder.