Multiview video encoding is essential for applications such as 3D television (3DTV), free viewpoint television (FTV), and multi-camera surveillance. Multiview video encoding is also known as dynamic light field compression.
FIG. 1 shows a prior art ‘simulcast’ system 100 for multiview video encoding. Cameras 1-4 acquire ‘views’ 101-104 of a scene, where the input views from each camera are typically time synchronized. The views are encoded 111-114 independently to corresponding encoded views 121-124. That system uses conventional 2D video encoding techniques. However, that system does not exploit correlations between the different camera views. Independent encoding decreases compression efficiency, and thus increases the required network bandwidth and storage. For a large number of cameras, exploiting inter-view correlation would greatly increase the efficiency of a multiview encoder.
FIG. 2 shows a prior art disparity compensated prediction system 200 that uses inter-view correlations. Views 201-204 are encoded 211-214 to encoded views 231-234. The views 201 and 204 are encoded independently using a standard video encoder such as MPEG-2 or H.264. These independently encoded views are ‘reference’ views. The remaining views 202-203 are encoded using temporal prediction and inter-view predictions based on reconstructed reference views 251-252 obtained from decoders 221-222. Typically, the prediction is determined adaptively on a per block basis, see S. C. Chan, K. T. Ng, Z. F. Gan, K. L. Chan, and H.-Y. Shum, “The data compression of simplified dynamic light fields,” Proc. IEEE Int. Acoustics, Speech, and Signal Processing Conf., April 2003.
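The per-block adaptive choice between temporal and inter-view prediction can be illustrated with a minimal sketch. The selection criterion used here, the sum of absolute differences (SAD), and the names `sad` and `choose_prediction` are assumptions made for illustration; the cited prior art may use a different criterion.

```python
def sad(a, b):
    """Sum of absolute differences between two equally sized blocks (lists of rows)."""
    return sum(abs(x - y) for ra, rb in zip(a, b) for x, y in zip(ra, rb))


def choose_prediction(block, temporal_pred, interview_pred):
    """Pick, for one block, the candidate prediction with the smaller SAD.

    Assumption: SAD is the selection criterion. Returns the chosen mode
    and the residual (block minus chosen prediction) that would be encoded.
    """
    candidates = {"temporal": temporal_pred, "inter-view": interview_pred}
    mode = min(candidates, key=lambda m: sad(block, candidates[m]))
    pred = candidates[mode]
    residual = [[x - p for x, p in zip(rb, rp)] for rb, rp in zip(block, pred)]
    return mode, residual
```

For example, a block that changes little over time but is partly occluded in the reference view would select the temporal candidate under this criterion.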
FIG. 3 shows a prior art ‘lifting-based’ wavelet decomposition, see W. Sweldens, “The Lifting Scheme: A Custom-Design Construction of Biorthogonal Wavelets,” J. Appl. Comp. Harm. Anal., vol. 3, no. 2, pp. 186-200, 1996. Wavelet decomposition is an effective technique for static light field compression. Input samples 301 are split 310 into odd samples 302 and even samples 303. The odd samples are predicted 320 from the even samples. The prediction error forms high band samples 304. The high band samples are used to update 330 the even samples to form low band samples 305. That decomposition is invertible by construction, so linear or non-linear operations can be incorporated into the prediction and update steps without losing invertibility.
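The split, predict, and update steps of FIG. 3 can be sketched as follows. This is a minimal illustration and not the implementation of any particular encoder; the function names and the `predict` and `update` callables are hypothetical, and the input is assumed to have even length.

```python
def lifting_forward(samples, predict, update):
    """One lifting step: split (310), predict (320), update (330)."""
    even = samples[0::2]  # even-indexed samples (303)
    odd = samples[1::2]   # odd-indexed samples (302)
    # High band (304): prediction error of the odd samples given the even ones.
    high = [o - predict(e) for o, e in zip(odd, even)]
    # Low band (305): even samples updated with the high band.
    low = [e + update(h) for e, h in zip(even, high)]
    return low, high


def lifting_inverse(low, high, predict, update):
    """Undo the steps in reverse order; exact for any predict/update."""
    even = [l - update(h) for l, h in zip(low, high)]
    odd = [h + predict(e) for h, e in zip(high, even)]
    samples = []
    for e, o in zip(even, odd):
        samples.extend([e, o])
    return samples


# Haar-like kernel with a non-linear (floor) update on integers.
def haar_predict(e):
    return e


def haar_update(h):
    return h // 2
```

Because the inverse subtracts exactly what the forward step added, the round trip reproduces the input even though `haar_update` rounds, which is the sense in which non-linear operations do not break invertibility.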
The lifting scheme enables a motion-compensated temporal transform, i.e., motion compensated temporal filtering (MCTF), which for videos essentially filters along a temporal motion trajectory. A review of MCTF for video coding is given by Ohm et al., “Interframe wavelet coding—motion picture representation for universal scalability,” Signal Processing: Image Communication, Vol. 19, No. 9, pp. 877-908, October 2004. The lifting scheme can be based on any wavelet kernel such as Haar or 5/3 Daubechies, and any motion model such as block-based translation or affine global motion, without affecting perfect reconstruction.
For encoding, the MCTF decomposes the video into high band frames and low band frames, which are then subjected to spatial transforms to reduce any remaining spatial correlations. The transformed low and high band frames, along with associated motion information, are entropy coded to form an encoded bitstream. MCTF can be implemented with the lifting scheme shown in FIG. 3, with temporally adjacent video frames as input. In addition, MCTF can be applied recursively to the output low band frames.
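One level of MCTF over pairs of temporally adjacent frames can be sketched with the same lifting structure. The sketch assumes a Haar-like kernel and a deliberately simple motion model, a global circular horizontal shift per frame pair; the names `shift`, `combine`, `mctf_forward`, and `mctf_inverse` are hypothetical, and the spatial transform and entropy coding stages are omitted.

```python
def shift(frame, dx):
    """Assumed motion model: circular horizontal shift of every row by dx pixels."""
    width = len(frame[0])
    return [[row[(x - dx) % width] for x in range(width)] for row in frame]


def combine(a, b, op):
    """Apply op element-wise to two equally sized frames (lists of rows)."""
    return [[op(x, y) for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def mctf_forward(frames, motion):
    """motion[k] is the assumed shift from frame 2k to frame 2k+1."""
    low, high = [], []
    for k, dx in enumerate(motion):
        even, odd = frames[2 * k], frames[2 * k + 1]
        # Predict the odd frame from the motion-compensated even frame.
        h = combine(odd, shift(even, dx), lambda o, p: o - p)
        # Update the even frame with the high band warped back onto it.
        l = combine(even, shift(h, -dx), lambda e, b: e + b // 2)
        low.append(l)
        high.append(h)
    return low, high


def mctf_inverse(low, high, motion):
    """Reverse the update and predict steps to recover the frames exactly."""
    frames = []
    for l, h, dx in zip(low, high, motion):
        even = combine(l, shift(h, -dx), lambda a, b: a - b // 2)
        odd = combine(h, shift(even, dx), lambda a, p: a + p)
        frames.extend([even, odd])
    return frames
```

When the motion estimate matches the true motion, the high band frame is all zeros and the low band frame equals the even frame; with a poor estimate the high band carries more energy, but the inverse still recovers the frames exactly. Calling `mctf_forward` again on the returned low band frames would give the recursive decomposition.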
MCTF-based video coding has a compression efficiency comparable to that of video compression standards such as H.264/AVC. In addition, the encoded videos have inherent temporal scalability. However, that method cannot be applied directly to multiview video coding in which a correlation between the multiple views is exploited because there is no efficient method for predicting views that also accounts for correlation in time.
The lifting scheme has also been used to encode static light fields, i.e., single multiview images. Rather than performing a motion-compensated temporal filtering, the encoder performs a disparity compensated inter-view filtering (DCVF) across the static views in the spatial domain, see Chang et al., “Inter-view wavelet compression of light fields with disparity compensated lifting,” SPIE Conf. on Visual Communications and Image Processing, 2003.
For encoding, DCVF decomposes the static light field into high and low band images, which are then subjected to spatial transforms to reduce any remaining spatial correlations. The transformed images, along with the associated disparity information, are entropy encoded to form the encoded bitstream. DCVF is typically implemented using the lifting-based wavelet transform scheme as shown in FIG. 3 with the images from spatially adjacent camera views as input. In addition, DCVF can be applied recursively to the output low band images. DCVF-based static light field compression provides a better compression efficiency than independently coding the multiple images. However, that method cannot be applied directly to multiview video encoding in which both the temporal correlation and correlation between views are exploited because there is no efficient method for predicting views that also accounts for correlation in time.
Therefore, there is a need for a compression method that exploits both temporal and inter-view correlations in multiview videos using wavelet transforms.