Over the years, telepresence has become increasingly important in many applications including computer supported collaborative work (CSCW) and entertainment.
Such 3D video processing poses a major technical challenge. First, there is the problem of how a 3D video bitstream should be encoded for efficient processing, communications and storage. Second, there is the problem of extracting and reconstructing real moving 3D objects from the videos. Third, it is desired to render the object from arbitrary viewpoints.
Most prior art 3D video bitstreams are formatted to facilitate off-line post-processing and hence have numerous limitations that makes them less practicable for advanced real-time 3D video processing.
Video Acquisition
There are a variety of known methods for reconstructing objects from 2D videos. These can generally be classified as requiring off-line post-processing methods and real-time methods. The post-processing methods can provide point sampled representations, however, not in real-time.
Spatio-temporal coherence for 3D video processing is used by Vedula et al., “Spatio-temporal view interpolation,” Proceedings of the Thirteenth Eurographics Workshop on Rendering, pp. 65-76, 2002. There, a 3D scene flow for spatio-temporal view interpolation is computed, however, not in real-time.
A dynamic surfel sampling representation for estimation 3D motion and dynamic appearance is also known. However, that system uses a volumetric reconstruction for a small working volume, again, not in real-time, see Carceroni et al., “Multi-View scene capture by surfel sampling: From video streams to non-rigid 3D motion, shape & reflectance,” Proceedings of the 7th International Conference on Computer Vision,” pp. 60-67, 2001. Wurmlin et al., in “3D video recorder,” Proceedings of Pacific Graphics '02, pp. 325-334, 2002, describe a 3D video recorder which stores a spatio-temporal representation in which users can freely navigate.
In contrast to post-processing methods, real-time methods are much more demanding with regard to computational efficiency. Matusik et al., in “Image-based visual hulls,” Proceedings of SIGGRAPH 2000, pp. 369-374, 2000, describe an image-based 3D acquisition system which calculates the visual hull of an object. That method uses epipolar geometry and outputs a view-dependent representation. Their system neither exploits spatio-temporal coherence, nor is it scalable in the number of cameras, see also Matusik et al., “Polyhedral visual hulls for real-time rendering,” Proceedings of Twelfth Eurographics Workshop on Rendering, pp. 115-125, 2001.
Triangular texture-mapped mesh representation are also known, as well as the use of trinocular stereo depth maps from overlapping triples of cameras, again mesh based techniques tend to have performance limitations, making them unsuitable for real-time applications. Some of these problems can be mitigated by special-purpose graphic hardware for real-time depth estimation.
Video Standards
As of now, no standard for dynamic, free viewpoint 3D video objects has been defined. Auxiliary components of the MPEG-4 standard can encode depth maps and disparity information. However, those are not complete 3D representations, and shortcomings and artifacts due to DCT encoding, unrelated texture motion fields, and depth or disparity motion fields still need to be resolved. If the acquisition of the video is done at a different location than the rendering, then bandwidth limitations on the transmission channel are a real concern.
Point Sample Rendering
Although point sampled representations are well known, none can efficiently cope with dynamically changing objects or scenes, see any of the following U.S. Pat. No. 6,509,902, “Texture filtering for surface elements,” U.S. Pat. No. 6,498,607 “Method for generating graphical object represented as surface elements,” U.S. Pat. No. 6,480,190 “Graphical objects represented as surface elements,” U.S. Pat. No. 6,448,968 “Method for rendering graphical objects represented as surface elements,” U.S. Pat. No. 6,396,496 “Method for modeling graphical objects represented as surface elements,” U.S. Pat. No. 6,342,886, “Method for interactively modeling graphical objects with linked and unlinked surface elements.” That work has been extended to include high-quality interactive rendering using splatting and elliptical weighted average filters. Hardware acceleration can be used, but the pre-processing and set-up still limit performance.
Qsplat is a progressive point sample system for representing and displaying a large object. Qsplat uses a view-dependent progressive transmission technique for a multi-resolution rendering system. Static objects are represented by a multi-resolution hierarchy of point samples based on bounding spheres. Splatting is used to render the point samples. Extensive pre-processing is relied on for splat size and shape estimation, see Rusinkiewicz et al., “QSplat: A multi-resolution point rendering system for large meshes,” Proceedings of SIGGRAPH 2000, pp. 343-352, 2000.
Botsch et al. use an octree data structure for storing point sampled geometry, see “Efficient high quality rendering of point sampled geometry,” Proceedings of the 13th Eurographics Workshop on Rendering, pp. 53-64, 2002. Typical data sets can be encoded with less than five bits per point for coding tree connectivity and geometry information. When surface normals and color attributes are included, the bit requirements double or triple. A similar compression performance is achieved by a progressive encoding scheme for isosurfaces using an adaptive octree and fine level placement of surface samples, see Lee et al., “Progressive encoding of complex isosurfaces,” Proceedings of SIGGRAPH 2003, ACM SIGGRAPH, pp. 471-475, July 2003.
Briceno et al. in “Geometry videos,” Proceedings of ACM Symposium on Computer Animation 2003, July 2003, reorganize data from dynamic 3D objects into 2D images. That representation allows video compression techniques to be applied to animated polygon meshes. However, they cannot deal with unconstrained free viewpoint video as described below.
Vedula et al. in “Spatio-temporal view interpolation,” Proceedings of the 13th ACM Eurographics Workshop on Rendering, June 2002, describe a free viewpoint video system based on the computation of a 3D scene flow and spatio-temporal view interpolation. However, they do not address the coding of the 3D scene flow representation.
Another method is described by Wuermlin et al., “3D video fragments: Dynamic point samples for real-time free viewpoint video,” Computers & Graphics 28(1), Special Issue on Coding, Compression and Streaming Techniques for 3D and Multimedia Data, Elsevier Ltd, 2003, also see U.S. patent application Ser. No. 10/624,018, “Differential Stream of Point Samples for Real-Time 3D Video, filed on Jul. 21, 2003, by Würmlin et al., incorporated herein by reference. That method uses point samples as a generalization of 2D video pixels into 3D space. A point sample holds, additionally to its color, a number of geometrical attributes. The geometrical attributes guarantee a one-to-one relation between 3D point samples and foreground pixels in respective 2D video images. That method does not make any assumptions about the shape of the reconstructed object. The problem with that method is that both the encoder and decoder need to maintain a detailed 3D point model. The progressive sampling used in that system increases the complexity of the system, particularly in the decoder.
None of the prior art methods provide an efficient compression framework for an arbitrary viewpoint video of moving 3D objects.
Therefore, there still is a need for encoding multiple videos acquired of moving 3D objects by fixed cameras, and decoding the encoded bitstream for arbitrary viewpoints.