Image-based rendering (IBR) was introduced in the pioneering work of Levoy and Hanrahan [LH96] and Gortler et al. [GGSC96]. The basic goal is simple: IBR strives to create a sense of a 3D real-world scene based on captured image data. Many subsequent works have explored the theoretical foundations, e.g., the dependency between geometry and images with respect to a minimal sampling requirement [CCST00], or developed more efficient and less restrictive implementations [BBM*01]. One important general insight from these works is that a sufficiently accurate geometric proxy of the scene considerably reduces the number of required input images.
A small number of input views is an important prerequisite for applying IBR in real-world environments and applications. One prominent example is sports broadcasting, where we observe a growing demand for free-viewpoint replay for scene analysis. However, for these and most other non-studio applications, IBR should ideally work with existing infrastructure such as manually operated TV cameras. This raises the fundamental question of how to robustly generate a sufficiently accurate geometric proxy despite wide-baseline cameras, uncontrolled acquisition conditions, low texture quality and resolution, and inaccurate camera calibration. These problems become even more severe when processing video sequences instead of still images. Under these challenging real-world conditions, classical 3D reconstruction techniques such as visual hulls [MBR*00] or multi-view stereo [Mid09] are generally inapplicable. Because of these difficulties, one of the currently most popular approaches in this field is still the use of simple planar billboards [HS06], despite unavoidable visual artifacts such as ghosting.
A variety of different 3D representations and rendering methods exists that use images or videos as a source. Most of them are tightly connected to particular acquisition setups:
If many cameras with different viewpoints are available, the light field [LH96] of the scene can be computed, which represents the radiance as a function of position and direction. Buehler et al. [BBM*01] generalize this approach to include geometric proxies. The Eye-Vision system used for the Super Bowl [Eye09] uses more than 30 controlled cameras for replays of sports events. The method by Reche et al. [RMD04] for trees requires 20-30 images per object. A recent approach by Mahajan et al. [MHM*09] uses gradient-based view interpolation. In contrast to these methods, our method does not require a dense camera placement.
Many methods additionally use range data or depth estimation in their representation. Shade et al. [SGwHS98] use estimated depth information for rendering with layered depth images. Waschbüsch et al. [WWG07] use colour and depth to compute 3D video billboard clouds that allow high-quality renderings from arbitrary viewpoints. Pekelny and Gotsman [PG08] use a single depth sensor for reconstructing the geometry of an articulated character. These methods require either depth data or accurate and dense silhouettes, neither of which is available in uncontrolled scenes with only a few video cameras and weak calibration.
Several methods for template-based silhouette matching have been proposed for controlled studio setups [CTMS03,VBMP08,dAST*08]. For free-viewpoint rendering, the camera images are blended onto the surface of a matched or deformed template model. However, these methods require accurate source images from studio setups, whereas articulated billboards can be used with sparsely placed and inaccurately calibrated cameras. In these situations, the geometry of articulated billboards is much more robust against errors than, e.g., a full template body model, where the texture has to be projected accurately onto curved and often thin parts (e.g., an arm). Moreover, the highly tessellated 3D template models that are generally required are not efficient for rendering the often small subjects with low texture quality and resolution. Debevec et al. [DTM96] proposed a method that uses stereo correspondence with a simple 3D model. However, it applies to architecture and does not extend in a straightforward way to articulated figures, which lack straight lines.
Recently, improved methods for visual hulls, namely the conservative visual hull and the view-dependent visual hull, have shown promising results [GTH*07,KSHG07]. However, these methods are based on volume carving, which requires selected camera positions to remove non-body parts on all sides of the subject. Our method does not require a special camera setup and can already be used with only two source cameras to show, e.g., a bird's eye perspective from a viewpoint above the positions of all cameras. Recent work by Guillemaut et al. [GKH09] addresses many challenges for free-viewpoint video in sports broadcasting by jointly optimizing scene segmentation and multi-view reconstruction. Their approach leads to a more accurate geometry than the visual hull, but still requires a fairly large number of quite densely placed cameras (6-12). We compare our method to their reconstruction results in Section 7.
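To illustrate why volume carving depends on camera coverage, the following is a minimal toy sketch, not the implementation of any of the cited methods. It assumes a tiny voxel grid and orthographic views along the coordinate axes; the function name `carve` and the silhouette encoding are hypothetical choices made for this illustration only. A voxel survives only if it projects into the foreground of every available silhouette, so fewer views leave more non-body volume uncarved.

```python
def carve(size, silhouettes):
    """Toy visual-hull carving on a size^3 voxel grid.

    silhouettes maps a viewing axis (0, 1 or 2) to the set of 2D
    foreground pixels seen when projecting orthographically along
    that axis (the pixel is the voxel with that coordinate dropped).
    Both the grid and the orthographic cameras are assumptions of
    this sketch, not properties of the cited systems.
    """
    kept = set()
    for x in range(size):
        for y in range(size):
            for z in range(size):
                voxel = (x, y, z)
                inside = True
                for axis, mask in silhouettes.items():
                    # Orthographic projection: drop the viewing axis.
                    pixel = tuple(c for i, c in enumerate(voxel) if i != axis)
                    if pixel not in mask:
                        inside = False
                        break
                if inside:
                    kept.add(voxel)
    return kept


# Hypothetical object: voxels (0, 0, 0) and (1, 1, 0) in a 2x2x2 grid.
silhouette_along_z = {(0, 0), (1, 1)}  # (x, y) foreground pixels
silhouette_along_x = {(0, 0), (1, 0)}  # (y, z) foreground pixels

one_view = carve(2, {2: silhouette_along_z})
two_views = carve(2, {2: silhouette_along_z, 0: silhouette_along_x})
print(len(one_view), len(two_views))
```

In this toy setting, the single view cannot constrain the extent along its own viewing direction, so the carved result contains more voxels than the object; the second view removes the excess. With real perspective cameras and noisy silhouettes the same effect appears as non-body volume wherever no camera observes the subject from a suitable side.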
A simple method for uncontrolled setups is to blend between billboards [HS06] per subject and camera. However, such standard billboards suffer from ghosting artifacts and do not preserve the 3D body pose of a person due to their planar representation. The idea of subdividing the body into parts represented by billboards is similar in spirit to the billboard cloud representation [DDS03,BCF*05], microfacets [YSK*02,GM03] or subdivision into impostors [ABB*07,ABT99]. However, these methods are not suited for our target application, since they rely on controlled scenes, depth data or even given models. Lee et al. [LBDGG05] proposed a method to extract billboards from optical flow. However, they used high-quality input images generated from synthetic models.
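As a rough illustration of per-camera billboard blending, the sketch below computes one blending weight per source camera from the angle between the virtual viewing direction and each camera's viewing direction. The inverse-angle weighting and the name `blend_weights` are assumptions chosen for this example; the source does not specify the weighting used in [HS06].

```python
import math


def blend_weights(virtual_dir, camera_dirs, eps=1e-6):
    """Return normalized blending weights, one per source camera.

    Directions are non-zero 3D viewing directions of any length.
    Cameras whose direction is angularly closer to the virtual view
    receive larger weights (inverse-angle weighting, an assumption
    of this sketch).
    """
    def normalize(v):
        n = math.sqrt(sum(c * c for c in v))
        return tuple(c / n for c in v)

    v = normalize(virtual_dir)
    raw = []
    for d in camera_dirs:
        d = normalize(d)
        # Clamp to guard acos against rounding slightly outside [-1, 1].
        cos_a = max(-1.0, min(1.0, sum(a * b for a, b in zip(v, d))))
        angle = math.acos(cos_a)
        raw.append(1.0 / (angle + eps))
    total = sum(raw)
    return [w / total for w in raw]


# Virtual view halfway between two cameras: both contribute equally.
print(blend_weights((0, 0, 1), [(1, 0, 1), (-1, 0, 1)]))
```

When the virtual view lies between two cameras, both planar billboards contribute with similar weight. If the two billboards do not align in the virtual view, the blended result shows the subject twice, which is the ghosting artifact mentioned above.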
Also related to our approach is the large body of work on human pose estimation and body segmentation from images. Here, we can only discuss the most relevant works. Efros et al. [EBMM03] have presented an interesting approach for recognizing human action at a distance with applications to pose estimation. Their method requires an estimate of the optical scene flow, which is often difficult to obtain in dynamic and uncontrolled environments. Agarwal and Triggs [AT06], Jaeggli et al. [JKMG07], and Gammeter et al. [GEJ*08] present learning-based methods for 3D human pose estimation and tracking. However, the computed poses are often only approximations, whereas we require accurate estimates of the subject's joint positions. Moreover, we generally have to deal with a much lower image quality and resolution in our setting. We therefore present a semi-automatic, data-driven approach, since a restricted amount of user interaction is acceptable in many application scenarios if it leads to a considerable improvement in quality.