Capturing full 3D video of real human performance has become an active topic in computer vision and graphics. Given a reconstructed geometry sequence, applications such as free-viewpoint video (FVV) have recently been developed that let remote users observe physically realistic motion and appearance from any viewpoint, and that provide an immersive experience when viewed with virtual/augmented reality (VR/AR) hardware. The core technology behind this is capturing the performance with multi-view color cameras, single or multiple depth sensors, or a hybrid combination of the two.
Over the past decade, performance capture has evolved from starting with template models or fully prescanned 3D actors and fitting them over time to the captured sequence, to reconstructing a 4D (spatio-temporal) geometry that evolves in real time during capture. The former approach restricts capture to a particular scene containing only the same template or actor. The latter, referred to as temporal fusion, works with a general scene without any template prior and has attracted more attention from both academia and industry.
Although considerable effort has been devoted to dynamic scene fusion (e.g., DynamicFusion, VolumeDeform, BayesianFusion, Fusion4D), the main focus has been on improving the quality and completeness of the reconstructed model. Because temporal registration of a large scene requires searching for a solution in an extraordinarily large space, the captured performance is usually assumed to be slow-moving and outlier-free (e.g., by using multiple depth sensors and cameras). Even under these assumptions, registration error still accumulates frame by frame and prevents tracking over long periods. After a mesh has been tracked successfully for dozens of frames, some triangles become overly deformed or topological changes occur, and the reference model must be reset. Previous fusion methods therefore prefer the flexible approach of storing an independently reconstructed mesh for each time frame. These meshes are either discarded over time or cached, and caching yields an unstructured sequence that consumes a large amount of bandwidth or memory.
FVV is video that allows a user to change the viewpoint at any time. For example, a user watching a sports video could switch from a perspective behind home plate to a perspective from the outfield. This lets users and viewers see the content from a perspective of their own choosing.