The recently released photo-realistic CGI movie Beowulf provides an impressive foretaste of how many movies will be produced and displayed in the future (Paramount 2007). In contrast to previous animated movies, the goal was not a cartoon-style appearance but a photo-realistic display of the virtual sets and actors. Creating authentic virtual doubles of real-world actors still takes a tremendous effort. One of the biggest remaining challenges is to capture human performances, i.e., the motion and possibly the dynamic geometry of actors in the real world, in order to map them onto virtual doubles. To measure body and facial motion, the studios resort to marker-based optical motion capture technology. Although this delivers data of high accuracy, it has a number of limitations. Marker-based motion capture requires a significant setup time, expects subjects to wear unnatural skin-tight clothing with optical beacons, and often necessitates many hours of manual data cleanup. As a result, the studios are unable to capture human performances densely in space and time, that is, to accurately capture the dynamic shape, motion and textural appearance of actors in arbitrary everyday apparel.
Many recent motion capture algorithms have largely focused on capturing sub-elements of the sophisticated scene representation that is the subject of reconstruction. Marker-based optical motion capture systems are the workhorses in many game and movie production companies for measuring the motion of real performers. The high accuracy of this approach comes at the price of restrictive capturing conditions and the typical requirement that subjects wear skin-tight body suits and reflective markers; such conditions make it infeasible to capture shape and texture. Other researchers have attempted to overcome these conditions by using several hundred markers to extract a model of human skin deformation. While their animation results are very convincing, manual mark-up and data cleanup times can be tremendous in such a setting, and generalization to normally dressed subjects is difficult. By definition, such marker-based approaches require the scene to be modified through the burdensome addition of markers.
Marker-less motion capture approaches are designed to overcome some restrictions of marker-based techniques and enable performance recording without optical scene modification. Although such approaches are more flexible than the intrusive (marker-based) methods, they have difficulty achieving the same level of accuracy and the same application range. Furthermore, since such approaches typically employ kinematic body models, it is hard to capture the motion, let alone the detailed shape, of people in loose everyday apparel. Some methods try to capture more detailed body deformations in addition to skeletal joint parameters by adapting the models more closely to the observed silhouettes, or by using captured range scans. However, such algorithms generally require the subjects to wear tight clothes. Only a few approaches aim at capturing humans wearing more general attire, for example, by jointly relying on kinematic body and cloth models. Unfortunately, these methods typically require handcrafting of shape and dynamics for each individual piece of apparel, and they focus on joint parameter estimation under occlusion rather than accurate geometry capture. Other related work explicitly reconstructs highly accurate geometry of moving cloth from video. Such methods also visually interfere with the scene, in the form of specially tailored color patterns on each garment, which impedes simultaneous acquisition of shape and texture.
A slightly more focused but related concept of performance capture is put forward by 3D video methods, which aim at rendering the appearance of reconstructed real-world scenes from new synthetic camera views never seen by any real camera. Early shape-from-silhouette methods reconstruct rather coarse approximate 3D video geometry by intersecting multi-view silhouette cones. Although these methods are computationally efficient, the moderate quality of their textured coarse scene reconstructions often falls short of production standards in the movie and game industry. To boost 3D video quality, researchers experimented with image-based methods, multi-view stereo, multi-view stereo with active illumination, or model-based free-viewpoint video capture. The first three methods do not deliver spatio-temporally coherent geometry or 360-degree shape models, which are both essential prerequisites for animation post-processing. At the same time, previous kinematic model-based 3D video methods were not well suited to capturing performers in general clothing. Data-driven 3D video methods synthesize novel perspectives by a pixel-wise blending of densely sampled input viewpoints. While even renderings under new lighting can be produced at high fidelity, the complex acquisition apparatus, which requires hundreds of densely spaced cameras, often makes practical application difficult. Moreover, the lack of geometry makes subsequent editing a major challenge.
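As a rough illustration of the silhouette-cone intersection mentioned above, the following minimal sketch carves a set of candidate voxel centres against binary silhouettes. It is not the implementation of any of the cited methods: the function name `carve_visual_hull`, the voxel discretization, the assumption of calibrated 3x4 projection matrices, and the toy affine cameras in the demo are all hypothetical choices made for illustration.

```python
import numpy as np

def carve_visual_hull(silhouettes, projections, grid_points):
    """Return a boolean mask of grid points that lie inside every silhouette cone.

    silhouettes: list of 2D boolean arrays (True = foreground), one per camera
    projections: list of 3x4 projection matrices (assumed to be calibrated)
    grid_points: (N, 3) array of candidate voxel centres in world coordinates
    """
    n = len(grid_points)
    homogeneous = np.hstack([grid_points, np.ones((n, 1))])
    inside = np.ones(n, dtype=bool)
    for mask, P in zip(silhouettes, projections):
        uvw = homogeneous @ P.T                      # project into the image plane
        in_front = uvw[:, 2] > 1e-9                  # discard points behind the camera
        safe_w = np.where(in_front, uvw[:, 2], 1.0)  # avoid division by zero
        u = np.rint(uvw[:, 0] / safe_w).astype(int)  # pixel column
        v = np.rint(uvw[:, 1] / safe_w).astype(int)  # pixel row
        height, width = mask.shape
        visible = in_front & (u >= 0) & (u < width) & (v >= 0) & (v < height)
        hit = np.zeros(n, dtype=bool)
        hit[visible] = mask[v[visible], u[visible]]
        inside &= hit                                # intersect the silhouette cones
    return inside

# Toy demo (assumed setup): two affine cameras looking along the z and x axes.
axis = np.arange(4)
grid_demo = np.stack(np.meshgrid(axis, axis, axis, indexing="ij"), axis=-1).reshape(-1, 3)
square = np.zeros((4, 4), dtype=bool)
square[1:3, 1:3] = True
cam_xy = np.array([[1.0, 0, 0, 0], [0, 1.0, 0, 0], [0, 0, 0, 1.0]])  # u = x, v = y
cam_zy = np.array([[0, 0, 1.0, 0], [0, 1.0, 0, 0], [0, 0, 0, 1.0]])  # u = z, v = y
inside_demo = carve_visual_hull([square, square], [cam_xy, cam_zy], grid_demo.astype(float))
```

Because a voxel survives only if it projects into the foreground of every view, the result is at best as fine as the grid and can never recover concavities that no silhouette reveals, which is consistent with the coarse geometry noted above.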
More recently, animation design, animation editing, deformation transfer and animation capture methods have been proposed that are no longer based on skeletal shape and motion parameterizations but instead rely on surface models and general shape deformation approaches. Abandoning kinematic parameterizations makes performance capture a much harder problem.
Similarly, certain other approaches enable mesh-based motion capture from video; they involve the generation of a 3D (or volumetric) deformable model, and they also employ laser-scanned models and a more basic shape deformation framework. Another recent line of work is based on animation reconstruction methods that jointly perform model generation and deformation capture from scanner data. However, their problem setting is different and computationally very challenging, which makes it hard for them to generate results of high visual quality. Other approaches have proposed techniques that are able to deform mesh models into active scanner data or visual hulls, respectively.