There is an extensive literature on (and commercial solutions for) estimating skeleton proxies from marker sets. Since MoSh does not use a skeleton, these methods are not reviewed here. Instead, the focus is on several key themes in the literature that relate more directly to the present work: fitting models to sparse markers, dense marker sets, and surface capture.
From Markers to Models: To get body shape from sparse markers, one needs a model of body shape to constrain the problem. There have been several previous approaches.
ALLEN, B., CURLESS, B., AND POPOVIC, Z. 2003. The space of human body shapes: Reconstruction and parameterization from range scans. ACM Trans. Graph. (Proc. SIGGRAPH) 22, 3, 587-594, learn a model of body shape variation in a fixed pose from 3D training scans.
ANGUELOV, D., SRINIVASAN, P., KOLLER, D., THRUN, S., RODGERS, J., AND DAVIS, J. 2005. SCAPE: Shape Completion and Animation of People. ACM Trans. Graph. (Proc. SIGGRAPH) 24, 3, 408-416, go further to learn a model that captures both body shape and non-rigid pose deformation.
Allen et al. show that one can approximately recover an unknown 3D human shape from a sparse set of 74 landmarks. They do this only for a fixed pose since their model does not represent pose variation. Importantly, the landmarks are perfect and known; that is, they have the 3D points on the mesh they want to recover and do not need to estimate their location on the mesh. Unlike MoSh, this does not address the problem of estimating body shape and pose from mocap markers alone.
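To make the fixed-pose, known-landmark setting concrete, the following is a minimal sketch assuming a generic linear shape space (mean shape plus a few shape directions) fit by ridge-regularized least squares. It is not the procedure of Allen et al.; the function name fit_shape_from_landmarks, the random synthetic model, and all parameter values are hypothetical and chosen only for illustration.

```python
import numpy as np

def fit_shape_from_landmarks(mean_shape, basis, landmark_idx, landmarks, reg=1e-6):
    """Estimate linear shape coefficients from sparse 3D landmarks.

    mean_shape:   (V, 3) mean vertex positions (hypothetical model)
    basis:        (K, V, 3) shape directions (hypothetical model)
    landmark_idx: (L,) vertex indices at which landmarks are known exactly
    landmarks:    (L, 3) observed landmark positions
    """
    K = basis.shape[0]
    # Restrict the model to the landmark vertices: one column per coefficient.
    A = basis[:, landmark_idx, :].reshape(K, -1).T        # (3L, K)
    b = (landmarks - mean_shape[landmark_idx]).ravel()    # (3L,)
    # Ridge-regularized normal equations.
    coeffs = np.linalg.solve(A.T @ A + reg * np.eye(K), A.T @ b)
    # Reconstruct the full surface from the estimated coefficients.
    full_shape = mean_shape + np.tensordot(coeffs, basis, axes=1)
    return coeffs, full_shape

# Synthetic demonstration with a random stand-in model (not real body data).
rng = np.random.default_rng(0)
V, K, L = 500, 5, 74
mean_shape = rng.normal(size=(V, 3))
basis = rng.normal(size=(K, V, 3))
true_coeffs = rng.normal(size=K)
true_shape = mean_shape + np.tensordot(true_coeffs, basis, axes=1)

landmark_idx = rng.choice(V, size=L, replace=False)
landmarks = true_shape[landmark_idx]   # perfect landmarks at known vertices

est_coeffs, recon = fit_shape_from_landmarks(mean_shape, basis, landmark_idx, landmarks)
print("max coefficient error:", np.abs(est_coeffs - true_coeffs).max())
```

The sketch works only because the landmark-to-vertex correspondence is given and the pose is fixed; with mocap markers neither assumption holds, which is the gap MoSh addresses.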
Anguelov et al. [2005] show how to animate a SCAPE model from motion capture markers. Their method requires a 3D scan of the subject with the markers on their body. This scan is used for two purposes. First it is used to estimate the 3D shape model of the person; this shape is then held fixed. Second the scanned markers are used to establish correspondence between the scan and the mocap markers. These limitations mean that the approach cannot work on archival mocap data and that a user needs both a 3D body scanner and a mocap system.
It is important to note that Anguelov et al. did not solve the problem addressed by MoSh. They fit a SCAPE model to a 3D body scan (what they call shape completion) and, with known marker locations, animate the model from mocap markers. The present invention goes beyond their work to estimate the body shape from only the sparse mocap markers, without the use of any scan and without knowing the precise location of the markers on the body. This is done by simultaneously solving for the marker locations, the shape of the body and the pose using a single objective function and optimization method. Unlike [Anguelov et al. 2005], MoSh is fully automatic and applicable to archival data.
The present invention also goes beyond previous work to define new marker sets and evaluate the effect of these on reconstruction accuracy. This provides a guide for practitioners to choose appropriate marker sets.
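The idea of solving for marker locations, shape, and pose under one objective can be illustrated with a deliberately small toy problem. The sketch below is an assumption-laden 2D stand-in, not the MoSh objective and not a SCAPE model: the "body" is a set of points that vary linearly with a single shape parameter, each marker has an unknown offset from its attachment point, and each frame has a rigid pose (rotation plus translation). The residual layout, the offset prior weight, and the use of scipy.optimize.least_squares are all illustrative choices.

```python
import numpy as np
from scipy.optimize import least_squares

rng = np.random.default_rng(0)
N_MARKERS, N_FRAMES = 8, 5

# Hypothetical toy 2D "body": attachment points depend linearly on one shape parameter.
mean_pts = rng.normal(size=(N_MARKERS, 2))
shape_dir = rng.normal(size=(N_MARKERS, 2))

def rot(theta):
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s], [s, c]])

def unpack(x):
    beta = x[0]                                             # shape parameter
    offsets = x[1:1 + 2 * N_MARKERS].reshape(N_MARKERS, 2)  # latent marker offsets
    poses = x[1 + 2 * N_MARKERS:].reshape(N_FRAMES, 3)      # per frame: theta, tx, ty
    return beta, offsets, poses

def predict(beta, offsets, poses):
    local = mean_pts + beta * shape_dir + offsets           # markers in the body frame
    return np.stack([local @ rot(th).T + np.array([tx, ty]) for th, tx, ty in poses])

def residuals(x, observed, offset_weight=3.0):
    beta, offsets, poses = unpack(x)
    data = (predict(beta, offsets, poses) - observed).ravel()  # marker data term
    prior = offset_weight * offsets.ravel()                    # keeps offsets small
    return np.concatenate([data, prior])

# Synthetic observations generated from known parameters plus a little noise.
true_beta = 1.5
true_offsets = 0.02 * rng.normal(size=(N_MARKERS, 2))
true_poses = np.column_stack([rng.uniform(-0.5, 0.5, N_FRAMES),
                              rng.normal(size=(N_FRAMES, 2))])
observed = predict(true_beta, true_offsets, true_poses)
observed = observed + 0.005 * rng.normal(size=observed.shape)

# One objective, one optimizer, all unknowns updated together.
x0 = np.zeros(1 + 2 * N_MARKERS + 3 * N_FRAMES)
initial_cost = 0.5 * np.sum(residuals(x0, observed) ** 2)
result = least_squares(residuals, x0, args=(observed,))
est_beta, est_offsets, est_poses = unpack(result.x)
print("initial cost:", initial_cost, "final cost:", result.cost)
print("estimated shape parameter:", est_beta)
```

The prior on the offsets matters in this toy: without it, marker offsets could absorb any shape change, so shape and marker placement would not be separately identifiable from the observations.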
Dynamics of Soft Tissue: Unlike MoSh, the above work does not address the capture of soft tissue motion. Interestingly, much of the attention paid to soft-tissue motion in the mocap community (particularly within biomechanics) actually focuses on minimizing the effects of soft tissue dynamics, as disclosed in LEARDINI, A., CHIARI, L., CROCE, U. D., AND CAPPOZZO, A. 2005. Human movement analysis using stereophotogrammetry: Part 3. soft tissue artifact assessment and compensation. Gait & Posture 21, 2, 212-225. Soft tissue motion means the markers move relative to the bones and this reduces the accuracy of the estimated skeletal models. For animation, it is argued that such soft tissue motions are actually critical to making a character look alive.
Dense Marker Sets: To capture soft-tissue motion, previous work has used large, dense, marker sets. PARK, S. I., AND HODGINS, J. K. 2006. Capturing and animating skin deformation in human motion. ACM Trans. Graph. (Proc. SIGGRAPH) 25, 3 (July), 881-889, use 350 markers to recover skin deformation; in the process, they deform a subject-specific model to the markers and estimate missing marker locations. In PARK, S. I., AND HODGINS, J. K. 2008. Data-driven modeling of skin and muscle deformation. ACM Trans. Graph. (Proc. SIGGRAPH) 27, 3 (August), 96:1-96:6, they use a large (400-450) marker set for ≈10,000 frames of activity to create a subject-specific model; this model can then be used to recover pose for the same subject in later sessions with a sparse marker set. In these works, the authors visualize soft-tissue deformations on characters resembling the mocap actor. Here soft-tissue deformations are transferred to more stylized characters.
HONG, Q. Y., PARK, S. I., AND HODGINS, J. K. 2010. A data-driven segmentation for the shoulder complex. Computer Graphics Forum 29, 2, 537-544, use 200 markers on the shoulder complex and a data driven approach to infer a model of shoulder articulation. While dense markers can capture rich shape and deformation information, they are not practical for many applications. Placing the markers is time consuming and a large number of markers may limit movement. With these large sets, additional challenges emerge in dealing with inevitable occlusions and marker identification.
Recent work captures skin deformations using a dense set of markers or patterns painted on the body, like BOGO, F., ROMERO, J., LOPER, M., AND BLACK, M. J. 2014. FAUST: Dataset and evaluation for 3D mesh registration. In Proceedings IEEE Conf. on Computer Vision and Pattern Recognition (CVPR) or NEUMANN, T., VARANASI, K., HASLER, N., WACKER, M., MAGNOR, M., AND THEOBALT, C. 2013. Capture and statistical modeling of arm-muscle deformations. Computer Graphics Forum 32, 2 (May), 285-294. The work is similar to Park and Hodgins but uses computer vision methods rather than standard mocap markers.
The present invention differs in that it conforms to standard mocap practice and is backwards-compatible with existing sparse marker sets. The goal of MoSh is to get more out of sparse markers.
Surface Capture: At the other extreme from sparse markers are methods that capture full 3D meshes at every time instant, like DE AGUIAR, E., STOLL, C., THEOBALT, C., AHMED, N., SEIDEL, H.-P., AND THRUN, S. 2008. Performance capture from sparse multi-view video. ACM Trans. Graph. (Proc. SIGGRAPH) 27, 3 (August), 98:1-98:10 or STARCK, J., AND HILTON, A. 2007. Surface capture for performance-based animation. IEEE Computer Graphics and Applications 27, 3, 21-31; this can be conceived of as a very dense marker set. Still other methods use a scan of the person and then deform it throughout a sequence, like DE AGUIAR, E., THEOBALT, C., STOLL, C., AND SEIDEL, H.-P. 2007. Marker-less deformable mesh tracking for human shape and motion capture. In Proceedings IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), 1-8 or LIU, Y., GALL, J., STOLL, C., DAI, Q., SEIDEL, H.-P., AND THEOBALT, C. 2013. Markerless motion capture of multiple characters using multiview image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence 35, 11, 2720-2735.
Existing methods for surface capture rely on multi-camera computer vision algorithms that are computationally expensive compared with commercial marker-based systems. These methods are most applicable to capturing complex surfaces like clothing or breathing that are difficult to parametrize, like TSOLI, A., MAHMOOD, N., AND BLACK, M. J. 2014. Breathing life into shape: Capturing, modeling and animating 3D human breathing. ACM Trans. Graph., (Proc. SIGGRAPH) 33, 4 (July), 52:1-52:11. In the case of body shape, it is found that, together with a parametric body model, a small marker set is already very powerful.
DE AGUIAR, E., ZAYER, R., THEOBALT, C., SEIDEL, H. P., AND MAGNOR, M. 2007. A simple framework for natural animation of digitized models. In Computer Graphics and Image Processing, 2007. SIBGRAPI 2007. XX Brazilian Symposium on, 3-10, in a related approach, use an intermediate template that is animated in a traditional way from mocap markers. They then transfer the template motion to a more complex mesh. Like MoSh, this method is motivated by standard practice, but it still works indirectly through a crude proxy rather than solving directly for shape and pose from markers.
Attribute Capture: The idea that markers contain information about body shape is not new. LIVNE, M., SIGAL, L., TROJE, N., AND FLEET, D. 2012. Human attributes from 3D pose tracking. Computer Vision and Image Understanding 116, 5, 648-660, use motion capture data to extract socially meaningful attributes, such as gender, age, mental state and personality traits by applying 3D pose tracking to human motion. This work shows that a sparse marker set contains rich information about people and their bodies. MoSh takes a different approach by using the sparse marker data to extract faithful 3D body shape. Like Livne et al., it is shown that gender can be estimated from markers. Beyond this, it is suspected that the full 3D body model can be used to extract additional attributes.
Motion Magnification: There has been recent work on magnifying small motions in video sequences, like WANG, H., XU, N., RASKAR, R., AND AHUJA, N. 2007. Videoshop: A new framework for spatio-temporal video editing in gradient domain. Graph. Models 69, 1, 57-70; WU, H.-Y., RUBINSTEIN, M., SHIH, E., GUTTAG, J., DURAND, F., AND FREEMAN, W. T. 2012. Eulerian video magnification for revealing subtle changes in the world. ACM Trans. Graph. (Proc. SIGGRAPH) 31, 4 (July), 65:1-65:8; or WADHWA, N., RUBINSTEIN, M., DURAND, F., AND FREEMAN, W. T. 2013. Phase-based video motion processing. ACM Trans. Graph., (Proc. SIGGRAPH) 32, 4 (July), 80:1-80:10; but less work on magnifying 3D motions.
In part this may be because capturing 3D surface motions is difficult. Other work exaggerates mocap skeletal motions using mocap data, like KWON, J.-Y., AND LEE, I.-K. 2007. Rubber-like exaggeration for character animation. In Proceedings of the 15th Pacific Conference on Computer Graphics and Applications, IEEE Computer Society, Washington, D.C., USA, PG '07, 18-26.
NEUMANN, T., VARANASI, K., WENGER, S., WACKER, M., MAGNOR, M., AND THEOBALT, C. 2013. Sparse localized deformation components. ACM Trans. Graph. 32, 6 (November), 179:1-179:10 develop methods for spatially localized modeling of deformations and show that these deformations can be edited and exaggerated.
JAIN, A., THORMAHLEN, T., SEIDEL, H.-P., AND THEOBALT, C. 2010. MovieReshape: Tracking and reshaping of humans in videos. ACM Transactions on Graphics (Proc. SIGGRAPH) 29, 6 (December), 148:1-148:10 edit body shape to exaggerate it but do not model or amplify non-rigid soft-tissue dynamics. While the exaggeration of facial motion has received some attention, this is the first work to use only sparse marker sets to extract full-body soft tissue motion for exaggeration.
In summary, MoSh occupies a unique position—it estimates 3D body shape and deformation using existing mocap marker sets. MoSh produces animated bodies directly from mocap markers with a realism that would be time consuming to achieve with standard rigging and skeleton-based methods.