If there is one thing that instantly characterizes humans, it is their faces. Hereinafter, faces of the same person are said to have identical ‘identities’ or ‘appearances’, no matter what the pose, age, or health of the face is. That is, the overall appearance of the face uniquely makes that face recognizable as being a certain person, even as the person ages over the years. Similarly, faces of different appearances, i.e., different individuals, can have the same ‘expression’, for example, smiling, angry, laughing, sad, tense, sneering, serious, quizzical, frowning, scowling, snarling, etc. That is, even though faces of different persons have distinctly different overall appearances, humans can easily recognize when a person is smiling or crying.
Although humans can readily distinguish the subtle differences between faces having different appearances and expressions, generating realistic and convincing facial animation is an extremely difficult and time-intensive process that requires highly detailed models and skillful animators.
The dominant approach is to vary a three-dimensional geometrical model with a basic set of deformations. Generating these models, adapting the models to target characters, and controlling the models are all major bottlenecks in the production process.
It is well known that variation in faces can be approximated by linear subspaces of low dimensions, whether the source of variation is appearance (the “identity” of a person's face), pose, expression, or shading pattern; see Sirovich et al., “Low dimensional procedure for the characterization of human faces,” Journal of the Optical Society of America A 4, pp. 519–524, 1987, and Penev et al., “The global dimensionality of face space,” Proc. 4th Int'l Conf. Automatic Face and Gesture Recognition, IEEE CS, pp. 264–270, 2000.
The estimation and exploitation of these linear subspaces account for a large part of the prior art, notably Li et al., “3-D motion estimation in model-based facial image coding,” IEEE Trans. PAMI 15, 6, pp. 545–555, Jun. 1993, DeCarlo et al., “The integration of optical flow and deformable models with applications to human face shape and motion estimation,” Proceedings, CVPR96, pp. 231–238, 1996, Bascle et al., “Separability of pose and expression in facial tracking and animation,” Proc. ICCV, pp. 323–328, 1998, and Bregler et al., “Recovering non-rigid 3D shape from image streams,” Proc. CVPR, 2000.
In computer graphics, these subspaces, known as morphable models, are a mainstay of character animation and video rewrite; see Blanz et al., “A morphable model for the synthesis of 3D faces,” Proc. SIGGRAPH99, 1999, and Pighin et al., “Synthesizing realistic facial expressions from photographs,” Proceedings of the 25th annual conference on Computer graphics and interactive techniques, ACM Press, pp. 75–84, 1998.
Morphable appearance models are well suited for adding 3D shape and texture information to 2D images, while morphable expression models can be used for tracking and performance animation.
In consideration of the needs of animators, there have been many attempts to combine identity and expression spaces by adapting a morphable expression model to a new person.
However, such models can produce unnatural or insufficiently varied results because the models graft the expressions of the original subject, modeled as deformations of a neutral facial geometry, onto the geometry of another face.
As stated by Blanz et al. 2003, “We ignore the slight variations across individuals that depend on the size and shape of faces, characteristic patterns of muscle activation, and mechanical properties of skin and tissue.”
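The grafting approach criticized above can be illustrated with a minimal sketch. It assumes faces are stored as arrays of corresponding 3D vertex positions; the function name `graft_expression`, the `weight` parameter, and the synthetic data are hypothetical and are not taken from any cited work. The sketch shows that the displacement learned from the source subject is copied unchanged onto the target, regardless of the target's size or shape.

```python
import numpy as np

def graft_expression(source_neutral, source_expression, target_neutral, weight=1.0):
    """Add the source subject's expression displacement to another neutral face.

    All arrays have shape (n_vertices, 3) and are assumed to be in
    vertex-to-vertex correspondence (an assumption of this sketch).
    """
    # The expression is modeled as a deformation of the source's neutral geometry.
    offset = source_expression - source_neutral
    # The same deformation is applied to the target without any adaptation.
    return target_neutral + weight * offset

# Synthetic stand-in data: the target face is twice as large as the source.
rng = np.random.default_rng(1)
source_neutral = rng.standard_normal((4, 3))
source_smile = source_neutral + np.array([0.0, 0.5, 0.0])
target_neutral = 2.0 * source_neutral

grafted = graft_expression(source_neutral, source_smile, target_neutral)
# The target moves by exactly the source's displacement, not a scaled one.
print(np.allclose(grafted - target_neutral, source_smile - source_neutral))
```

Because the offset does not depend on the target at all, person-specific variation in how an expression deforms a face is lost, which is the limitation noted in the quotation.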
It is well known in computer vision that variation in facial images is better modeled as being multilinear in pose, expression, identity, lighting, or any combination thereof. Put simply, whatever the function that generates face images, a multilinear model will capture more terms of its first-order Taylor approximation than a linear model; thus, multilinear models can offer better approximations.
Most important for animation, multilinear models offer separability of attributes, so that each attribute can be controlled independently. In general, separability is not compatible with statistical efficiency in linear subspace models, except in the vastly improbable case that all variations between people are orthogonal to all variations between expressions. Such orthogonality is not possible in a world where gravity endows older faces with a natural frown.
As with linear models, the main empirical observation is that multilinear models approximate the data quite well. In particular, they have proven effective for recognition and synthesis of image and motion capture data; see Vasilescu et al., “Multilinear analysis of image ensembles: Tensorfaces,” 7th European Conference on Computer Vision (ECCV 2002) (Part I), pp. 447–460, 2002, and Vasilescu, “Human motion signatures: Analysis, synthesis, recognition,” Proc. ICPR, 2002.
Another appeal of these methods is their simplicity of use. A linear morphable model is easily estimated from a matrix of example faces via a singular value decomposition (SVD), and connected to vision or rendering through simple linear algebra.
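As a minimal sketch of the SVD-based estimation just described, the following assumes each example face is flattened into one column of a data matrix and that the mean face is subtracted before decomposition. The function names, the centering step, and the synthetic data are assumptions made for illustration, not details given in the cited works.

```python
import numpy as np

def fit_linear_model(faces, rank):
    """Estimate a linear face subspace.

    faces: (n_coords, n_examples) matrix, one flattened example face per column.
    Returns the mean face, an orthonormal basis of `rank` columns, and the
    corresponding singular values.
    """
    mean = faces.mean(axis=1, keepdims=True)
    # Thin SVD of the centered data; the leading columns of U span the subspace.
    U, s, _ = np.linalg.svd(faces - mean, full_matrices=False)
    return mean, U[:, :rank], s[:rank]

def project(face, mean, basis):
    """Subspace coefficients of a face given as a column vector."""
    return basis.T @ (face - mean)

def synthesize(coeffs, mean, basis):
    """Reconstruct a face from subspace coefficients."""
    return mean + basis @ coeffs

# Synthetic stand-in for real face data: 20 examples lying in a 3-D subspace.
rng = np.random.default_rng(0)
true_basis = rng.standard_normal((30, 3))
faces = true_basis @ rng.standard_normal((3, 20)) + 5.0

mean, basis, s = fit_linear_model(faces, rank=3)
recon = synthesize(project(faces[:, [0]], mean, basis), mean, basis)
print(np.allclose(recon, faces[:, [0]]))
```

Projection and synthesis are each a single matrix product, which is the sense in which such a model connects to vision or rendering through simple linear algebra.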
Similarly, a multilinear model can be estimated from a tensor of example images via higher-order singular value decomposition (HOSVD), a generalization of SVD, Tucker, “The extension of factor analysis to three-dimensional matrices,” Contributions to mathematical psychology, Gulliksen et al., Eds, Holt, Rinehart & Winston, N.Y., pp. 109–127, 1964, Lathauwer et al., “A multilinear singular value decomposition,” SIAM J. Matrix Analysis and Applications 21, 4, pp. 1253–1278, 1994, and Lathauwer, “Signal Processing based on Multilinear Algebra,” Ph.D. Thesis, Katholieke Universiteit Leuven, Belgium, 2000.
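The HOSVD estimation can be sketched in a few lines as well. The example below assumes a third-order data tensor with modes ordered as (coordinates, identities, expressions); that layout, the helper names, and the random stand-in data are assumptions for illustration only. Each mode's factor matrix is taken from the SVD of the tensor unfolded along that mode, and the core tensor is obtained by projecting the data onto those factors.

```python
import numpy as np

def unfold(tensor, mode):
    """Flatten a tensor into a matrix whose rows index the given mode."""
    return np.moveaxis(tensor, mode, 0).reshape(tensor.shape[mode], -1)

def mode_multiply(tensor, matrix, mode):
    """Multiply a tensor by a matrix along one mode (mode-n product)."""
    out = np.tensordot(matrix, tensor, axes=(1, mode))  # new axis comes first
    return np.moveaxis(out, 0, mode)

def hosvd(tensor, ranks):
    """Return a core tensor and one orthonormal factor matrix per mode."""
    factors = []
    for mode, r in enumerate(ranks):
        U, _, _ = np.linalg.svd(unfold(tensor, mode), full_matrices=False)
        factors.append(U[:, :r])
    core = tensor
    for mode, U in enumerate(factors):
        core = mode_multiply(core, U.T, mode)
    return core, factors

def reconstruct(core, factors):
    """Multiply the core by every factor matrix to rebuild the data tensor."""
    out = core
    for mode, U in enumerate(factors):
        out = mode_multiply(out, U, mode)
    return out

# Synthetic stand-in data: 12 coordinates, 5 identities, 4 expressions.
rng = np.random.default_rng(0)
data = rng.standard_normal((12, 5, 4))
core, (U_coord, U_id, U_expr) = hosvd(data, ranks=(12, 5, 4))

# Separate control: pick one identity row and one expression row independently.
i, j = 2, 3
face = reconstruct(core, [U_coord, U_id[[i], :], U_expr[[j], :]]).ravel()
print(np.allclose(face, data[:, i, j]))
```

With full ranks the decomposition reproduces the data exactly; choosing smaller ranks truncates each mode's basis and yields an approximation. Swapping in a different identity row while keeping the expression row fixed (or the reverse) is how the attributes of such a model can be varied independently.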