Achieving a digital double which can replicate the facial appearance and motion of a real actor requires a facial animation rig which can reproduce the shapes a real face traverses as it moves. Digital double face rigs typically incorporate a neutral pose geometry derived from a high-quality 3D scan of the actor's face, along with a collection of blendshapes which are based on reference and/or scans of the actor's face in a variety of poses.
Blendshapes are combined with a base shape, thereby deforming the base shape, to achieve numerous pre-defined shapes and various combinations in-between. The base shape, such as a single mesh, is the default shape (an expressionless face, for example). Various expressions, such as smiling, laughing, frowning, growling, yelling, closed eyes, open eyes, heightened eyebrows, lowered eyebrows, pursed lips, and mouth shapes of vowels or consonants, blend or morph into the base shape and, in so doing, are referred to as blendshapes or morph targets. For purposes of this specification, the terms blendshape and morph target shall be used interchangeably.
Collections of blendshapes are linked to the neutral base shape by a blend node, operator or modifier. The intensity for each shape that is linked to the base shape can be modified, thereby linearly increasing or decreasing the prominence of that shape in the image. An animator can then “turn up” or “turn down” the blendshape poses, and the base shape will animate to partially or fully assume the target shape to form the desired expression.
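By way of illustration only, the linear behavior described above may be sketched as follows. The function name `blend`, the two-vertex mesh, and the shape names `smile` and `brow_raise` are hypothetical and are chosen solely for this example; the sketch assumes that each blendshape is stored as a full target pose with the same vertex order as the base shape.

```python
from typing import Dict, List, Tuple

Vertex = Tuple[float, float, float]


def blend(base: List[Vertex],
          targets: Dict[str, List[Vertex]],
          weights: Dict[str, float]) -> List[Vertex]:
    """Add each target's offset from the base, scaled by its weight."""
    result = [list(v) for v in base]
    for name, w in weights.items():
        target = targets[name]
        for i, (b, t) in enumerate(zip(base, target)):
            for axis in range(3):
                # Offset of the target from the base, scaled linearly.
                result[i][axis] += w * (t[axis] - b[axis])
    return [tuple(v) for v in result]


# Hypothetical two-vertex "face" used only for illustration.
base = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
targets = {
    "smile": [(0.0, 1.0, 0.0), (1.0, 1.0, 0.0)],
    "brow_raise": [(0.0, 0.0, 2.0), (1.0, 0.0, 0.0)],
}

# "Turning up" a single shape halfway.
half_smile = blend(base, targets, {"smile": 0.5})

# Mixing two shapes at different intensities.
mixed = blend(base, targets, {"smile": 1.0, "brow_raise": 0.5})
```

With all weights at zero the sketch returns the base shape unchanged, and with a single weight at one it returns that target pose exactly, which corresponds to the partial or full assumption of a target shape described above.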
Using this method, animators can mix and match blendshapes to form any number of combinations between the prepared and linked blendshapes. A raised eyebrow can be mixed with a grin to form a quizzical expression. A blendshape with a frown can be mixed with downward eyebrows to form an angry expression or a look of disapproval. For maximum flexibility, artists will often break down and create isolated expressions (blendshapes) just for each individual facial component (nose, eyes, eyelids, eyebrows, and mouth) and even those expressions may be further divided into left and right versions thereof, affecting only the left or right portions of the face.
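One possible way of dividing an expression into left and right versions is sketched below for illustration. The sketch assumes that the face is centered on x = 0 with positive x on the left side of the face, and it uses a simple linear falloff across the midline; the function name `split_left_right` and the `falloff` parameter are hypothetical, and artists may equally produce such splits by hand.

```python
from typing import List, Tuple

Vertex = Tuple[float, float, float]


def split_left_right(base: List[Vertex],
                     target: List[Vertex],
                     falloff: float = 0.5) -> Tuple[List[Vertex], List[Vertex]]:
    """Split one target into left and right targets using a linear mask.

    Assumption: +x is the left side of the face and x = 0 is the midline.
    """
    left, right = [], []
    for (bx, by, bz), (tx, ty, tz) in zip(base, target):
        # Mask is 0 at x <= -falloff, 1 at x >= +falloff, linear in between.
        w_left = min(1.0, max(0.0, 0.5 + bx / (2.0 * falloff)))
        w_right = 1.0 - w_left
        dx, dy, dz = tx - bx, ty - by, tz - bz
        left.append((bx + w_left * dx, by + w_left * dy, bz + w_left * dz))
        right.append((bx + w_right * dx, by + w_right * dy, bz + w_right * dz))
    return left, right


# Hypothetical three-vertex strip: right side, midline, left side.
base = [(-1.0, 0.0, 0.0), (0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
smile = [(-1.0, 1.0, 0.0), (0.0, 1.0, 0.0), (1.0, 1.0, 0.0)]

smile_left, smile_right = split_left_right(base, smile)
```

Because the two masks sum to one at every vertex, actuating both halves at full intensity in a linear rig reproduces the original, unsplit expression.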
To span a sufficient array of actor-specific facial expressions which is consistent across human faces, this collection of blendshapes is usually designed to isolate muscle group action units according to the Facial Action Coding System (FACS). Combinations of these shapes are actuated by controls which are exposed to an animator and/or driven by performance capture data.
There are several challenges associated with trying to obtain the basis expressions from captured poses of an actor's face. First, rigid head movement must be factored out to yield face motion in a common coordinate system for basis selection or animation curve solving. Rigid head transforms may be estimated from the movement of a rigid object attached to the performer's head, but such estimates suffer when the attached object moves relative to the head during large accelerations. Rigid head transforms may also be estimated from a set of facial markers which exhibit minimal motion compared to the rest of the face; however, they do not truly remain still during a performance.
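The marker-based estimation mentioned above may be illustrated with a deliberately simplified sketch. The sketch works in two dimensions, whereas production data is three-dimensional, and it assumes that the chosen markers are perfectly rigid, which, as noted, does not hold in practice. The function names `estimate_rigid_2d` and `stabilize` and all marker coordinates are hypothetical.

```python
import math
from typing import List, Tuple

Point = Tuple[float, float]


def estimate_rigid_2d(ref: List[Point], cur: List[Point]):
    """Least-squares rotation angle (ref -> cur) plus both centroids."""
    n = len(ref)
    rc = (sum(p[0] for p in ref) / n, sum(p[1] for p in ref) / n)
    cc = (sum(p[0] for p in cur) / n, sum(p[1] for p in cur) / n)
    dot = cross = 0.0
    for (rx, ry), (cx, cy) in zip(ref, cur):
        ax, ay = rx - rc[0], ry - rc[1]
        bx, by = cx - cc[0], cy - cc[1]
        dot += ax * bx + ay * by
        cross += ax * by - ay * bx
    return math.atan2(cross, dot), rc, cc


def stabilize(points: List[Point], theta: float, rc: Point, cc: Point) -> List[Point]:
    """Undo the estimated head motion, returning points in the reference frame."""
    c, s = math.cos(-theta), math.sin(-theta)
    out = []
    for x, y in points:
        dx, dy = x - cc[0], y - cc[1]
        out.append((rc[0] + c * dx - s * dy, rc[1] + s * dx + c * dy))
    return out


def move_head(p: Point, angle: float, shift: Point) -> Point:
    """Helper for the example: apply a known rigid head motion."""
    c, s = math.cos(angle), math.sin(angle)
    return (c * p[0] - s * p[1] + shift[0], s * p[0] + c * p[1] + shift[1])


# Hypothetical "stable" markers in the reference frame.
stable_ref = [(0.0, 0.0), (2.0, 0.0), (0.0, 1.0)]
# A lip marker that has genuinely moved to (1.0, -1.5) relative to the head.
lip_in_head = (1.0, -1.5)

angle, shift = math.radians(30.0), (5.0, -3.0)
stable_cur = [move_head(p, angle, shift) for p in stable_ref]
lip_cur = move_head(lip_in_head, angle, shift)

theta, rc, cc = estimate_rigid_2d(stable_ref, stable_cur)
lip_stabilized = stabilize([lip_cur], theta, rc, cc)[0]
```

In this idealized case the lip marker is recovered at its head-relative position. Any residual motion of the supposedly stable markers would bias the estimated transform and therefore leak into the stabilized face motion.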
Second, there is the question of what constitutes an appropriate “base” or “neutral” pose. On the acquisition side, this should be a relaxed facial expression which the actor can hold, as well as reproduce consistently across scanning sessions and/or modes of capture. However, in reality the actor is unlikely to produce the same relaxed facial expression from shot to shot; it is therefore difficult to judge which one is the “true” base pose. Crucially, several of the desired poses may be difficult for some or all individuals to perform. Furthermore, even easy-to-perform expressions are impractical to achieve in isolation. For example, a captured open jaw expression might contain 5% of an upper eyebrow raised expression. The captured shapes have to be carefully processed by manually painting out any undesirable motion, in order to produce clean basis shapes.
Significant production time is spent decomposing retopologized face shapes into localized, meaningful poses. The quality of the results will depend highly on the skill of the artist. An ideal decomposition relies on the artist's foresight into how the basis shapes, prepared individually, will combine during animation.
For facial rigging and animation, bone-based and blendshape rigs are the two most typical representations in use. Bone-based rigs allow affine deformation; however, it is non-trivial to build a bone-based rig from facial measurements, because deriving optimal skinning weights and bone locations is difficult. In addition, even under the assumption that the skinning weights and bone locations are known, inferring joint transformations based on positional constraints is essentially an inverse kinematics problem, which could lead to multiple solutions. Blendshape rigs, on the other hand, are much easier to work with when starting from individual facial captures. However, they typically do not incorporate rotations. In addition, the nature of the blendshape lures digital artists to add more shapes for a better approximation of non-linear deformation; nevertheless, this can end up introducing linearly dependent shapes, resulting in confusing facial solving results.
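The effect of linearly dependent shapes on solving may be illustrated with a minimal sketch. The three shape names and their offsets are hypothetical; the third offset is deliberately constructed as the sum of the first two, so that the set is linearly dependent.

```python
from typing import Dict, List


def combine(deltas: Dict[str, List[float]], weights: Dict[str, float]) -> List[float]:
    """Weighted sum of per-shape offsets from the neutral pose."""
    size = len(next(iter(deltas.values())))
    out = [0.0] * size
    for name, w in weights.items():
        for i, d in enumerate(deltas[name]):
            out[i] += w * d
    return out


# Hypothetical offsets; "brow_up" equals "brow_inner_up" + "brow_outer_up".
deltas = {
    "brow_inner_up": [1.0, 0.0, 0.0],
    "brow_outer_up": [0.0, 1.0, 0.0],
    "brow_up": [1.0, 1.0, 0.0],
}

# Three different weight sets that all produce the same deformation.
solution_a = combine(deltas, {"brow_inner_up": 1.0, "brow_outer_up": 1.0})
solution_b = combine(deltas, {"brow_up": 1.0})
solution_c = combine(deltas, {"brow_inner_up": 0.5, "brow_outer_up": 0.5, "brow_up": 0.5})
```

Since all three weight sets yield an identical deformation, a solver fitting weights to a captured shape has no basis for preferring one over another, and the recovered animation curves may switch between them arbitrarily from frame to frame.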
Linear blendshape models have been adopted for many facial animation applications. However, a major disadvantage of various prior art approaches is that the model is holistic: all of its components are related to each other and have global support. This makes it difficult to focus on, and isolate, localized deformations. There have been attempts to automatically discover localized facial action units by dividing a human face into several clusters, based on analyzing the motion data or inducing sparsity. Nevertheless, there are two major issues that make these data-driven methods impractical for production use. Firstly, they require a substantial amount of data to adequately train a facial deformation model. Secondly, since the animation model is data-driven, it is difficult to obtain consistent results (for example, segmentations and/or corresponding motions) across different subjects. There are methods to model out-of-span deformations as correctives, but these approaches do not alter the rig itself, and the extra shapes are difficult to interpret if further editing is desired.
There is therefore a need for a method and system that addresses the above challenges in producing digital double facial animation rigs. Such a system should be able to produce a set of blendshapes capable of reproducing input performances captured for a given actor, while conforming to general semantics such that the rigs produced for each actor behave consistently in the hands of an animator. Additionally, there is a need for methods and systems that provide an easy, dynamic approach to modifying a generated blendshape rig, based upon a predefined set of scalar parameters that would enable a user to readily generate new blendshape rigs as desired. Further, there is a need for methods and systems that can iteratively, automatically generate desired blendshapes from an actor's performance and a database of existing blendshape templates.