The present disclosure relates to a method for real-time facial animation and, in particular, to a processing device and a real-time facial animation system. Moreover, the disclosure relates to a dynamic expression model which may include a plurality of blendshapes that may be used to track facial expressions of a user in order to generate a corresponding graphical representation.
Recent advances in real-time performance capture have brought within reach a new form of human communication. Capturing dynamic facial expressions of users and re-targeting these facial expressions on digital characters enables communication using virtual avatars with live feedback. Compared to communication via recorded video streams, which offer only limited ability to alter the appearance of users, facial animation opens the door to fascinating new applications in computer gaming, social networks, television, training, customer support, or other forms of online interaction. However, a successful deployment of facial animation technology at a large scale puts high demands on performance and usability.
State-of-the-art marker-based systems, multi-camera capture devices, or intrusive scanners commonly used in high-end animation productions are not suitable for consumer-level applications. Equally inappropriate are methods that require complex calibration or necessitate extensive manual assistance to set up or create the system. Several real-time methods for face tracking have been proposed. Yet, video-based methods typically track only a few facial features and often lack fine-scale detail, which limits the quality of the resulting animations. Tracking performance can also degrade in difficult lighting situations that, for example, commonly occur in home environments.
State-of-the-art approaches also require the a priori creation of a tracking model and extensive training, which in turn requires building an accurate three-dimensional (3D) expression model of the user by scanning and processing a predefined set of facial expressions. Beyond being time-consuming, such pre-processing is also error-prone. Users are typically asked to move their head in front of a sensor in specific static poses to accumulate sufficient information. However, assuming and maintaining a correct pose (e.g., keeping the mouth open at a specific, predefined opening angle) may be exhausting and difficult and often requires multiple tries. Furthermore, manual corrections and parameter tuning are required to achieve satisfactory tracking results. Hence, user-specific calibration is a severe impediment to deployment in consumer-level applications.
Animating digital characters based on facial performance capture is known in the art. For example, marker-based systems are used to capture real-time performances, wherein explicit face markers may be placed on the face of a user in order to simplify tracking. However, the face markers limit the amount of spatial detail that can be captured. Systems utilizing a single camera to record facial performances often yield low tracking quality, with artifacts in the generated face animations. Performance capture systems based on dense 3D acquisition, such as structured light scanners or multi-view camera systems, are capable of capturing fine-scale dynamics but require a significant amount of processing time, thereby impeding interactive frame rates. Moreover, systems applying a combination of markers and 3D scanning often require specialized hardware set-ups that need extensive and careful calibration.