Facial expression measurement has become extremely valuable in numerous fields such as performance-driven facial animation in media production, behaviour analysis, cognitive studies, product design analytics, and measuring emotional responses to advertising.
Methods heretofore developed for measuring facial expressions are typically limited to professional applications and capture scenarios insofar as they require constrained capture environments, specialized hardware, trained operators, physical facial markers and long setup times. These requirements make such technologies impractical for some professional and most consumer applications.
For example, media producers are continually striving to provide a more life-like appearance to their digitally generated characters, particularly in the challenging field of creating realistic facial animation. One common method of achieving this is to capture the visual performance of an actor, in addition to their voice, in order to transfer the facial movements and expressions onto the digital character to coincide with the sound track. Currently, this requires a highly skilled animator with the ability to identify the multiple movements of the performer's face which form the speech and expressions, and map them onto a complex set of animation controls of the digital character. Due to the highly specialized and time-consuming task of interpreting and transferring often subtle facial movements onto a digitally animated character, this technique has traditionally been restricted to large production motion pictures and high-end video games.
Approaches have been developed to provide some level of automation for performance capture, in particular, in identifying and encoding the movements of specific facial locations within the performer's face.
Existing commercial approaches aimed at performance capture of actors in animation production use specialized performance capture environments to increase the reliability with which movements can be detected. One example of such an approach includes placing physical markers, designed to be easily identifiable and accurately located by the vision system, on the performer's face, as described in US Patent Application 2011/0110561, to Havaldar, entitled “Facial Motion Capture Using Marker Patterns That Accommodate Facial Surface,” and incorporated herein by reference. Another approach uses multiple cameras, as described in US Patent Application 2010/0271368 A1, to McNamara et al., entitled “Systems and Methods for Applying a 3d Scan of a Physical Target Object to a Virtual Environment,” which is also incorporated herein by reference. These approaches are able to locate facial features accurately; however, they also increase the skill level, set-up times and costs required during capture.
Other, more generic approaches to locating facial features in video sequences include various statistical models of shape and appearance, which have been proposed for use without requiring specialized capture environments or the placement of optical markers on a face, as described, for example, by Gao et al., “A review of active appearance models,” Trans. Sys. Man Cyber Part C, vol. 40(2), pp. 145-58 (2010), incorporated herein by reference. Such statistical treatments model the position and appearance of facial features from a training set of images, and, when applied to new images, find the combination of feature locations which best fits the model. Deficiencies of these approaches include the fact that each frame of the video sequence is processed in isolation, resulting in a series of feature location solutions which may jump from one frame to the next.
Other classes of solutions augment the statistical models of shape and appearance with temporal models. An example of this type of method is described by Prabhu et al., “Automatic Facial Landmark Tracking in Video Sequences using Kalman Filter Assisted Active Shape Models,” Proceedings of the Third Workshop on Human Motion in Conjunction with the European Conference on Computer Vision (ECCV '10), (2010), incorporated herein by reference. Prabhu et al. use a Kalman filter to provide an estimation of the position of the facial features in the next frame, given their positions in all previous frames, and then use an active shape model (ASM) to update this initial estimation. This approach, like kindred methods in the literature, uses the temporal information only in the forward direction, to improve the estimation of the next frame, and cannot retrospectively detect and correct tracking errors which may have occurred in previous frames. Such unidirectional temporal models are subject to drifting, and are not capable of avoiding being drawn towards local minima in the solution space.
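By way of illustration only, the forward-only predict-then-update cycle described above may be sketched as follows. The sketch is not the implementation of Prabhu et al.: it assumes a constant-velocity motion model for a single landmark coordinate, and the ASM refinement is replaced by a measurement supplied per frame. The names AxisKalman and track_forward, and the noise values q and r, are hypothetical and chosen only for the example.

```python
from dataclasses import dataclass


@dataclass
class AxisKalman:
    """Constant-velocity Kalman filter for one landmark coordinate (illustrative)."""
    x: float             # position estimate, in pixels
    v: float = 0.0       # velocity estimate, in pixels per frame
    p00: float = 100.0   # 2x2 state covariance, stored element by element
    p01: float = 0.0
    p10: float = 0.0
    p11: float = 100.0
    q: float = 0.01      # process noise (assumed value)
    r: float = 4.0       # measurement noise (assumed value)

    def predict(self) -> float:
        """Advance one frame: x <- x + v, P <- F P F^T + Q with F = [[1, 1], [0, 1]]."""
        self.x += self.v
        p00 = self.p00 + self.p01 + self.p10 + self.p11 + self.q
        p01 = self.p01 + self.p11
        p10 = self.p10 + self.p11
        p11 = self.p11 + self.q
        self.p00, self.p01, self.p10, self.p11 = p00, p01, p10, p11
        return self.x

    def update(self, z: float) -> float:
        """Correct the prediction with a measured position z (H = [1, 0])."""
        s = self.p00 + self.r                  # innovation variance
        k0, k1 = self.p00 / s, self.p10 / s    # Kalman gain
        y = z - self.x                         # innovation
        self.x += k0 * y
        self.v += k1 * y
        p00, p01 = self.p00, self.p01          # keep old values for P <- (I - K H) P
        self.p00 = (1 - k0) * p00
        self.p01 = (1 - k0) * p01
        self.p10 = self.p10 - k1 * p00
        self.p11 = self.p11 - k1 * p01
        return self.x


def track_forward(measurements):
    """Run the filter over per-frame measurements in the forward direction only.

    Each measurement stands in for the position a shape-model fit would
    return for that frame. Returns (predictions, estimates).
    """
    kf = AxisKalman(x=measurements[0])
    predictions, estimates = [], [measurements[0]]
    for z in measurements[1:]:
        predictions.append(kf.predict())   # initial estimate for this frame
        estimates.append(kf.update(z))     # refined by the measurement
    return predictions, estimates
```

Because every estimate in track_forward depends only on the current and earlier frames, an erroneous measurement at a given frame perturbs the estimates for that frame and the frames that follow, while the estimates already produced for earlier frames are never revisited. This is the unidirectional behaviour noted above.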
In prior work on multi-hypothesis feature tracking, for each feature, multiple potential matches are identified in each frame, as described, for example, by Chen et al., “Mode-based Multi-Hypothesis Head Tracking Using Parametric Contours,” Proc. Fifth IEEE Int. Conf. on Automatic Face and Gesture Recognition, pp. 112-17 (2002), and by Cham, et al., “Multiple hypothesis approach to figure tracking,” Proc. IEEE CVPR, vol. 2, pp. 239-45, (1999), which are incorporated herein by reference. Previous approaches, however, provide only the location and estimated orientation of the whole face, and cannot track individual features within the face to enable expression encoding and transfer to animated characters.