Field
This disclosure relates to performance capture for real-time reproduction of facial and speech animation.
Description of the Related Art
Users of virtual reality systems have desired the opportunity to interact within a virtual environment, either alone or with others, while in-game or in-virtual environment avatars mirror those users' facial expressions. Examples such as experiencing a virtual “movie theater” with a friend or taking part in a virtual “chat room” with a group of friends are greatly enhanced when a user's avatar within the virtual environment mirrors that user's real-world facial expressions, speech, and visible emotional reactions.
The options for enabling these types of systems are either monetarily expensive or computationally expensive. Either problem places such capabilities beyond the technology presently available to an average virtual reality consumer. The process is all the more complicated given that virtual reality headsets typically block a large portion of an individual's face from external view. Thus, extrapolating facial expressions can be difficult.
The best-performing methods used in conjunction with virtual reality head mounted displays typically rely upon a combination of tracked facial landmarks and depth sensors. However, these types of systems function poorly when a facial region is occluded, for example by a user's hands, or when an individual's mouth changes shape so as to hide a landmark (e.g., a user bites his or her lip).
Other systems, for example those for tracking a user's eyes and facial movements within a head mounted display, rely upon electroencephalograms or electromyograms to derive facial movements from electrical currents within muscles and other facial tissues. These systems typically require a great deal of training to “learn” what specific electric nerve impulses mean in terms of facial movement. Alternatively, a captured facial region (or entire face) may be manually animated by an artist on a frame-by-frame basis (or may have only key-frames animated). This process is computationally (and temporally) intensive. More recently, infrared cameras, such as those in the Fove head mounted display, have been used to track eye gaze and eye regions. Regardless, these systems rely upon non-standard (or expensive) sensors, require specialized pre-training by the user, or are too computationally expensive to perform in real-time.
It would, therefore, be desirable if there were a system and process by which facial animation could be enabled for head mounted displays with substantial fidelity in real-time for an on-going virtual reality interaction such that an avatar associated with a wearer of a head mounted display could realistically represent the facial expressions of that wearer during the interaction. It would be preferable if no pre-training, or only extremely limited pre-training, were required. The process must be sufficiently processor-friendly to enable it to take place in real-time without overly taxing currently available computing systems.
Throughout this description, elements appearing in figures are assigned three-digit reference designators, where the most significant digit is the figure number and the two least significant digits are specific to the element. An element that is not described in conjunction with a figure may be presumed to have the same characteristics and function as a previously-described element having a reference designator with the same least significant digits.