Field
This disclosure relates to real-time facial segmentation and performance capture from RGB input.
Description of the Related Art
There is a great deal of research related to three-dimensional, real-life modeling and facial capture. Most capture systems that provide any degree of accuracy and fidelity are based upon a series of capture points (e.g. fiducial markers) placed upon a body or face for later identification in post-processing of the associated video. Most individuals are familiar with a series of “white dots” placed at various places upon a human body or on a face itself so that those “white dots” may be identified automatically by computer modeling systems after the associated video of the individual or face has been captured. Those dots may then be used to extrapolate lifelike motion onto computer-generated models (e.g. the character Gollum in the Lord of the Rings series of movies, whose actions were captured from a human actor wearing a suit and facial mask of white dots and thereafter translated onto the computer-generated character of Gollum).
Similarly, other systems rely upon white dots or other markers on an individual's face so as to capture acting and other facial motions of a human that may be translated onto a computer-generated character's face during post-processing. But these systems either require too much setup or are generally incapable of functioning in real-time (e.g. near-simultaneously with the image capture). Instead, these systems rely upon computer processing, sometimes requiring a great deal of processing power over hours of time, for a video that is only seconds long. And, as should be obvious, these systems rely upon a great deal of setup, including adding all of those white dots to a person's body or face, providing green screens to film in front of, and matching a particular model (e.g. a computer-generated face or body) to the associated white dots.
In a related field, there exist many facial capture or facial recognition systems that rely upon natural landmark detection. These types of systems typically identify a set of facial landmarks (e.g. the center and edges of both eyes, the center and nostrils of a nose, the center top, center bottom, and each corner of a mouth) to identify a particular individual or to identify a facial position. More sophisticated systems of these types can rely upon facial three-dimensional modeling. However, most of these systems rely upon visibility of a substantial number of those facial landmarks. If many, or sometimes even only a few, of those facial landmarks are covered by a person's hair, hands, or some other obstruction, facial identification or capture systems like these typically function poorly or not at all. They become unable to identify an individual or unable to readily identify the position or pose of the face (much less facial positions such as frowning, mouth open, etc.). Those that rely upon facial three-dimensional modeling likewise fail when faces are partially occluded because these systems have trouble extrapolating a facial mask and pose from the limited data set available for an occluded face.
Still other systems, more closely related to the present system, are capable of near real-time operation by relying upon convolutional neural networks trained with facial data so as to identify facial portions of an RGB image (without three-dimensional data). However, these systems typically have difficulty dealing with occlusions (e.g. hands in front of the face, shadows, hair, or portions of the face otherwise being blocked). These systems either misidentify faces or misidentify non-facial regions as facial regions when presented with occluded images. To deal with occlusions, some of these systems apply depth data (e.g. three-dimensional scanning, for example, using LIDAR) in addition to two-dimensional (in space, as opposed to color depth) image data. By adding depth data, these systems can much more accurately identify most occlusions. But reliance upon depth data requires the presence of depth sensors, which, at present, are not common on most sources of RGB image data like mobile phones and standard digital video cameras.
It is therefore desirable to enable real-time facial segmentation and performance capture using only RGB input data and, in particular, to provide such a system that is capable of robustly handling image occlusions, including skin-based occlusions similar in appearance to the face (e.g. hands), that cover some of a facial region.
Throughout this description, elements appearing in figures are assigned three-digit reference designators, where the most significant digit is the figure number and the two least significant digits are specific to the element. An element that is not described in conjunction with a figure may be presumed to have the same characteristics and function as a previously-described element having a reference designator with the same least significant digits.