The automatic localization of facial landmarks (also referred to as facial landmarking or facial alignment), such as the corners of the eyes, the tip of the nose, the tip of the chin, and the corners of the mouth, is a key pre-processing step that can aid in performing facial recognition, generation of 3D facial models, expression analysis, superresolution of faces, pose estimation, gender and ethnicity classification, age estimation, facial hair segmentation, and a variety of other facial analytic tasks. With the strides made in all of these areas over the past few years, there has been a shift towards harnessing local information in regions around key facial landmarks, in addition to the global information that can be obtained from a face detector that provides a bounding box around a face in an image. This has, in turn, motivated the need for extremely precise automatic facial landmarking methods and systems that generalize and adapt well enough to handle variations in pose, illumination, expression, levels of occlusion, and image resolution in unseen test images. It is also desirable that such methods be trainable on a limited amount of training data, as providing the manually annotated ground truths necessary to train these systems is an arduous task.
Facial landmark localization has been well researched over the past few years and a variety of different techniques have been proposed to deal with the problem. Traditionally, facial landmarking has been carried out using deformable template (parametric) based models, such as Active Shape Models (ASMs) and Active Appearance Models (AAMs). Both ASMs and AAMs build shape models (also referred to as Point Distribution Models (PDMs)) that model the shape of a typical face, represented by a set of constituent landmarks, and texture models of what the region enclosed by these landmarks looks like. The difference between the two is that ASMs build local texture models of what small 1D or 2D regions around each of the landmarks look like, while AAMs build global texture models of the entire convex hull bounded by the landmarks. ASMs belong to a class of methods that can be broadly referred to as Constrained Local Models (CLMs). CLMs build local models of texture around landmarks (sometimes referred to as patch experts) and allow landmarks to drift into the locations that optimize a cost function by updating and manipulating a set of shape coefficients to generate a final set of landmarks that are in accordance with the “rules” for what a typical face looks like. Several improvements have been made to ASMs over the years, mainly focused on developing better local texture models. However, ASMs remain susceptible to the problems of facial occlusion and local minima, and are very dependent on good initialization being provided. Thus, several efforts have been made to develop alternative shape regularization techniques to better cope with pose variation and partial occlusion of the face.
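The role of the shape coefficients described above can be illustrated with a minimal sketch of a Point Distribution Model. It assumes the training shapes have already been aligned to a common frame, and it uses a plain principal component analysis with coefficients clamped to a fixed number of standard deviations, which is one common way of enforcing shape “rules” and is not necessarily the formulation used by any of the works cited here. The function names fit_pdm and regularize_shape and the parameter limit are hypothetical and chosen only for illustration.

```python
import numpy as np

def fit_pdm(shapes, n_modes):
    """Fit a simple PDM. shapes: (n_samples, n_landmarks, 2), assumed pre-aligned."""
    X = shapes.reshape(len(shapes), -1)       # flatten each shape to a 2L-vector
    mean = X.mean(axis=0)
    # SVD of the centered data yields the principal modes of shape variation
    _, s, vt = np.linalg.svd(X - mean, full_matrices=False)
    modes = vt[:n_modes]                      # (n_modes, 2L), orthonormal rows
    variances = (s[:n_modes] ** 2) / (len(X) - 1)
    return mean, modes, variances

def regularize_shape(shape, mean, modes, variances, limit=3.0):
    """Project a candidate shape onto the model and clamp its coefficients.

    limit is an assumed bound, in standard deviations per mode.
    """
    b = modes @ (shape.reshape(-1) - mean)    # shape coefficients
    bound = limit * np.sqrt(variances)
    b = np.clip(b, -bound, bound)             # keep the shape plausible
    return (mean + modes.T @ b).reshape(-1, 2), b
```

In a CLM-style fitting loop, a step of this kind would typically follow each round of local landmark updates, so that landmarks pulled away by occlusion or poor local responses are drawn back towards a plausible facial shape.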
Over the past few years there has been a dramatic increase in literature dealing with the automatic landmarking of non-frontal faces. Everingham et al. developed an algorithm that used a generative model of facial feature positions (modeled jointly using a mixture of Gaussian trees) and a discriminative model of feature appearance (modeled using a variant of AdaBoost and “Haar-like” image features) to localize a set of 9 facial landmarks in videos with faces exhibiting slight pose variation. Dantone et al. used conditional regression forests to learn the relations between facial image patches and the location of feature points conditioned on global facial pose. Their method also localized a sparse set of 10 landmarks in real-time and achieved accurate results when trained and tested on images from the Labeled Faces in the Wild (LFW) database. Belhumeur et al. proposed a novel approach to localizing facial parts that combined, in a Bayesian framework, the output of local detectors with a consensus of nonparametric global models for part locations computed using training set exemplars, which served as a surrogate for shape regularization. Their approach was able to localize a set of 29 facial landmarks on faces that exhibited a wider range of occlusion, pose, and expression variation than many previous approaches.
In a recent work, Zhu and Ramanan proposed a framework that built on the previously developed idea of using mixtures of Deformable Part Models (DPMs) for object detection to simultaneously detect faces, localize a dense set of landmarks, and provide a coarse estimate of facial pose (yaw) in challenging images. Their approach used a mixture of trees with a shared pool of parts to model each facial landmark. Global mixtures were used to capture changes in facial shapes across pose and the tree-structured models were optimized quickly and effectively using dynamic programming. Their approach is quite effective at localizing landmarks across all views on clean images that do not exhibit excessive levels of occlusion. However, it is less accurate when landmarking occluded faces or faces that exhibit large in-plane rotation. Asthana et al. developed a discriminative regression-based approach for the CLM framework that they referred to as Discriminative Response Map Fitting (DRMF). DRMF represents the response maps around landmarks using a small set of parameters and uses regression techniques to learn functions to obtain shape parameter updates from the response maps.
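The reason tree-structured part models can be optimized exactly with dynamic programming can be shown with a small, generic sketch. It is not the implementation of Zhu and Ramanan: it assumes each part has a short list of candidate locations with an appearance score, that each child part has a table of deformation scores against its parent's candidates, and that parts are numbered so that every parent precedes its children. The function name best_tree_configuration and all argument names are hypothetical.

```python
import numpy as np

def best_tree_configuration(unary, parent, pairwise):
    """Max-sum dynamic programming over a tree of parts.

    unary[i]    : (K_i,) appearance scores for the candidates of part i
    parent[i]   : index of the parent of part i (-1 for the root, part 0);
                  assumed to satisfy parent[i] < i for i > 0
    pairwise[i] : (K_parent, K_i) deformation scores for part i > 0
    Returns the best total score and the chosen candidate index per part.
    """
    n = len(unary)
    msg = [np.asarray(u, dtype=float).copy() for u in unary]
    argbest = [None] * n
    # Leaves-to-root pass: each part reports, for every parent candidate,
    # the best score achievable in its own subtree.
    for i in range(n - 1, 0, -1):
        total = pairwise[i] + msg[i][None, :]
        argbest[i] = total.argmax(axis=1)
        msg[parent[i]] += total.max(axis=1)
    # Root-to-leaves pass: read back the maximizing candidates.
    choice = [0] * n
    choice[0] = int(msg[0].argmax())
    for i in range(1, n):
        choice[i] = int(argbest[i][choice[parent[i]]])
    return float(msg[0].max()), choice
```

Because each part exchanges information only with its parent, the cost grows linearly with the number of parts rather than exponentially, which is what makes a dense set of landmarks tractable in a model of this kind.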
All of the previously mentioned facial alignment algorithms are capable of providing accurate fitting results on some challenging images but lack some features provided by the method and system of the present invention. Some of the previously mentioned approaches only localize a sparse set of landmarks, which is unsuitable for many real-world applications, such as expression analysis or the building of 3D facial models, that require a slightly denser set of landmarks to establish point correspondences. Also, none of the approaches demonstrate the capability of handling yaw variation in excess of ±45° and are thus incapable of automatically landmarking profile faces. Finally, even though a few of the previously mentioned approaches demonstrate slight tolerance to partially occluded faces, none of them provide a score or label that can be used to determine which landmarks are potentially misaligned or occluded. It would be desirable to address all of these issues in a single framework.
The task of automatically landmarking low resolution images that also exhibit pose variation and partial occlusion of the face must also be addressed. There has been some prior work on facial alignment of frontal low resolution facial images. Liu et al. built a multi-resolution AAM at various scales of facial size and used the most appropriate model (with a model resolution slightly higher than the facial resolution) to fit low resolution faces (of varying resolution) in a few video sequences. Dedeoglu et al. proposed a Resolution-Aware Formulation (RAF) that modified the original AAM fitting criterion in order to better fit low resolution images and used their method to fit 180 frames of a video sequence. Qu et al. extended a traditional CLM to a multi-resolution model consisting of a 4-level patch pyramid and also used various feature descriptors to construct the patch experts. They compared their approach (using various feature descriptors) against a baseline CLM approach on downsampled 35×35, 25×25, and 15×15 faces from a few databases, such as the MPIE database, and demonstrated acceptable fitting accuracies on the low resolution faces. None of the previously mentioned works, however, investigated the challenges posed to the fitting process by the presence of facial pose variation in low resolution images.
Because facial shape and the local texture around the landmarks that constitute it vary in a nonlinear fashion with facial pose and expression, it is necessary to build not one, but multiple models that can best span and account for these variations. Additionally, occlusions can occur anywhere on the face and vary significantly in textural appearance (e.g. sunglasses, hats, human hair, hands, cellular phones, scarves, other faces, etc.). Thus, building models to account for occlusions based on where they could typically lie (a shape based modeling approach to handling occlusions) is an idea that does not generalize well to real-world images. Many existing facial alignment algorithms also rely heavily on consistent facial detection results, something that is seldom guaranteed when dealing with real-world image data, as the facial bounding boxes produced by the same detector vary in size and location even for a similar set of images and do not always account for in-plane rotation (roll) of the face.