Robust face recognition and analysis depend on accurate localization of facial features. When modeling faces, the landmark points of interest lie along the shape boundaries of facial features, e.g., the eyes, lips, and mouth. In face images collected in the wild, occlusion of landmarks becomes a common problem for off-angle faces. Predicting the occlusion state of each landmark point is challenging because of the variety of objects that can cover the face, e.g., beards, mustaches, sunglasses, and other noisy objects. Additionally, face images of interest usually contain off-angle poses, illumination variations, low resolution, and partial occlusions.
Many complex factors affect the appearance of a facial image in real-world scenarios, and providing tolerance to these factors is difficult. Among them, pose is often the most important. As facial pose deviates from a frontal view, most face recognition systems have difficulty performing robustly. To handle a wide range of pose changes, it becomes necessary to utilize 3D structural information of faces. Many of the existing 3D face modeling schemes have drawbacks, such as computation time and complexity, which make them difficult to apply in real-world, large-scale, unconstrained face recognition scenarios.
While estimating a 3D model from images is not a new problem, modeling objects from a single image has always posed a challenge. This is, of course, due to the ambiguous nature of images, from which depth information has been removed. Recently, deep learning using convolutional neural networks (CNNs) has been used successfully to extract salient information from images. There have been many explorations into how best to use CNNs for modeling objects in three dimensions. Many of these approaches are aimed at creating a depth estimate for natural images. While the results on uncontrolled images are impressive, the fact that these models are general means they tend to be less effective when applied to specific objects, such as faces. Often, the depth estimate for faces in the scene tends to be fairly flat. By limiting the scope of the method to a specific object class such as faces, the resulting estimated 3D model can be made much more accurate. A 3D model of the face can be used to frontalize faces in unseen images, with the end goal of improving face recognition by limiting the variations that the matcher must learn. However, like other methods, this approach requires landmarks on the input face.
A 2D approach to landmarking inevitably suffers from the problem of visibility and self-occlusion. The problem of landmark marching, where landmarks tend to move to the visible boundary, can cause issues when estimating 3D models from 2D alignment. This problem can be alleviated by using a 3D model of the face in the alignment step. Such methods make use of an underlying 3D Morphable Model (3DMM) and try to fit the model to the input image to find the required landmarks. This requires a basis, such as the popular Basel Face Model (BFM). However, the BFM is created from a set of only 100 male and 100 female scans. A new 3D model is generated as a combination of the example faces. Because any basis can only recreate combinations of the underlying samples, these models are severely limited in their ability to fit outlier faces or previously unseen expressions. Thus, a key flaw in many approaches that rely on a 3DMM is that they need enough examples of the data to model unseen samples. However, in the case of 3D faces, most datasets are very small.
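The limitation of a fixed basis can be illustrated with a minimal sketch. It assumes a linear shape model in which each face is a flattened vector of vertex coordinates and a new face is the mean shape plus a weighted sum of deviations from the mean. The toy scans, the random data, and the function names `synthesize` and `fit` are hypothetical and are not taken from the BFM or from any specific 3DMM implementation.

```python
import numpy as np


def synthesize(mean_shape, basis, coeffs):
    """Linear shape model: mean shape plus a weighted sum of basis columns."""
    return mean_shape + basis @ coeffs


def fit(mean_shape, basis, target):
    """Least-squares coefficients for a target shape and the remaining error."""
    coeffs, *_ = np.linalg.lstsq(basis, target - mean_shape, rcond=None)
    residual = np.linalg.norm(synthesize(mean_shape, basis, coeffs) - target)
    return coeffs, residual


# Toy data (assumption): 3 example "scans", each 2 vertices flattened to 6 values.
rng = np.random.default_rng(0)
scans = rng.normal(size=(3, 6))
mean_shape = scans.mean(axis=0)
basis = (scans - mean_shape).T  # shape (6, 3); columns are deviations from the mean

# A shape that is a combination of the examples can be fitted exactly.
in_span = synthesize(mean_shape, basis, np.array([0.2, -0.5, 0.3]))
_, res_in = fit(mean_shape, basis, in_span)

# Build a direction the basis cannot represent by removing from a random
# vector everything that the basis can explain.
v = rng.normal(size=6)
proj_coeffs, *_ = np.linalg.lstsq(basis, v, rcond=None)
orth = v - basis @ proj_coeffs

# An "outlier" shape: the in-span shape plus the out-of-span component.
outlier = in_span + orth
_, res_out = fit(mean_shape, basis, outlier)

print(f"residual for in-span shape: {res_in:.2e}")
print(f"residual for outlier shape: {res_out:.2e}")
```

In this sketch, the shape built from the examples is reproduced with essentially zero residual, while the outlier leaves a residual equal to the size of its out-of-span component regardless of which coefficients are chosen. Enlarging the set of example scans is the only way to shrink that residual under this model.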