Over the last decade, new applications in computer vision and computational photography have arisen due to earlier advances in methods for detecting human faces in images. These applications include face detection-based autofocus and white balancing in cameras, smile and blink detection, new methods for sorting and retrieving images in digital photo management software, obscuration of facial identity in digital photos, facial expression recognition, virtual try-on, product recommendations, facial performance capture, avatars, controls, image editing software tailored for faces, and systems for automatic face recognition and verification.
The first step of any face processing system is the detection of locations in the images where faces are present. However, face detection from a single image is challenging because of variability in scale, location, orientation, and pose. Facial expression, occlusion, and lighting conditions also change the overall appearance of faces.
Given an arbitrary image, the goal of face detection is to determine whether or not there are any faces in the image and, if present, return the image location and extent of each face. The challenges associated with face detection can be attributed to the following factors:
Pose: The images of a face vary due to the relative camera-face pose (frontal, 45 degree, profile, upside-down), and some facial features such as an eye or the nose may become partially or wholly occluded.
Presence or absence of structural components: Facial features such as beards, moustaches, and glasses may or may not be present, and there is a great deal of variability among these components including shape, color, and size.
Facial expression: The appearance of faces is directly affected by a person's facial expression.
Occlusion: Faces may be partially occluded by other objects. In an image with a group of people, some faces may partially occlude other faces.
Image orientation: Face images vary directly for different rotations about the camera's optical axis.
Imaging conditions: When the image is formed, factors such as lighting (spectra, source distribution and intensity) and camera characteristics (sensor response, lenses, filters) affect the appearance of a face.
Camera settings: The settings on the camera and the way that it is used can affect focus blur, motion blur, depth of field, compression (e.g., JPEG) artifacts, and image noise.
Face detectors usually return the image location of a rectangular bounding box containing a face—this serves as the starting point for processing the image. A part of the process that is currently in need of improvement is the accurate detection and localization of parts of the face, e.g., eyebrow corners, eye corners, tip of the nose, ear lobes, hair part, jawline, mouth corners, chin, etc. These parts are often referred to as facial feature points or “fiducial points”. Unlike general interest or corner points, the fiducial point locations may not correspond to image locations with high gradients (e.g., tip of the nose). As a result, their detection may require larger image support.
A number of reported approaches have demonstrated high accuracy in localizing parts, but mostly in frontal images and often in controlled settings.
Early work on facial feature detection was often described as a component of a larger face processing task. For example, Burl, et al. take a bottom-up approach to face detection, first detecting candidate facial features over the whole image, then selecting the most face-like constellation using a statistical model of the distances between pairs of features. Other works detect large-scale facial parts such as each eye, the nose, and the mouth and return a contour or bounding box around these components.
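By way of illustration only, the bottom-up idea of selecting the most face-like constellation can be sketched as follows. This is not the formulation of Burl, et al.; it assumes, purely for simplicity, that each pairwise distance follows an independent Gaussian, and the names constellation_score, best_constellation, dist_mean, and dist_std are hypothetical.

```python
import itertools
import math


def constellation_score(points, dist_mean, dist_std):
    """Log-likelihood (up to a constant) of the pairwise distances.

    Assumption: each pairwise distance is an independent Gaussian with the
    mean and standard deviation stored under the key (i, j), i < j.
    """
    score = 0.0
    for i, j in itertools.combinations(range(len(points)), 2):
        d = math.dist(points[i], points[j])
        z = (d - dist_mean[(i, j)]) / dist_std[(i, j)]
        score += -0.5 * z * z - math.log(dist_std[(i, j)])
    return score


def best_constellation(candidates, dist_mean, dist_std):
    """Pick one candidate per part so that the constellation scores highest.

    candidates: list with one entry per part, each a list of (x, y) tuples.
    The search is exhaustive, so it is only practical for a few candidates.
    """
    return max(
        itertools.product(*candidates),
        key=lambda pts: constellation_score(pts, dist_mean, dist_std),
    )
```

The exhaustive search grows exponentially with the number of parts and candidates, which is one reason such models are typically restricted to a small set of features.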
There is a long history of part-based object descriptions in computer vision and perceptual psychology. Recent approaches have shown a renewed emphasis on parts-based descriptions and attributes because one can learn descriptions of individual parts and then compose them, generalizing to an exponential number of combinations. The Poselets work by Bourdev and Malik, incorporated herein by reference, describes a data-driven search for object parts that may be a useful approach for addressing some of the described inadequacies of the prior art in order to achieve precise face detection in uncontrolled image conditions.
Many fiducial point detectors include classifiers that are trained to respond to a specific fiducial (e.g., left corner of the left eye). These classifiers take as input raw pixel intensities over a window or the output of a bank of filters (Gaussian derivative filters, Gabor filters, or Haar-like features). The local detectors are scanned over a portion of the image and may return one or more candidate locations for the part or a “score” at each location. Such a local detector is often a binary classifier (feature or not-feature). For example, the Viola-Jones style detector, which uses an image representation called an “integral image” rather than working directly with image intensities, has been applied to facial features. False detections occur often, even for well-trained classifiers, because other portions of the image can have the appearance of a fiducial under some imaging condition. For example, a common error is for a “left corner of left eye” detector to respond to the left corner of the right eye. Eckhart, et al. achieve robustness and handle greater pose variation by using a large area of support for the detector covering, e.g., an entire eye or the nose with room to spare. Searching over a smaller region that includes the actual part location reduces the chance of false detections with minimal risk of missing fiducials. While this may be somewhat effective for frontal fiducial point detection, the location of a part within the face detector box can vary significantly when the head rotates in three dimensions. For example, while the left eye is in the upper-left side of the box when frontal, it can move to the right side when the face is seen in profile.
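As an illustrative sketch, an integral image (summed-area table) and a two-rectangle Haar-like feature computed from it might look as follows. NumPy is assumed, and the function names integral_image, box_sum, and two_rect_feature are hypothetical; a trained Viola-Jones style detector would combine many such features in a classifier cascade, which is not shown.

```python
import numpy as np


def integral_image(img):
    """Summed-area table padded with a leading zero row and column."""
    img = np.asarray(img, dtype=np.int64)
    ii = np.zeros((img.shape[0] + 1, img.shape[1] + 1), dtype=np.int64)
    ii[1:, 1:] = img.cumsum(axis=0).cumsum(axis=1)
    return ii


def box_sum(ii, top, left, height, width):
    """Sum of the pixels in a rectangle using four table lookups."""
    bottom, right = top + height, left + width
    return ii[bottom, right] - ii[top, right] - ii[bottom, left] + ii[top, left]


def two_rect_feature(ii, top, left, height, width):
    """Haar-like feature: left-half sum minus right-half sum.

    The width is assumed to be even so that the two halves match in size.
    """
    half = width // 2
    left_sum = box_sum(ii, top, left, height, half)
    right_sum = box_sum(ii, top, left + half, height, half)
    return left_sum - right_sum
```

Because every rectangle sum costs four lookups regardless of its size, features of this kind can be evaluated quickly at each window position as the detector is scanned over the image.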
To better handle larger variations in pose, constraints can be established on the location of each part relative to the other parts rather than relative to the detector box. These constraints can be expressed as predicted locations, bounding regions, or as a conditional probability distribution of one part location given another location. Alternatively, the joint probability distribution of all the parts can be used; one common model treats the part locations as a multivariate normal distribution whose mean is the average location of each part. This is the model underlying Active Appearance Models and Active Shape Models, which have been used for facial feature point detection in near frontal images. Saragih, et al. extend this to use a Gaussian Mixture Model, whereas Everingham, et al. handle a wider range of pose, lighting and expression variations by modeling the joint probability of the location of nine fiducials relative to the bounding box with a mixture of Gaussian trees. As pointed out in that work, a joint distribution of part locations over a wide range of poses cannot be adequately modeled by a single Gaussian.
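The single-Gaussian model described above can be illustrated with a short sketch that fits a multivariate normal to stacked part coordinates and then conditions on the parts already located. NumPy is assumed; the names fit_shape_gaussian and condition_on and the ridge parameter are hypothetical, and this is not the fitting procedure of any of the cited works.

```python
import numpy as np


def fit_shape_gaussian(shapes):
    """Fit a mean and covariance to part locations.

    shapes: array of shape (n_samples, 2 * n_parts), where each row stacks
    the (x, y) coordinates of every part for one training face.
    """
    shapes = np.asarray(shapes, dtype=float)
    return shapes.mean(axis=0), np.cov(shapes, rowvar=False)


def condition_on(mean, cov, known_idx, known_values, ridge=1e-6):
    """Gaussian conditional of the unknown coordinates given the known ones.

    ridge is a small diagonal term (an assumption) that keeps the solve
    stable when the known block of the covariance is near singular.
    """
    known_idx = np.asarray(known_idx)
    unknown_idx = np.setdiff1d(np.arange(mean.size), known_idx)
    s_kk = cov[np.ix_(known_idx, known_idx)] + ridge * np.eye(known_idx.size)
    s_uk = cov[np.ix_(unknown_idx, known_idx)]
    s_uu = cov[np.ix_(unknown_idx, unknown_idx)]
    # gain = s_uk @ inv(s_kk); valid because s_kk is symmetric.
    gain = np.linalg.solve(s_kk, s_uk.T).T
    delta = np.asarray(known_values, dtype=float) - mean[known_idx]
    cond_mean = mean[unknown_idx] + gain @ delta
    cond_cov = s_uu - gain @ s_uk.T
    return unknown_idx, cond_mean, cond_cov
```

The conditional mean gives a predicted location for each remaining part, and the conditional covariance gives a bounding region around it. A single Gaussian of this kind is exactly the model that becomes inadequate over a wide range of poses.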
While a number of approaches balance local feature detector responses on the image with prior global information about the feature configurations, optimizing the resulting objective function remains a challenge. The locations of some parts vary significantly with expression (e.g., the mouth, eyebrows) whereas others, such as the eye corners and nose, are more stable. Consequently, some detection methods organize their search to first identify the stable points. The locations of the mouth points are then constrained, possibly through a conditional probability, by the locations of the stable points. However, this approach fails when these stable points cannot be reliably detected, for example, when the eyes are hidden by sunglasses.
The need for the ability to reliably detect and identify features within an image is not limited to human facial recognition. Many other disciplines rely on specific features within an image to facilitate identification of an object within an image. For example, conservation organizations utilize markings such as ear notches, scars, tail patterns, etc., on wild animals for identification of individual animals for study of migration patterns, behavior and survival. The ability to reliably locate and identify the unique features within an image of an animal could provide expanded data for use in such studies. Other applications of image analysis that could benefit from improved feature location capability include identification of vehicles within images for military or law enforcement applications, and identification of structures in satellite images, to name a few.