1. Field of the Invention
The present invention is a method and system to provide a face-based automatic demographics classification system that is robust to pose changes of the target faces and to accidental scene variables, such as noise, lighting, and occlusion. Given a video stream of people's faces detected by a face detector, the two-dimensional (2D) and three-dimensional (3D) poses are estimated to facilitate the tracking and the building of pose-dependent facial models. Once the track is complete, the separately built pose-dependent facial models are fed to the demographics classifiers, which are in turn trained using only the faces having the corresponding pose, to determine the final face category, such as gender, age, and ethnicity of the person.
2. Background of the Invention
Face Detection
There have been prior attempts to detect human faces in still images or in videos.
U.S. Pat. Appl. Pub. No. 20020102024 of Jones et al. (hereinafter Jones) disclosed an object detection system for detecting instances of an object in a digital image using an image integrator and an object detector, which includes a classifier (classification function) and an image scanner. The image integrator receives an input image and calculates an integral image representation of the input image. The image scanner scans the image in same sized subwindows. The object detector uses a cascade of homogeneous classification functions or classifiers to classify the subwindows as to whether each subwindow is likely to contain an instance of the object. Each classifier evaluates one or more features of the object to determine the presence of such features in a subwindow that would indicate the likelihood of an instance of the object in the subwindow.
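The integral image representation mentioned above can be illustrated with a minimal sketch. It assumes a grayscale image stored as a list of rows; the function names (`integral_image`, `box_sum`, `two_rect_feature`) are hypothetical and are not taken from Jones. The two-rectangle feature is shown only as one plausible kind of feature that a classifier could evaluate cheaply once the table exists.

```python
def integral_image(img):
    """Return an (h+1) x (w+1) table where ii[y][x] is the sum of img[0:y][0:x]."""
    h, w = len(img), len(img[0])
    ii = [[0] * (w + 1) for _ in range(h + 1)]
    for y in range(h):
        row_sum = 0  # running sum of the current row up to column x
        for x in range(w):
            row_sum += img[y][x]
            ii[y + 1][x + 1] = ii[y][x + 1] + row_sum
    return ii


def box_sum(ii, top, left, height, width):
    """Sum of the pixels inside a rectangle, using four table lookups."""
    bottom, right = top + height, left + width
    return ii[bottom][right] - ii[top][right] - ii[bottom][left] + ii[top][left]


def two_rect_feature(ii, top, left, height, width):
    """Left half minus right half of a window (illustrative feature; width must be even)."""
    half = width // 2
    left_sum = box_sum(ii, top, left, height, half)
    right_sum = box_sum(ii, top, left + half, height, half)
    return left_sum - right_sum
```

Because every rectangle sum costs four lookups regardless of its size, a scanner can evaluate many same-sized subwindows without re-summing pixels, which is what makes a cascade of simple classifiers practical.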
U.S. Pat. Appl. Pub. No. 20020159627 of Schneiderman et al. (hereinafter Schneiderman) disclosed an object finder program for detecting the presence of a three-dimensional object in a two-dimensional image containing a two-dimensional representation of the three-dimensional object. The object finder uses the wavelet transform of the input two-dimensional image for object detection. A pre-selected number of view-based detectors are trained on sample images prior to performing the detection on an unknown image. These detectors then operate on the given input image and compute a quantized wavelet transform for the entire input image. The object detection then proceeds with sampling of the quantized wavelet coefficients at different image window locations on the input image and efficient look-up of pre-computed log-likelihood tables to determine object presence. The object finder's coarse-to-fine object detection strategy coupled with exhaustive object search across different positions and scales results in an efficient and accurate object detection scheme.
The disclosed method assumes that a stream of detected faces is fed to the system, where face detection is performed by utilizing a machine learning-based face detection method, similar to the methods disclosed in Jones and Schneiderman.
Face Tracking
There have been prior attempts to track a human face in video using appearance-based cues.
U.S. Pat. No. 6,526,156 of Black, et al. (hereinafter Black) disclosed a system that tracks and identifies view-based representations of an object through a sequence of images. As the view of the object changes due to its motion or the motion of its recording device, the object is identified by matching an image region containing the object with a set of basis images represented by an eigenspace. This identification and tracking system operates when views of the object in the image are deformed under some transformation with respect to the eigenspace. Matching between the image region and the eigenspace is performed via a robust regression formulation that uses a coarse to fine strategy with incremental refinement. The transformation that warps the image region of a current image frame into alignment with the eigenspace is then used to track the object in a subsequent image frame.
U.S. Pat. Appl. Pub. No. 20030161500 of Blake et al. (hereinafter Blake) disclosed a new system and method for probabilistic exemplar-based tracking of patterns or objects. Tracking is accomplished by first extracting a set of exemplars from training data. A dimensionality for each exemplar cluster is then estimated and used for generating a probabilistic likelihood function for each exemplar cluster. Any number of conventional tracking algorithms are then used in combination with the exemplars and the probabilistic likelihood functions for tracking patterns or objects in a sequence of images, or in spatial or frequency domains.
U.S. Pat. No. 6,973,201 of Colmenarez, et al. (hereinafter Colmenarez) disclosed an image processing system that processes a sequence of images to generate a statistical model for each of a number of different persons to be tagged so as to be identifiable in subsequent images. The statistical model for a given tagged person incorporates at least one appearance feature, such as color, texture, etc., and at least one geometric feature, such as shape or position of a designated region of similar appearance within one or more images. The models are applied to subsequent images in order to perform a person detection, person location and/or person tracking operation. An action of the image processing system is controlled based on a result of the operation.
U.S. Pat. Appl. Pub. No. 20050265581 of Porter et al. (hereinafter Porter) disclosed a video data processing apparatus, the video data comprising a sequence of images composed of: an object tracker operable to detect the presence of one or more objects within an image and to track a detected object across successive images; an identity associator operable to associate an identity with an object tracked by the object tracker; a counter operable, for a first and second identity, to count the number of images within which a tracked object associated with the first identity and a tracked object associated with the second identity have both been detected; a similarity detector operable to determine whether two tracked objects are similar in appearance; and the identity associator being operable to change the identity associated with a first tracked object to the identity associated with a second tracked object if: (a) the similarity detector determines that the first and second tracked objects are similar in appearance and (b) the count corresponding to the identities associated with the first and second tracked objects, as counted by the counter, is less than a predetermined threshold.
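The identity-reassignment rule described for Porter can be sketched as follows. This is only an illustration of conditions (a) and (b) above: the class name `IdentityAssociator`, its methods, and the idea of passing the similarity detector's decision in as a boolean are assumptions made for this example, not Porter's implementation.

```python
from itertools import combinations


class IdentityAssociator:
    """Counts how often two identities appear in the same image and
    decides whether one tracked object may take over another's identity."""

    def __init__(self, threshold):
        self.threshold = threshold  # predetermined co-occurrence threshold
        self.co_count = {}          # (id_a, id_b) -> number of shared images

    def observe_frame(self, identities_in_frame):
        """Update the counter for every pair of identities seen in one image."""
        for pair in combinations(sorted(set(identities_in_frame)), 2):
            self.co_count[pair] = self.co_count.get(pair, 0) + 1

    def should_merge(self, id_a, id_b, similar):
        """True only if (a) the objects look similar and
        (b) the identities were seen together in fewer images than the threshold."""
        pair = tuple(sorted((id_a, id_b)))
        return similar and self.co_count.get(pair, 0) < self.threshold
```

One plausible reading of condition (b) is that two identities frequently detected in the same image are likely to be different people, so their tracks should not be merged even when they look alike.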
The disclosed invention utilizes the facial appearance model to keep the identity of people, as in Black and Blake. However, the method does not require offline training or model building, because the proposed application builds pose-dependent online models. Provided with the pose of the face, the appearance model does not need to take the pose (two-dimensional or three-dimensional) variations into account. It does not involve the transformation of the model, as the two-dimensional geometric variations are adjusted using the facial pose correction step, and the three-dimensional variation is handled by employing multiple models. The inventions of Colmenarez and Porter are designed to track multiple faces and keep the person identity at the same time. The proposed invention, however, does not perform explicit tracking, which requires continuity of the tracks; it just makes correspondences between detected faces. Most of these tracking approaches will fail under low frame rates or severe occlusion; the proposed method, however, is still able to track faces under these circumstances.
Facial Pose Estimation
There have been prior attempts to determine the direction in which the human head is facing.
U.S. Pat. No. 6,707,933 of Mariani, et al. (hereinafter Mariani) disclosed a method, apparatus, and computer program product for estimating face direction using a single gray-level image of a face. Given the single image, a face direction can be determined by computing a nose axis maximizing a correlation measure between the left and right sides of the face. The correlation measure is computed by comparing one of the two sides with another synthetic side derived from the other side using symmetry and perspective transforms. The computation result is a word describing the spatial position of the face and combining height (“up”, “normal”, “down”) and neck-rotation (“left”, “frontal”, “right”).
U.S. Pat. No. 6,741,756 of Toyama, et al. (hereinafter Toyama) disclosed a system and method for automatically estimating the orientation or pose of an object, such as a human head, from any viewpoint and includes training and pose estimation modules. The training module uses known head poses for generating observations of the different types of head poses and the pose estimation module receives actual head poses of a subject and uses the training observations to estimate the actual head pose. Namely, the training module receives training data and extracts unique features of the data, projects the features onto corresponding points of a model and determines probability density function estimation for each model point to produce a trained model. The pose estimation module receives the trained model and an input object and extracts unique input features of the input object, projects the input features onto points of the trained model and determines an orientation of the input object that most likely generates the features extracted from input object.
U.S. Pat. No. 7,043,056 of Edwards, et al. (hereinafter Edwards) disclosed a method of determining an eye gaze direction of an observer. The method comprises the steps of: (a) capturing at least one image of the observer and determining a head pose angle of the observer; (b) utilizing the head pose angle to locate an expected eye position of the observer, and (c) analyzing the expected eye position to locate at least one eye of the observer and observing the location of the eye to determine the gaze direction.
U.S. Pat. No. 7,046,826 of Toyama, et al. (hereinafter Toyama 7046826) disclosed a system and method for estimating and tracking an orientation of a user's face by combining head tracking and face detection techniques. The orientation of the face, or facial pose, can be expressed in terms of pitch, roll and yaw of the user's head. Facial pose information can be used, for example, to ascertain in which direction the user is looking. In general, the facial pose estimation method obtains a position of the head and a position of the face and compares the two to obtain the facial pose. In particular, a camera is used to obtain an image containing a user's head. Any movement of the user's head is tracked and the head position is determined. A face then is detected on the head and the face position is determined. The head and face positions are then compared.
U.S. Pat. Appl. Pub. No. 20040240708 of Hu et al. (hereinafter Hu) disclosed a method to effectively assess a user's face and head pose such that a computer or like device can track the user's attention towards a display device(s). Then the region of the display or graphical user interface toward which the user is turned can be automatically selected without requiring the user to provide further inputs. A frontal face detector is applied to detect the user's frontal face and then component detectors detect key facial points such as left/right eye center, left/right mouth corner, nose tip, etc. The system then tracks the user's head by an image tracker and determines yaw, tilt and roll angle and other pose information of the user's head through a coarse to fine process according to key facial points and/or confidence outputs by the pose estimator.
U.S. Pat. Appl. Pub. No. 20050180626 of Moon et al. (hereinafter Moon) disclosed a method for accurately estimating a pose of the human head in natural scenes utilizing positions of the prominent facial features relative to the position of the head. A high-dimensional, randomly sparse representation of a human face, using a simplified facial feature model, transforms a raw face image into sets of vectors representing the fits of the face to a random, sparse set of model configurations. The transformation collects salient features of the face image which are useful to estimate the pose, while suppressing irrelevant variations of face appearance. The relation between the sparse representation and the pose is learned using Support Vector Regression (SVR). The sparse representation, combined with the SVR learning, is then used to estimate a pose of facial images.
“Learning Low Dimensional Invariant Signature of 3-D Object under Varying View and Illumination from 2-D Appearances,” International Conference on Computer Vision, 2001, of S. Li, J. Yan, X. Hou, Z. Li, and H. Zhang (hereinafter Li) proposes an invariant signature representation for appearances of 3-D objects under varying view and illumination, and a method for learning the signature from multi-view appearance examples. Li claims that the signature, a nonlinear feature, provides a good basis for three-dimensional object detection and pose estimation due to the following properties: (1) its location in the signature feature space is a simple function of the view and is insensitive or invariant to illumination; (2) it changes continuously as the view changes, so that the object appearances at all possible views should constitute a known simple curve segment in the feature space; and (3) the coordinates of object appearances in the feature space are correlated in a known way according to a predefined function of the view. The first two properties are provided as a basis for object detection and the third for view pose estimation. To compute the signature representation from input, the article presents a nonlinear regression method for learning a nonlinear mapping from the image space to the feature space.
The prior invention of Mariani solves the problem of facial pose estimation by comparing the relative positions of the facial features, most notably the nose. The estimates put the yaw and pitch of the face in discrete pose bins: (“left”, “frontal”, “right”) and (“up”, “normal”, “down”), where the resolution is not enough to determine whether the person is actually facing the display.
The invention of Toyama builds an explicit parametric (Gaussian) statistical model of the facial feature appearance using training data. The success of the method depends on rough alignment of facial features to the models; misalignment can potentially cause a large degree of error. The present method compares the input patterns against a number of model patterns to compute the likelihood of the given pattern to be from the model. Each likelihood computation is robust due to the use of learning machines, where a large number of faces having a wide range of scene variations, such as noise, lighting, and occlusions, are used to train the machine.
There are prior inventions, such as Edwards, on estimating eye gaze to measure the person's degree of attention; measuring eye gaze usually requires a close range, high resolution image. The proposed method is designed to perform well using far-range low-resolution images.
The invention by Toyama (U.S. Pat. No. 7,046,826) estimates the face orientation by comparing the head position and facial position; the method is susceptible to errors in head or face localization, and is only able to compute relative estimates. The present method is able to produce absolute yaw and pitch (yw, pt) angles, because the system is designed and trained to output absolute (yw, pt) angles.
The head pose estimation method by Hu uses component detectors to first locate facial features and then computes the facial pose, which poses a risk of large error when the component detectors fail. The proposed method learns the holistic pattern to estimate the pose; it does not involve such risk.
The method by Moon is similar to the proposed method in terms of learning the global patterns on a large number of facial images using a machine learning technique. However, learning the whole space of patterns using a single machine is regarded as inefficient due to the wide range of pose variation. The present method overcomes this weakness by using a plurality of learning machines, each of which is specialized to a given pose range. The use of a set of facial feature-based high-frequency filters is again similar. However, in Moon, the range of facial model pose for generating the filter is the whole range of the possible facial pose; in the proposed method the range of the model pose corresponds to the individual inherent pose of the specific machine, thereby providing a more specialized estimation.
The method by Li is similar to the proposed method in terms of using multiple learning machines (in Li, SVR) to represent and estimate the varying pose of faces (or general objects). However, the proposed method represents and estimates both the two-dimensional pose variations and the three-dimensional variations in a way that each machine is trained to estimate the likelihood of the given face having a certain pose, based on the neural tuning principle. In Li, the machines are only used for yaw estimation, and each machine is trained to output a nonlinear function of the yaw angle difference between the input face and the machine's inherent pose. The proposed use of either feature window filters or feature model-based high-frequency filters makes the estimation problem robust to illumination changes in an explicit way.
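The idea of a plurality of machines, each tuned to its own inherent pose, can be illustrated with a small sketch. Everything specific here is an assumption made for illustration: the Gaussian shape of the tuning curve, the `width` parameter, and the response-weighted average used to combine the machines are not stated in this section. In a real system the responses would come from trained learning machines applied to a face image; here they are simulated from a known pose only to show how the combination step could behave.

```python
import math


def tuning_response(face_pose, inherent_pose, width=15.0):
    """Simulated output of one machine: highest when the face pose (yaw, pitch)
    matches the machine's inherent pose. The Gaussian form is an assumption."""
    d_yaw = face_pose[0] - inherent_pose[0]
    d_pitch = face_pose[1] - inherent_pose[1]
    return math.exp(-(d_yaw ** 2 + d_pitch ** 2) / (2.0 * width ** 2))


def estimate_pose(responses, inherent_poses):
    """Combine machine responses into one (yaw, pitch) estimate
    as a response-weighted average of the inherent poses (assumed rule)."""
    total = sum(responses)
    if total <= 0:
        raise ValueError("no machine responded")
    yaw = sum(r * p[0] for r, p in zip(responses, inherent_poses)) / total
    pitch = sum(r * p[1] for r, p in zip(responses, inherent_poses)) / total
    return yaw, pitch
```

For example, with machines tuned to yaw angles of -30, 0, and 30 degrees, a face turned slightly to one side excites the two nearest machines most strongly, and the weighted average lands between their inherent poses.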
Face-Based Demographics Classification
There have been prior attempts to recognize the demographic category of a person by processing the facial image using a machine learning approach.
U.S. Pat. No. 5,781,650 of Lobo, et al. (hereinafter Lobo) disclosed a method for automatically finding facial images of a human face in a digital image, and classifying the age of the person into an age category. Step 1 of the process is to find facial features of the digital image encompassing the chin, sides of the face, and the virtual top of the head, eyes, mouth and nose of the image. Step 2 is to compute the facial feature ratios of the facial features found in Step 1. Step 3 is to compute a wrinkle analysis of the image. Step 4 is to combine the previous two steps to categorize the age of the facial image. The invention can locate and detect facial images for age classification from digital camera images and computerized generated images.
U.S. Pat. No. 6,990,217 of Moghaddam, et al. (hereinafter Moghaddam) disclosed a method to employ Support Vector Machines (SVMs) to classify images of faces according to gender, by training the images, including images of male and female faces; determining a plurality of support vectors from the training images for identifying a hyperplane for the gender decision; and reducing the resolution of the training images and the test image by sub-sampling before supplying the images to the Support Vector Machine.
“Support Vector Learning for Gender Classification Using Audio and Visual Cues,” International Journal of Pattern Recognition and Artificial Intelligence, Vol. 17(3), 2003, of L. Walawalkar, M. Yeasin, A. M. Narasimhamurthy, and R. Sharma (hereinafter Walawalkar) disclosed a computer software system for multi-modal human gender classification, comprising: a first-mode classifier classifying first-mode data pertaining to male and female subjects according to gender and rendering a first-mode gender-decision for each male and female subject; a second-mode classifier classifying second-mode data pertaining to male and female subjects according to gender and rendering a second-mode gender-decision for each male and female subject; and a fusion classifier integrating the individual gender decisions obtained from said first-mode classifier and said second-mode classifier and outputting a joint gender decision for each of said male and female subjects.
The prior art mentioned above (Lobo, Moghaddam, and Walawalkar) for demographics classification aims to classify a certain class of demographics profile (either age or gender) based on the image signature of faces. The approaches by Moghaddam and Walawalkar deal with a much smaller scope of problems than the claimed method tries to solve; they both assume that the facial regions are identified and only address the problem of individual face classification. They do not address the problem of detecting and tracking the faces for determining the demographic identity of a person over the course of his/her facial exposure to the imaging device. Lobo claims a more comprehensive solution to the problem of face detection, feature detection, and age classification. However, this approach depends heavily on model-based face detection and facial feature detection under close-range, high-resolution frontal face images. The proposed invention assumes a much less constrained scenario, and can deal with varied poses (by using pose-dependent models) and low-resolution facial images (by using holistic features of the face).
The proposed invention is a comprehensive solution where the automated system estimates and corrects the two-dimensional pose variations, estimates the three-dimensional pose, and tracks the people's faces individually. The use of a pose-dependent facial appearance model brings a twofold improvement. First, it improves the demographics classification accuracy because the accumulated appearance model better represents the appearance of the person's face in that it smooths out the noise and averages out potential accidental scene variables such as lighting changes and occlusions. Second, instead of performing classification on every detected face, which is typically computationally expensive, the proposed method performs classification only once per person track, at the completion of the track, which makes the whole system very efficient.
In summary, the present invention provides a comprehensive solution for processing the stream of facial images for the purpose of classifying them into demographics categories. It estimates and corrects facial pose, tracks multiple faces, builds pose-dependent appearance models, and performs classification on the appearance models. All of these steps work toward providing accurate tracking and accurate and efficient demographics classification.
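The two benefits described above (averaging over a track, and classifying once at track completion) can be sketched as follows. This is a minimal illustration under stated assumptions: faces are flattened lists of pixel values already assigned to a pose bin, the names `PoseDependentTrack` and `classify_track` are hypothetical, each classifier is a placeholder callable returning a score, and combining the per-pose scores by a face-count-weighted average is an assumed rule that this section does not specify.

```python
class PoseDependentTrack:
    """Accumulates one averaged appearance model per pose bin for a single person track."""

    def __init__(self, pose_bins):
        self.sums = {p: None for p in pose_bins}   # per-bin pixel sums
        self.counts = {p: 0 for p in pose_bins}    # per-bin number of faces

    def add_face(self, pose_bin, pixels):
        """Add one pose-corrected face (flat list of pixel values) to its pose bin."""
        if self.sums[pose_bin] is None:
            self.sums[pose_bin] = [0.0] * len(pixels)
        acc = self.sums[pose_bin]
        for i, value in enumerate(pixels):
            acc[i] += value
        self.counts[pose_bin] += 1

    def models(self):
        """Return the mean face for every pose bin that received at least one face."""
        return {
            p: [s / self.counts[p] for s in self.sums[p]]
            for p in self.sums
            if self.counts[p] > 0
        }


def classify_track(track, classifiers):
    """Run each pose-specific classifier once on its averaged model and
    combine the scores weighted by face count (assumed combination rule)."""
    total, weight = 0.0, 0
    for pose_bin, model in track.models().items():
        n = track.counts[pose_bin]
        total += n * classifiers[pose_bin](model)
        weight += n
    if weight == 0:
        raise ValueError("track contains no faces")
    return total / weight
```

Averaging suppresses frame-to-frame noise in each pose bin, and the classifiers run once per non-empty bin per track rather than once per detected face.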