1. Field of the Invention
The present invention is a system and method for determining the age of people based on their facial images, using a multi-category decomposition architecture of classifiers.
2. Background of the Invention
The method is based on an observation about facial image ensembles in the image pixel space: there is a great degree of variability within each set of facial images from the same age group due to gender, ethnicity, and individual differences, so that it is hard to recognize the age group of a person using a traditional multi-class classification approach. Machine learning-based classification methods have been successfully applied to many classification problems when enough training examples are available. These methods estimate a mapping from the image space to the space of real numbers or to a discrete set of categories, using the known relationship between the training images and the ground-truth target values or class labels. The mapping should, therefore, disregard all the irrelevant variations of the image ensembles from the same category, such as lighting, pose, hairstyles, gender, ethnicity, etc. However, it is hard for a single learning machine, or for several machines (each classifying one category against the other categories), to learn and represent such large degrees of variation.
Many complex classification problems can be handled using multi-classifier architectures. A parallel multi-classifier is one such architecture: multiple specialized learning machines are trained, where each machine is tuned to instances from a specific class. The input data is processed by the set of classifiers in a parallel manner, and the final decision is made based on all of the responses from these specialized machines. Many multi-class classification problems are handled this way. Another kind is a serial multi-classifier architecture, where the first classifier performs a gross-level classification and the subsequent classifiers perform finer-level classification. The series of classifications is applied in a serial manner. In the example of age classification, the first classifier can perform children versus non-children classification, the second classifier can perform adult versus senior classification on non-child instances, and so on.
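The serial architecture described above can be illustrated with a minimal sketch. The stage functions, the single scalar input feature, and the thresholds 0.3 and 0.7 are hypothetical placeholders chosen only for illustration; in practice each stage would be a trained learning machine operating on facial image data.

```python
from typing import Callable, Optional, Sequence

# A stage returns a final label, or None to defer to the next, finer-level stage.
Stage = Callable[[float], Optional[str]]


def serial_classify(x: float, stages: Sequence[Stage], default: str = "unknown") -> str:
    """Apply the stages in order and stop at the first one that commits to a label."""
    for stage in stages:
        label = stage(x)
        if label is not None:
            return label
    return default


# Hypothetical toy stages on a single scalar feature (an assumption, not from the text).
def child_vs_non_child(x: float) -> Optional[str]:
    # Gross-level stage: commit to "child" or pass the instance on.
    return "child" if x < 0.3 else None


def adult_vs_senior(x: float) -> Optional[str]:
    # Finer-level stage: only ever sees non-child instances.
    return "adult" if x < 0.7 else "senior"


STAGES = [child_vs_non_child, adult_vs_senior]
```

Calling `serial_classify(0.5, STAGES)` passes the instance through the first stage, which defers, and the second stage then decides between adult and senior. A hard decision at an early stage cannot be revised later, which is one motivation for the soft, likelihood-based hybrid architecture.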
The third kind of multi-classifier architecture is the hybrid multi-classifier architecture. The classification problem is broken down into multiple hybrid classifications, where each classifier performs both the gross-level classification and the specialized classification. Each hybrid classifier is tuned to a specific gross-level class, and also performs finer-level classification that is specialized to the given gross-level class. However, the gross-level classifier does not have to make a hard decision, because it can simply output the gross-level class membership likelihood. The same machine or a separate machine can perform the finer-level classification for instances from all the gross-level classes, but it is specialized to the given gross-level class. The specialization can be implemented by enforcing more accurate finer-level classifier outputs for the instances from the given gross-level class. For example, in the age classification problem, one male hybrid classifier is tuned to male instances and one female hybrid classifier is tuned to female instances. Each classifier also performs age classification (for example, into children, adults, and seniors) for all possible data, but the male classifier is specialized to the age classification of male instances and the female classifier is specialized to the age classification of female instances. The final decision is made by a classifier fusion scheme, based on the outputs from all the hybrid classifiers.
The present invention handles the age classification problem by introducing a multi-category decomposition architecture, which is an exemplary embodiment of the hybrid multi-classifier architecture, where the learning machines are structured and trained to represent the face manifold parameterized by the appearance-based demographics categories. The aforementioned difficulty of learning a high-level concept (such as age categories) is handled by decomposing the facial image space into subspaces, where each subspace represents a demographics category. The main idea is to group faces having similar appearances, and to perform the classification within each group using a specialized hybrid classifier. Pixel appearance-based clustering can also be considered; however, clustering based on pixel values does not usually yield meaningful clusters. The clusters may reflect rather arbitrary and accidental features, such as lighting or pose variation. The present invention instead makes use of auxiliary category information (the demographics categories that are not of direct interest) to group the facial images. Age classification can greatly benefit from this scheme, due to the great degree of appearance variability within each gender and ethnicity group. In the case of ethnicity classification, gender and age are the auxiliary categories. A specialized hybrid learning machine is dedicated to each auxiliary class (that is, each gross-level class defined by gender and ethnicity), and age classification is performed within the class. In one of the exemplary embodiments, the multi-category decomposition parameterizes the gender and ethnicity variation using multiple learning machines; each learning machine is trained to respond to a specific (gender, ethnicity) category, and at the same time trained to classify the age group within that (gender, ethnicity) category.
The strength of the approach comes from the fact that each machine specializes in the age classification for a limited variation of facial appearance (from the same gender and ethnicity). In an exemplary embodiment, the specialization of a learning machine to a given auxiliary category can be implemented using a multi-manifold learning scheme, where the space of facial images is expanded to multiple face manifolds, each corresponding to one of the multiple auxiliary demographics classes, and where each manifold is parameterized by the age vector; the age vector is a representation of age using multiple age-tuned functions. However, the system does not make a hard decision about which auxiliary class a face belongs to. Because there is uncertainty in the class membership, all of the auxiliary class learning machines contribute to the final decision, where each contribution is weighted by the likelihood of the given face belonging to the corresponding auxiliary class.
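The likelihood-weighted contribution described above can be sketched as follows. This is only an illustrative reading of the fusion step: the function name, the dictionary layout, the example class names, and the normalization of the likelihoods to sum to one are assumptions introduced here, not details given in the text.

```python
from typing import Dict, List


def fuse_age_scores(
    aux_likelihoods: Dict[str, float],
    age_scores: Dict[str, List[float]],
) -> List[float]:
    """Combine per-auxiliary-class age scores, weighted by class likelihood.

    aux_likelihoods: auxiliary class -> likelihood that the face belongs to it.
    age_scores: auxiliary class -> age-group scores from that class's machine.
    """
    total = sum(aux_likelihoods.values())
    if total <= 0:
        raise ValueError("at least one auxiliary class likelihood must be positive")
    n_groups = len(next(iter(age_scores.values())))
    fused = [0.0] * n_groups
    for aux_class, likelihood in aux_likelihoods.items():
        weight = likelihood / total  # assumption: weights are normalized to sum to 1
        for i, score in enumerate(age_scores[aux_class]):
            fused[i] += weight * score
    return fused


# Hypothetical example with two auxiliary classes and three age groups
# (child, adult, senior); all numbers are made up for illustration.
example_likelihoods = {"male": 0.75, "female": 0.25}
example_scores = {"male": [0.1, 0.8, 0.1], "female": [0.5, 0.4, 0.1]}
example_fused = fuse_age_scores(example_likelihoods, example_scores)
```

Because no machine is discarded, an uncertain auxiliary class estimate degrades the result gradually rather than routing the face to a single, possibly wrong, specialized classifier.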
Virtually all demographics classification systems employ some kind of training based on ground-truth demographics labels. Annotating auxiliary information in addition to the age value or age group therefore does not require much extra effort in the present invention.
The present invention can also handle age classification more effectively by extracting age sensitive features of human faces and conducting classification on these age sensitive features, rather than simply using the raw face images for classifying ages. The age sensitive features can be computed by high-frequency filters that are tuned to the locations and sizes of facial features or facial wrinkles.
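As a rough illustration of a high-frequency filter restricted to a facial region, the sketch below averages the absolute response of a discrete Laplacian inside a rectangular window. The choice of the Laplacian, the rectangular region, and all function names are assumptions made for this example; the text does not specify the filters actually used.

```python
from typing import List

Image = List[List[float]]  # grayscale image as rows of pixel values


def laplacian_response(img: Image, r: int, c: int) -> float:
    """4-neighbour discrete Laplacian at an interior pixel (r, c)."""
    return 4 * img[r][c] - img[r - 1][c] - img[r + 1][c] - img[r][c - 1] - img[r][c + 1]


def region_high_freq_energy(img: Image, top: int, left: int, height: int, width: int) -> float:
    """Mean absolute Laplacian response inside a window, skipping border pixels."""
    total, count = 0.0, 0
    for r in range(max(top, 1), min(top + height, len(img) - 1)):
        for c in range(max(left, 1), min(left + width, len(img[0]) - 1)):
            total += abs(laplacian_response(img, r, c))
            count += 1
    return total / count if count else 0.0


# Toy inputs: a flat patch (no fine texture) and a striped patch (strong fine texture).
flat_patch = [[5.0] * 5 for _ in range(5)]
striped_patch = [[10.0 if c % 2 else 0.0 for c in range(5)] for _ in range(5)]
```

A smooth region yields a response near zero, while a region with fine, wrinkle-like texture yields a large response. A vector of such responses over several hypothetical regions (for example, around the eyes or forehead) could then be passed to the learning machines instead of raw pixels.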
There have been prior attempts at demographics classification based on facial images of people.
U.S. Pat. No. 5,781,650 of Lobo, et al. (hereinafter Lobo) disclosed a method for automatically finding facial images of a human face in a digital image, and classifying the age of the person into an age category. Step 1 of the process is to find facial features of the digital image encompassing the chin, sides of the face, virtual top of the head, eyes, mouth and nose of the image. Step 2 is to compute the facial feature ratios of the facial features found in Step 1. Step 3 is to compute a wrinkle analysis of the image. Step 4 is to combine the previous two steps to categorize the age of the facial image. The invention can locate and detect facial images for age classification from digital camera images and computer generated images.
U.S. Pat. No. 6,990,217 of Moghaddam, et al. (hereinafter Moghaddam) disclosed a method to employ Support Vector Machines (SVMs) to classify images of faces according to gender, by training on images that include male and female faces; determining a plurality of support vectors from the training images for identifying a hyperplane for the gender decision; and reducing the resolution of the training images and the test image by sub-sampling before supplying the images to the Support Vector Machine.
U.S. patent application Ser. No. 11/811,614 filed on Jun. 11, 2007 of Moon, et al. (hereinafter Moon) disclosed a method and system to provide a face-based automatic demographics classification system that is robust to pose changes of the target faces and to accidental scene variables, such as noise, lighting, and occlusion. Given a video stream of people's faces detected from a face detector, the two-dimensional and three-dimensional poses are estimated to facilitate the tracking and the building of pose-dependent facial models. Once the track is complete, the separately built pose-dependent facial models are fed to the demographics classifiers that are again trained using only the faces having the corresponding pose, to determine the final face category, such as gender, age, and ethnicity of the person.
“Mixture of experts for classification of gender ethnic origin and pose of human faces.” IEEE Transactions on Neural Networks, 11(4):948-960, 2000, S. Gutta, J. R. et al. (hereinafter Gutta), disclosed a method to classify gender and ethnicity of human faces using mixtures of experts. The mixture of experts is implemented using the “divide and conquer” modularity principle with respect to the granularity and/or the locality of information. The mixture of experts consists of ensembles of radial basis functions (RBFs). Inductive decision trees (DTs) and support vector machines (SVMs) implement the “gating network” components for deciding which of the experts should be used to determine the classification output and to restrict the support of the input space.
In Lobo, the problem of age classification is handled by focusing on local features that are relevant to aging. The approach is local feature-based and classifies each image individually. While Lobo makes use of local features to solve the age classification problem as the present invention does, the two approaches differ substantially. The proposed invention performs machine learning training and classification on the extracted age sensitive features so that the machine can automatically learn the relation between these features and the person's age, while Lobo explicitly detects the facial features and wrinkle features and performs rule-based analysis. In the present invention, there is no risk of error in feature detection, because the method collects all potential age sensitive feature responses and the machine learning training learns the relevancy of the features in the context of age classification.
Moghaddam proposed employing an SVM to find the optimal separating hyperplane in feature space to solve the gender recognition problem. This is a typical approach to the demographics recognition problem: estimating the direct relation from the facial image to the demographics labels (such as male, female, etc.). While the age classification problem can be solved in the same manner, a small number of SVMs must then learn the concept of age, where there is significant within-class variation. The proposed invention solves the issue by partitioning the facial image space into meaningful groups based on the auxiliary demographics categories such as gender and ethnicity.
In Moon, a comprehensive approach to performing demographics classification from tracked facial images was introduced. The method used to carry out the demographics classification, including the ethnicity classification, also utilizes a conventional machine learning approach to find a mapping from the facial image data to the class labels. Moon put an emphasis on solving the nontrivial problem of pose for the demographics classification, while the present invention focuses on the problem of learning the demographics concept by decomposing the facial image space into auxiliary demographics classes. The present invention also utilizes a 2D facial geometry and correction method similar to the method disclosed in Moon.
In Gutta, methods to classify gender, ethnicity, and pose using ensembles of neural networks and decision trees were introduced. While Gutta also uses multiple learning machines (RBF neural networks and decision trees) for the classification problems, the learning machines are used blindly, without any regard to other demographics information. The face images are clustered using the k-means clustering algorithm based on facial appearance. However, clustering based on appearance (pixel values) does not usually yield meaningful clusters. The clusters may reflect rather arbitrary features, such as lighting or pose variation. The present invention systematically uses other auxiliary demographics information to group the facial images, effectively dividing the age classification into meaningful classes.
There have been prior attempts at finding class information of data by utilizing information from another class or the data attributes in another dimension.
U.S. Pat. No. 5,537,488 of Menon, et al. (hereinafter Menon) disclosed a pattern recognition system. In the training step, multiple training input patterns from multiple classes of subjects are grouped into clusters within categories by computing correlations between the training patterns and the present category definitions. After training, each category is labeled in accordance with the peak class of patterns received within the cluster of the category. If the domination of the peak class over the other classes in the category exceeds a preset threshold, then the peak class defines the category. If the contrast does not exceed the threshold, then the category is defined as unknown. The class statistics for each category are stored in the form of a training class histogram for the category. During testing, frames of test data are received from a subject and are correlated with the category definitions. Each frame is associated with the training class histogram for the closest correlated category.
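The category labeling rule summarized above (peak class, dominance threshold, and a stored class histogram) can be sketched as follows. How Menon measures the domination of the peak class is not stated in this summary; the sketch assumes, purely for illustration, that it is the peak class's share of all patterns in the category. The function name and return layout are likewise hypothetical.

```python
from collections import Counter
from typing import Iterable, Tuple


def label_category(class_labels: Iterable[str], threshold: float) -> Tuple[str, Counter]:
    """Label a category by its peak class, or "unknown" if the peak is not dominant.

    Returns the label together with the training class histogram for the category.
    """
    histogram = Counter(class_labels)
    if not histogram:
        return "unknown", histogram
    peak_class, peak_count = histogram.most_common(1)[0]
    # Assumption: dominance is the fraction of patterns belonging to the peak class.
    dominance = peak_count / sum(histogram.values())
    label = peak_class if dominance > threshold else "unknown"
    return label, histogram
```

For example, a category that received three patterns of class "a" and one of class "b" would be labeled "a" at a threshold of 0.6, while an evenly split category would be labeled "unknown".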
U.S. Patent Application 20020169730 of Lazaridis, et al. (hereinafter Lazaridis) disclosed computational methods for classifying a plurality of objects or for identifying one or more latent classes among a plurality of objects. The methods glean relationships across at least two distinct sets of objects, allowing one to identify latent classes of objects along one set of margins, observations about which objects provide insight into possible properties or characteristics of objects along another set of margins.
U.S. Patent Application 20030210808 of Chen, et al. (hereinafter Chen) disclosed a method of organizing images of human faces in digital images into clusters, comprising the steps of: locating images of human faces in the digital images using a face detector; extracting the located human face images from the digital images; and forming clusters of the extracted human face images, each cluster representing an individual using a face recognizer.
U.S. Pat. No. 7,236,615 of Miller, et al. (hereinafter Miller) disclosed a method for human face detection that detects faces independently of their particular poses and simultaneously estimates those poses. The method exhibits immunity to variations in skin color, eyeglasses, facial hair, lighting, scale, and facial expressions, among others. A convolutional neural network is trained to map face images to points on a face manifold, and non-face images to points far away from that manifold, wherein that manifold is parameterized by facial pose. Conceptually, the pose parameter is viewed as a latent variable, which may be inferred through an energy-minimization process. To train systems based on the method, Miller derives a new type of discriminative loss function that is tailored to such detection tasks. The method enables a multi-view detector that can detect faces in a variety of poses, for example, looking left or right (yaw axis), up or down (pitch axis), or tilting left or right (roll axis).
The present invention employs an auxiliary class determination method similar to Menon; it simply utilizes the auxiliary class likelihood to weight the age outputs. Lazaridis proposed approaches to identifying one or more latent classes among data by utilizing the class information or data attributes in another dimension. To extract more reliable age information, the present invention makes use of the auxiliary class information. The present invention shares its very broad framework with Lazaridis; however, it proposes a novel approach that utilizes the fact that the age comparison is more meaningful within the same gender or ethnicity class. Chen introduced a facial image clustering method where the clustering is based on the similarity score from a face recognizer. The present invention utilizes the auxiliary class (membership) likelihood to weight the age scores; however, the class clusters come from auxiliary demographics information rather than from appearance-based scores as in Chen. The present invention shares a fundamental idea with Miller, that of using auxiliary or latent information to improve classification. In Miller, however, the space of facial images is expanded by a convolutional neural network to a single face manifold parameterized by a continuous pose parameter, which is assumed to be available, for the purpose of distinguishing faces from non-faces. In the present invention, the space is expanded to multiple face manifolds, each corresponding to one of the multiple auxiliary demographics classes, where each manifold is parameterized by the age vector.
There have been prior attempts at detecting human faces in still images or in videos.
U.S. Pat. Appl. Pub. No. 20020102024 of Jones, et al. (hereinafter Jones) disclosed an object detection system for detecting instances of an object in a digital image using an image integrator and an object detector, which includes a classifier (classification function) and an image scanner. The image integrator receives an input image and calculates an integral image representation of the input image. The image scanner scans the image in same sized subwindows. The object detector uses a cascade of homogeneous classification functions or classifiers to classify the subwindows as to whether each subwindow is likely to contain an instance of the object. Each classifier evaluates one or more features of the object to determine the presence of such features in a subwindow that would indicate the likelihood of an instance of the object in the subwindow.
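The integral image representation mentioned above can be illustrated with a short sketch. It shows only the general technique (a summed-area table that makes any rectangular pixel sum available with four lookups); the function names and the zero-padded layout are choices made for this example and are not taken from Jones.

```python
from typing import List


def integral_image(img: List[List[int]]) -> List[List[int]]:
    """Summed-area table with one extra row and column of zeros at the top and left.

    ii[r][c] holds the sum of img over rows [0, r) and columns [0, c).
    """
    h, w = len(img), len(img[0])
    ii = [[0] * (w + 1) for _ in range(h + 1)]
    for r in range(h):
        row_sum = 0
        for c in range(w):
            row_sum += img[r][c]
            ii[r + 1][c + 1] = ii[r][c + 1] + row_sum
    return ii


def box_sum(ii: List[List[int]], top: int, left: int, height: int, width: int) -> int:
    """Sum of the pixels in a rectangle, using four lookups in the integral image."""
    bottom, right = top + height, left + width
    return ii[bottom][right] - ii[top][right] - ii[bottom][left] + ii[top][left]


# Small example image used to check the rectangle sums.
sample = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
sample_ii = integral_image(sample)
```

Because every rectangle sum costs the same regardless of its size, rectangular features can be evaluated quickly for each same-sized subwindow during scanning.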
U.S. Pat. Appl. Pub. No. 20020159627 of Schneiderman, et al. (hereinafter Schneiderman) disclosed an object finder program for detecting the presence of a three-dimensional object in a two-dimensional image containing a two-dimensional representation of the three-dimensional object. The object finder uses the wavelet transform of the input two-dimensional image for object detection. A pre-selected number of view-based detectors are trained on sample images prior to performing the detection on an unknown image. These detectors then operate on the given input image and compute a quantized wavelet transform for the entire input image. The object detection then proceeds with sampling of the quantized wavelet coefficients at different image window locations on the input image and efficient look-up of pre-computed log-likelihood tables to determine object presence. The object finder's coarse-to-fine object detection strategy coupled with exhaustive object search across different positions and scales results in an efficient and accurate object detection scheme.
The disclosed method assumes that a stream of detected faces is fed to the system, where face detection is performed by utilizing a machine learning-based face detection method similar to the methods disclosed in Jones and Schneiderman.
There have been prior attempts at tracking a human face in video using appearance-based cues.
U.S. Pat. No. 6,526,156 of Black, et al. (hereinafter Black) disclosed a system that tracks and identifies view-based representations of an object through a sequence of images. As the view of the object changes due to its motion or the motion of its recording device, the object is identified by matching an image region containing the object with a set of basis images represented by an eigenspace. This identification and tracking system operates when views of the object in the image are deformed under some transformation with respect to the eigenspace. Matching between the image region and the eigenspace is performed via a robust regression formulation that uses a coarse-to-fine strategy with incremental refinement. The transformation that warps the image region of a current image frame into alignment with the eigenspace is then used to track the object in a subsequent image frame.
U.S. Pat. Appl. Pub. No. 20030161500 of Blake, et al. (hereinafter Blake) disclosed a new system and method for probabilistic exemplar-based tracking of patterns or objects. Tracking is accomplished by first extracting a set of exemplars from training data. A dimensionality for each exemplar cluster is then estimated and used for generating a probabilistic likelihood function for each exemplar cluster. Any number of conventional tracking algorithms is then used in combination with the exemplars and the probabilistic likelihood functions for tracking patterns or objects in a sequence of images, or in spatial or frequency domains.
U.S. Pat. No. 6,973,201 of Colmenarez, et al. (hereinafter Colmenarez) disclosed an image processing system that processes a sequence of images to generate a statistical model for each of a number of different persons to be tagged so as to be identifiable in subsequent images. The statistical model for a given tagged person incorporates at least one appearance feature, such as color, texture, etc., and at least one geometric feature, such as shape or position of a designated region of similar appearance within one or more images. The models are applied to subsequent images in order to perform a person detection, person location and/or person tracking operation. An action of the image processing system is controlled based on a result of the operation.
U.S. Pat. Appl. Pub. No. 20050265581 of Porter, et al. (hereinafter Porter) disclosed a video data processing apparatus, the video data comprising a sequence of images, the apparatus comprising: an object tracker operable to detect the presence of one or more objects within an image and to track a detected object across successive images; an identity associator operable to associate an identity with an object tracked by the object tracker; a counter operable, for a first and second identity, to count the number of images within which a tracked object associated with the first identity and a tracked object associated with the second identity have both been detected; a similarity detector operable to determine whether two tracked objects are similar in appearance; and the identity associator being operable to change the identity associated with a first tracked object to the identity associated with a second tracked object if: (a) the similarity detector determines that the first and second tracked objects are similar in appearance and (b) the count corresponding to the identities associated with the first and second tracked objects, as counted by the counter, is less than a predetermined threshold.
The disclosed invention utilizes facial appearance to maintain the identity of people, as in Black and Blake. However, the method does not require offline training or model building, because the proposed application builds online models. The inventions of Colmenarez and Porter are designed to track multiple faces and maintain the person identity at the same time. The proposed invention, however, does not perform explicit tracking, which requires continuity of the tracks; it only makes correspondences between detected faces. Most of these tracking approaches will fail under low frame rates or severe occlusion; the proposed method, however, is still able to keep track of faces under these circumstances.
In summary, the present invention proposes a method to detect and track faces and to classify the age of people from their facial images. It employs face detection, face tracking, and 2D facial pose estimation in a manner similar to prior inventions, but has a novel way of dividing the age classification problem into meaningful auxiliary classes within which the age classification takes place. While some of the prior inventions use a similar principle of decomposing the classification problem into multiple specialized classifications, each of these classifiers is specialized to appearance-based clusters, which can be arbitrary groups reflecting different lighting or other non-essential features. The present invention also handles the age classification more effectively by using age sensitive features from facial images for both training and classification. The present invention systematically uses other auxiliary demographics information (such as gender and ethnicity) to group the facial images, and each specialized classification is performed within a meaningful demographics class. The classification results from multiple machines are fused using decision trees in the prior invention, while continuous integration, meaningful in a probabilistic sense, is used in the present invention.