1. Field of the Invention
The present invention is a method and system to provide a face-based automatic gender recognition system that utilizes automatically decoupled facial feature geometry and appearance information, and automatically extracted hairstyle of humans.
2. Background of the Invention
Automatic gender recognition using images has a wide range of applications, such as security, marketing, and computer user interfaces. Online applications, such as computer user interfaces or gender-targeted advertisements, especially demand highly accurate gender recognition capabilities.
Traditional gender recognition methods make use of holistic facial appearance and/or bodily features that are specific to certain dress codes or ethnic groups. The use of holistic facial appearance for gender recognition fits well into the framework of machine learning-based classification, because facial appearance has a common structure across the human population, so that faces can be compared against each other, and at the same time provides useful appearance information to differentiate gender. It is well known in the pattern analysis community that one can achieve higher recognition accuracy when the patterns are aligned more accurately. In general, when the overall patterns are aligned, the learning machine does a better job of identifying the fine-level features necessary for identifying the difference between classes.
For the gender recognition problem, the manner through which the human brain processes the visual information from a human face to determine gender is not completely understood. However, there are certain features that are known to contribute more to the task of gender recognition; studies revealed that certain parts of the face or facial features provide more decisive image information for gender recognition. For example, there is a general consensus that differences in the size and shape between male eyebrows and female eyebrows exist.
On the other hand, studies have revealed that using only facial images for gender recognition has limitations; even gender recognition by humans using only facial images has been shown to have such limitations. Humans make use of other image cues, such as hairstyles, body shape, and dress codes, for determining gender.
The present invention proposes a method that makes use of both the parts-based image features and global geometry-based features for an accurate recognition of gender. Both the global-level face localization method and the feature-level localization method have been designed as crucial components of the approach. The face localization method aligns the global shapes of the faces. It also provides approximate facial feature locations that provide a basis for more refined facial feature localization. The appearance of the localized facial features is then extracted, along with the global facial geometry features, to form a feature vector that effectively represents the gender-sensitive image features of a given face. The present invention also extracts and makes use of non-facial features to be added to the gender-sensitive feature vector. The hairstyle is segmented out from facial images based on an involved analysis: the facial skin and hair tone pixels are sampled based on the accurate localization of faces and on the skin-hair tone discriminant analysis, using a large number of skin tone and hair tone samples. The actual recognition task is carried out by training a learning machine on the collected gender-sensitive feature vectors. The face localization and facial feature localization both involve training images annotated with facial feature locations; the face localization assumes roughly detected facial regions from the face detection step, and facial feature localization assumes corrected facial images using face localization. Both sets of training data are prepared based on these assumptions.
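The assembly of the gender-sensitive feature vector described above can be illustrated with a minimal sketch. It assumes that each localized facial feature is reduced to a position, a size, and a small gray-level patch, and that the hairstyle is summarized by a short descriptor; the names LocalizedFeature, geometry_part, appearance_part, and gender_feature_vector, as well as the field layout, are hypothetical and are not taken from the disclosure.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class LocalizedFeature:
    """One localized facial feature (all fields are hypothetical)."""
    name: str
    x: float                  # feature center, in face-normalized coordinates
    y: float
    size: float               # feature scale relative to the face window
    patch: List[List[float]]  # gray-level appearance cropped after localization


def geometry_part(features: Sequence[LocalizedFeature]) -> List[float]:
    """Global facial geometry: where each feature sits and how large it is."""
    return [v for f in features for v in (f.x, f.y, f.size)]


def appearance_part(features: Sequence[LocalizedFeature]) -> List[float]:
    """Feature appearance: the flattened patches, independent of position."""
    return [v for f in features for row in f.patch for v in row]


def gender_feature_vector(features: Sequence[LocalizedFeature],
                          hair_descriptor: Sequence[float]) -> List[float]:
    """Concatenate decoupled geometry, appearance, and hairstyle entries."""
    return geometry_part(features) + appearance_part(features) + list(hair_descriptor)
```

Because geometry and appearance occupy separate segments of the vector, a learning machine trained on such vectors can weigh the two kinds of information independently, which is the point of decoupling them.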
There have been prior attempts for detecting human faces in still images or in videos.
U.S. Pat. No. 6,829,384 of Schneiderman, et al. (hereinafter Schneiderman) disclosed an object finder program for detecting the presence of a three-dimensional object in a two-dimensional image containing a two-dimensional representation of the three-dimensional object. The object finder uses the wavelet transform of the input two-dimensional image for object detection. A preselected number of view-based detectors are trained on sample images prior to performing the detection on an unknown image. These detectors then operate on the given input image and compute a quantized wavelet transform for the entire input image. Object detection then proceeds with a sampling of the quantized wavelet coefficients at different image window locations on the input image, and efficient look-up of precomputed log-likelihood tables to determine object presence. The object finder's coarse-to-fine object detection strategy coupled with exhaustive object search across different positions and scales results in an efficient and accurate object detection scheme.
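The table look-up step described for Schneiderman can be sketched as follows. This is only an illustration under simplifying assumptions: the quantized coefficients are given as a two-dimensional array of small integer codes, one log-likelihood table exists per position inside the window, and a single scale and a single view are searched. The function names, the table layout, and the threshold are hypothetical.

```python
from typing import List, Tuple


def window_score(quantized: List[List[int]],
                 tables: List[List[List[float]]],
                 top: int, left: int, size: int) -> float:
    """Sum the precomputed log-likelihood entries selected by the codes in one window."""
    score = 0.0
    for dy in range(size):
        for dx in range(size):
            code = quantized[top + dy][left + dx]
            score += tables[dy][dx][code]  # one table per in-window position
    return score


def detect(quantized: List[List[int]],
           tables: List[List[List[float]]],
           size: int, threshold: float = 0.0) -> List[Tuple[int, int]]:
    """Exhaustively scan all window positions and keep those scoring above threshold."""
    hits = []
    for top in range(len(quantized) - size + 1):
        for left in range(len(quantized[0]) - size + 1):
            if window_score(quantized, tables, top, left, size) > threshold:
                hits.append((top, left))
    return hits
```

The sketch omits the coarse-to-fine evaluation and the search across scales mentioned above; it only shows why table look-ups make the per-window cost a simple sum.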
U.S. Pat. No. 7,031,499 of Viola, et al. (hereinafter Viola) disclosed an object detection system for detecting instances of an object in a digital image using an image integrator and an object detector, which includes a classifier (classification function) and an image scanner. The image integrator receives an input image and calculates an integral image representation of the input image. The image scanner scans the image in same sized subwindows. The object detector uses a cascade of homogeneous classification functions or classifiers to classify the subwindows as to whether each subwindow is likely to contain an instance of the object. Each classifier evaluates one or more features of the object to determine the presence of such features in a subwindow that would indicate the likelihood of an instance of the object in the subwindow.
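The integral image representation mentioned for Viola can be illustrated with a short sketch. It assumes a gray-level image stored as a list of rows; the two-rectangle feature at the end is a hypothetical example of the kind of feature a classifier in the cascade might evaluate, and the function names are chosen for illustration only.

```python
from typing import List


def integral_image(img: List[List[int]]) -> List[List[int]]:
    """Return an (h+1) x (w+1) table whose entry [y][x] is the sum of img[:y][:x]."""
    h, w = len(img), len(img[0])
    ii = [[0] * (w + 1) for _ in range(h + 1)]
    for y in range(h):
        row_sum = 0
        for x in range(w):
            row_sum += img[y][x]
            ii[y + 1][x + 1] = ii[y][x + 1] + row_sum
    return ii


def box_sum(ii: List[List[int]], top: int, left: int, height: int, width: int) -> int:
    """Sum of any rectangle from four table look-ups, independent of its size."""
    bottom, right = top + height, left + width
    return ii[bottom][right] - ii[top][right] - ii[bottom][left] + ii[top][left]


def two_rect_feature(ii: List[List[int]], top: int, left: int,
                     height: int, width: int) -> int:
    """Hypothetical feature: left half of the subwindow minus its right half."""
    half = width // 2
    return (box_sum(ii, top, left, height, half)
            - box_sum(ii, top, left + half, height, half))
```

Once the table is built in a single pass, every same-sized subwindow visited by the image scanner can be evaluated at constant cost per rectangle, which is what makes scanning the whole image practical.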
The disclosed method assumes that a stream of detected faces is fed to the system, where face detection is performed by a machine learning-based face detection method similar to the methods disclosed in Viola and Schneiderman.
There have been prior attempts for detecting and localizing facial features from facial images for the purpose of further facial image analysis.
U.S. Pat. No. 5,781,650 of Lobo, et al. (hereinafter Lobo) disclosed a method for automatically finding facial images of a human face in a digital image, and classifying the age of the person into an age category. Step 1 of the process is to find facial features of the digital image encompassing the chin, sides of the face, and the virtual top of the head, eyes, mouth and nose of the image. Step 2 is to compute the facial feature ratios of the facial features found in Step 1. Step 3 is to compute a wrinkle analysis of the image. Step 4 is to combine the previous two steps to categorize the age of the facial image. The invention can locate and detect facial images for age classification from digital camera images and computerized generated images.
U.S. Pat. No. 5,852,669 of Eleftheriadis, et al. (hereinafter Eleftheriadis) disclosed a method that responds to a video signal representing a succession of frames, where at least one of the frames corresponds to an image of an object, to detect at least a region of the object. The method includes a processor for processing the video signal to detect at least the region of the object characterized by at least a portion of a closed curve and to generate a plurality of parameters associated with the closed curve for use in coding the video signal.
U.S. Pat. No. 6,219,639 of Bakis, et al. (hereinafter Bakis) disclosed a method for recognizing an individual based on attributes associated with the individual comprising the steps of: pre-storing at least two distinctive attributes of the individual during at least one enrollment session; contemporaneously extracting the at least two distinctive attributes from the individual during a common recognition session; segmenting the pre-stored attributes and the extracted attributes according to a sequence of segmentation units; indexing the segmented pre-stored and extracted attributes so that the segmented pre-stored and extracted attributes corresponding to an identical segmentation unit in the sequence of segmentation units are associated to an identical index; and respectively comparing the segmented pre-stored and extracted attributes associated to the identical index to each other to recognize the individual.
U.S. Pat. No. 7,058,209 of Chen, et al. (hereinafter Chen) disclosed a digital image processing method that detects facial features in a digital image. This method includes the steps of detecting iris pixels in the image, clustering the iris pixels, and selecting at least one of the following schemes to identify eye positions: applying geometric reasoning to detect eye positions using the iris pixel clusters; applying a summation of squared difference method using the iris pixel clusters to detect eye positions; and applying a summation of squared difference method to detect eye positions from the pixels in the image. The method applied to identify eye positions is selected on the basis of the number of iris pixel clusters, and the facial features are located using the identified eye positions.
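The summation of squared difference criterion named in Chen can be sketched as an exhaustive template search. The sketch corresponds most closely to the third scheme (searching the pixels of the image directly) and assumes a gray-level image and an eye template given as lists of rows; the template itself and the function names are hypothetical.

```python
from typing import List, Optional, Tuple


def ssd(patch: List[List[float]], template: List[List[float]]) -> float:
    """Summation of squared differences between two equally sized patches."""
    return sum((p - t) ** 2
               for prow, trow in zip(patch, template)
               for p, t in zip(prow, trow))


def best_match(image: List[List[float]],
               template: List[List[float]]) -> Optional[Tuple[float, int, int]]:
    """Return (score, row, col) of the window with the lowest SSD, or None."""
    th, tw = len(template), len(template[0])
    best = None
    for y in range(len(image) - th + 1):
        for x in range(len(image[0]) - tw + 1):
            patch = [row[x:x + tw] for row in image[y:y + th]]
            score = ssd(patch, template)
            if best is None or score < best[0]:
                best = (score, y, x)
    return best
```

Restricting the candidate positions to the iris pixel clusters, as in the second scheme, would only change which (y, x) positions are visited.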
U.S. Pat. Appl. Pub. No. 2005/0041867 of Loy, et al. (hereinafter Loy) disclosed a method of utilizing a computer system to automatically detect the location of a face within a series of images, the method comprising the steps of: detecting eye like regions within the series of images; utilizing the eye like regions to extract potential face regions within the series of images; enhancing the facial features in the extracted potential face regions; classifying the features; and verifying the face topology within the potential face regions.
In Lobo, the facial feature detection is performed on close-range, high-resolution frontal face images to extract features for age classification. In Eleftheriadis, the facial feature detection is used for image compression, by employing an edge- and model-based scheme. In Bakis, the lip contour registration is performed for the purpose of multi-modal speaker recognition or verification. In Chen, eyes are detected and localized in a human face, based on the iris color signature and the cluster analysis of the iris color pixels. In Loy, eye candidates are detected first using a geometric model of eye images. Based on the eye candidate locations, the facial region is detected, and other facial regions are detected and verified using geometric reasoning (facial feature topology).
In all of the mentioned prior inventions, either high-resolution facial images or good quality color facial images are required to reliably detect facial features. The success of these approaches also depends on successful face detection or initial feature (mostly eye) detection. In the proposed invention, robust face localization based on a large number of samples is performed after machine learning-based face detection. The facial features are accurately localized within already roughly localized facial feature windows, again using learning machines, each trained to localize only a given facial feature. The present method does not require high-resolution images or color information; it works with either gray-level or color images, and works under various imaging conditions due to the training with a large number of images taken under various imaging conditions.
There have been prior attempts for analyzing the skin tone of humans for the purpose of segmenting out facial or skin regions.
U.S. Pat. No. 5,488,429 of Kojima, et al. (hereinafter Kojima) disclosed a method where a flesh-tone area is detected based on color-difference and luminance signals constituting video signals, and luminance correction, color correction, and aperture correction are performed only on the flesh-tone area or a human face area identified in the flesh-tone area. The setting of a focus area or the setting of a photometric area for iris control, automatic gain control, automatic shutter control, etc., in a video camera, is performed with respect to the flesh-tone area or the human face area. Furthermore, based on the color-difference and luminance signals constituting the video signals, a background area is detected, and the video signals are divided into components representing a background area and components representing an object area. An image signal of a desired hue or a still image is superimposed on the detected background area, or special processing is performed on the video signals representing the object area other than the detected background area.
U.S. Pat. No. 6,711,286 of Chen, et al. (hereinafter Chen-1) disclosed a computer vision/image processing method of removing blond hair color pixels in digital image skin detection for a variety of imaging related applications, such as redeye defects detection. It employs a combination of skin detectors operating in a generalized RGB space in combination with a hue space derived from the original image space to detect skin pixels and blond hair pixels within the skin pixels.
U.S. Pat. No. 6,690,822 of Chen, et al. (hereinafter Chen-2) disclosed a method for detecting skin color in a digital image having pixels in an RGB color space. The method generally includes the steps of performing statistical analysis of the digital color image to determine the mean RGB color values; then, if the mean value of any one of the colors is below a predetermined threshold, applying a transformation to the digital image to move skin colors in the image toward a predetermined region of the color space; and employing the transformed space to locate the skin color pixels in the digital color image. More specifically, if the mean value of any one of the colors is below a predetermined threshold, a non-linear transformation is applied to the digital image to move skin colors in the image toward a predetermined region of the color space. Then, depending on the preceding step, either the digital image or the transformed digital image is converted from the RGB space to a generalized RGB space to produce a gRGB digital image; skin color pixels are detected.
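The sequence of steps described for Chen-2 can be sketched as follows. Several points are assumptions made only for illustration: the generalized RGB space is taken to be the usual chromaticity normalization (each channel divided by R+G+B), a gamma curve stands in for the unspecified non-linear transformation, and the darkness threshold and skin chromaticity ranges are hypothetical values, not figures from the cited patent.

```python
from typing import List, Sequence, Tuple

Pixel = Tuple[float, float, float]


def mean_rgb(pixels: Sequence[Pixel]) -> Tuple[float, ...]:
    """Mean R, G, and B values over the image."""
    n = len(pixels)
    return tuple(sum(p[c] for p in pixels) / n for c in range(3))


def brighten(pixels: Sequence[Pixel], gamma: float = 0.5) -> List[Pixel]:
    """ASSUMPTION: a gamma curve stands in for the non-linear transformation."""
    return [tuple(255.0 * (v / 255.0) ** gamma for v in p) for p in pixels]


def to_chromaticity(pixel: Pixel) -> Pixel:
    """ASSUMPTION: generalized RGB as channel / (R + G + B)."""
    r, g, b = pixel
    total = r + g + b
    if total == 0:
        return (1 / 3, 1 / 3, 1 / 3)
    return (r / total, g / total, b / total)


def detect_skin(pixels: Sequence[Pixel],
                dark_threshold: float = 60.0,
                r_range: Tuple[float, float] = (0.36, 0.55),
                g_range: Tuple[float, float] = (0.28, 0.36)) -> List[bool]:
    """Transform dark images first, then threshold in the normalized space."""
    if min(mean_rgb(pixels)) < dark_threshold:
        pixels = brighten(pixels)
    mask = []
    for p in pixels:
        r, g, _ = to_chromaticity(p)
        mask.append(r_range[0] <= r <= r_range[1] and g_range[0] <= g <= g_range[1])
    return mask
```

Normalizing by the channel sum removes most of the brightness variation, so a fixed region of the normalized space can serve as the skin color region across differently exposed images.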
U.S. Pat. Appl. Pub. No. 2006/0066912 of Kagaya (hereinafter Kagaya) disclosed a method for skin tone detection, where skin tone image portion contained in an image is detected based upon the shape of the image of a human face. An average value of each of RGB values of pixels that constitute the skin tone image portion detected is calculated. If the distance between a skin tone-blackbody locus and a value that is the result of converting the RGB values obtained by multiplying the average value by prescribed coefficients is less than a prescribed value, these coefficients are adopted as coefficients for multiplying the RGB values of each pixel constituting the image. By using a value that is the result of conversion to a chromaticity value, the RGB values obtained by multiplying the RGB values of each of the pixels constituting the image by the prescribed coefficients, those pixels of the image that have values belonging to a zone in the vicinity of a point on a gray-blackbody locus that corresponds to a light-source color temperature estimated based upon the skin tone-blackbody locus are treated as gray candidate pixels.
Kojima, Chen-1, Chen-2, and Kagaya all aim to separate out the skin tone region in color space using various color spaces and color space transforms. The present invention uses a similar scheme to detect skin tone, utilizing a combination of both linear and nonlinear mappings in color space. In Chen-1, the method explicitly uses hue-based blonde hair tone removal to segment the facial skin tone region so that it does not include the blonde hair region. In the present invention, the separation of skin tone and hair tone (which includes all human hair tones, not just blonde) is carried out using linear discriminant analysis on sampled skin and hair tones.
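The linear discriminant analysis on sampled skin and hair tones can be illustrated with a two-class Fisher discriminant. The sketch assumes that the samples are RGB triples and that the decision threshold lies midway between the two class means; the small ridge term, the choice of color space, and all function names are assumptions for illustration and do not reproduce the mapping actually used.

```python
from typing import List, Sequence, Tuple

Color = Sequence[float]


def mean_vec(samples: Sequence[Color]) -> List[float]:
    n = len(samples)
    return [sum(s[i] for s in samples) / n for i in range(3)]


def scatter(samples: Sequence[Color], mean: Sequence[float]) -> List[List[float]]:
    """3 x 3 scatter matrix of the samples around their mean."""
    s = [[0.0] * 3 for _ in range(3)]
    for x in samples:
        d = [x[i] - mean[i] for i in range(3)]
        for i in range(3):
            for j in range(3):
                s[i][j] += d[i] * d[j]
    return s


def solve3(a: List[List[float]], b: List[float]) -> List[float]:
    """Solve a 3 x 3 linear system by Gaussian elimination with partial pivoting."""
    n = 3
    m = [row[:] + [bv] for row, bv in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x


def fit_lda(skin: Sequence[Color], hair: Sequence[Color],
            ridge: float = 1e-3) -> Tuple[List[float], float]:
    """Fisher direction w = Sw^-1 (m_skin - m_hair) and a midpoint bias."""
    ms, mh = mean_vec(skin), mean_vec(hair)
    ss, sh = scatter(skin, ms), scatter(hair, mh)
    # ASSUMPTION: a small ridge keeps the within-class scatter invertible.
    sw = [[ss[i][j] + sh[i][j] + (ridge if i == j else 0.0) for j in range(3)]
          for i in range(3)]
    w = solve3(sw, [ms[i] - mh[i] for i in range(3)])
    mid = [(ms[i] + mh[i]) / 2 for i in range(3)]
    bias = -sum(w[i] * mid[i] for i in range(3))
    return w, bias


def is_skin(pixel: Color, w: Sequence[float], bias: float) -> bool:
    """A pixel is labeled skin when its projection falls on the skin side."""
    return sum(w[i] * pixel[i] for i in range(3)) + bias > 0
```

Because the discriminant is learned from samples of both classes, it is not tied to one hair color the way a hue-based blonde hair filter is.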
There have been prior attempts for recognizing the gender (or more generally, demographics) category of a person by processing facial images.
U.S. Pat. No. 6,990,217 of Moghaddam, et al. (hereinafter Moghaddam) disclosed a method that employs Support Vector Machines (SVMs) to classify images of faces according to gender, by training on images that include male and female faces; determining a plurality of support vectors from the training images for identifying a hyperplane for the gender decision; and reducing the resolution of the training images and the test image by subsampling before supplying the images to the Support Vector Machine.
U.S. patent application Ser. No. 10/972,316 of Agrawal, et al. (hereinafter Agrawal) disclosed a system and method for automatically extracting the demographic information from images. The system detects the face in an image, locates different components, extracts component features, and then classifies the components to identify the age, gender, or ethnicity of the person(s) in the image. Using components for demographic classification gives better results as compared to currently known techniques. Moreover, the described system and technique can be used to extract demographic information in a more robust manner than currently known methods in environments where a high degree of variability in size, shape, color, texture, pose, and occlusion exists. This invention also performs classifier fusion using Data Level fusion and Multi-level classification for fusing results of various component demographic classifiers. Besides use as an automated data collection system wherein, given the necessary facial information as the data, the demographic category of the person is determined automatically, the system could also be used for the targeting of advertisements, surveillance, human computer interaction, security enhancements, immersive computer games, and improving user interfaces based on demographic information.
U.S. patent application Ser. No. 11/811,614 of Moon, et al. (hereinafter Moon) disclosed a face-based automatic demographics classification system that is robust to pose changes of the target faces and to accidental scene variables, such as noise, lighting, and occlusion, by using a pose-independent facial image representation which is comprised of multiple pose-dependent facial appearance models. Given a sequence of people's faces in a scene, the two-dimensional variations, such as position error, size, and in-plane orientation, are estimated and corrected using a novel machine learning based method. The system also estimates the three-dimensional pose of the people, using a conceptually similar machine learning based approach. Then the face tracking module keeps the identity of the person using geometric and appearance cues of the person, where multiple appearance models are built based on the poses of the faces. The separate processing of each pose makes the appearance model building more accurate so that the tracking performance becomes more accurate. Each separately built pose-dependent facial appearance model is fed to the demographics classifier, which is again trained using only the faces having the corresponding pose. The classification scores from the set of pose-dependent classifiers are aggregated to determine the final face category, such as gender, age, and ethnicity of the person, etc.
“A Method of Gender Classification by Integrating Facial, Hairstyle, and Clothing Images,” in the Proceedings of the 17th International Conference on Pattern Recognition, 2004, by Ueki, et al. (hereinafter Ueki) disclosed a method of gender classification by integrating facial, hairstyle, and clothing images. Initially, input images are separated into facial, hairstyle, and clothing regions, and independently learned PCAs and GMMs based on thousands of sample images are applied to each region. The classification results are then integrated into a single score using some known priors based on the Bayes rule. The reported experimental results showed that the integration strategy significantly reduced the error rate in gender classification compared with the conventional facial-only approach.
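The Bayes rule integration described for Ueki can be sketched as follows. The sketch assumes that each region supplies a pair of class-conditional likelihoods (for example, from its own PCA and GMM model) and that the regions are conditionally independent given the gender; this independence assumption, the equal default prior, and the function name are illustrative and are not taken from the cited paper.

```python
import math
from typing import Dict, Tuple


def fuse_posterior(likelihoods: Dict[str, Tuple[float, float]],
                   prior_male: float = 0.5) -> float:
    """Return P(male | all regions) from per-region (p_male, p_female) likelihoods."""
    log_m = math.log(prior_male)
    log_f = math.log(1.0 - prior_male)
    for p_male, p_female in likelihoods.values():
        log_m += math.log(p_male)    # ASSUMPTION: regions are conditionally independent
        log_f += math.log(p_female)
    top = max(log_m, log_f)          # subtract the maximum for numerical stability
    e_m, e_f = math.exp(log_m - top), math.exp(log_f - top)
    return e_m / (e_m + e_f)
```

For instance, a facial score that weakly favors one class can be overturned by a hairstyle score that strongly favors the other, since the fused posterior multiplies the evidence from all regions.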
The approach by Moghaddam addresses the problem of gender recognition by training an SVM using a large number of image instances to make use of image features that distinguish male from female. However, it uses a holistic image, which implicitly contains both the appearance and geometric features, while the present invention decouples the appearance and geometry information so that the separated information can be compared explicitly. In Agrawal, the gender recognition (or demographics recognition in general) is based on comparing individual features, consisting of the indexed and localized feature images and their relative positions. The present invention takes a very similar approach. However, Agrawal does not suggest any automated means for detecting and localizing facial features. The present invention makes use of both automatic face localization and automatic facial feature localization so that the whole process of face detection, localization, facial feature detection and localization, and feature extraction can be performed without any human intervention. In addition, the present invention makes use of hairstyle information, which provides a very useful clue for gender recognition, by performing automated hair-skin separation based on color space analysis. In Moon, a series of geometric estimations for face localization, three-dimensional facial pose estimation, and face tracking and appearance model building are performed to conduct pose-independent demographics classification; the approach focuses more on dealing with pose and depends on using a holistic facial appearance model. While the present invention performs face localization similar to Moon, it also performs explicit facial feature localization to decouple facial feature geometry and facial feature appearance to achieve accurate gender recognition. In Ueki, the gender-specific dress code and hairstyle are exploited for gender recognition, in addition to facial image features.
The use of hair features is shared by the present invention. However, in Ueki, the hairstyle extraction is simplified and based on gray-level images due to the dark complexion of the specific ethnicity group, while in the present invention the hair region segmentation can deal with any kind of skin tone and hair color.
In summary, the present invention provides a fully automatic face localization, facial feature localization, and feature extraction approach for accurate facial feature-based gender recognition, unlike some of the approaches that use manually extracted features. The facial feature localization approach is different from the prior approaches in that it uses multiple learning machines, each tuned to a specific geometry of facial features, and can robustly detect and localize facial features under harsh imaging conditions. The basic approach is different from most demographics classification approaches in that the recognition is based on separate geometric and appearance features. It also makes use of automatically extracted hairstyle information using skin tone-hair tone discriminant analysis, which is very specific to gender recognition.