A neutral, expressionless face is a relaxed face without contraction of the facial muscles and without facial movement. It is the state of a person's face most of the time. The appearance of a neutral face is needed by all existing automated facial expression analysis systems. That is, to classify a facial expression, a generic neutral expressionless face appearance is needed, and it is provided by a human operator. Facial expression classification then, in general, has three stages: (i) face detection and normalization; (ii) facial feature extraction and representation; and (iii) comparison of the feature representation to a feature representation of the hand-annotated neutral face appearance. In addition, the performance of face-based person authentication systems can be much improved by enrolling and authenticating neutral faces rather than faces with dramatic expressions.
Face detection and normalization are frequently used techniques in the general area of image and video processing. Face detection is the first step of many face recognition systems. It is also the first step in facial expression analysis for, say, human-computer interaction. A face detection system finds the positions and scales of the faces in images and videos. A robust face detector flexibly and reliably detects the face in the image or video, regardless of lighting conditions, background clutter, the presence of multiple faces, and variations in face position, scale, pose, and expression.
The accurate detection of human faces in arbitrary scenes is the most important step involved. Face component templates, skin color, contours, eigenfaces (U.S. Pat. No. 5,164,992 to Turk and Pentland), and other features can be used for face detection. Many face detectors have been developed in the past 20 years. Example algorithms for locating faces in images can be found in (Sung and Poggio) and (Rowley, Baluja, and Kanade).
Kah-Kay Sung and T. Poggio. Learning human face detection in cluttered scenes. In Computer Analysis of Images and Patterns, pages 432-439, 1995. (Sung and Poggio)
Henry A. Rowley, Shumeet Baluja, and T. Kanade. Human face detection in visual scenes. Technical Report CMU-CS-95-158, School of Computer Science, CMU, Pittsburgh, Pa., July 1995. (Rowley, Baluja, and Kanade)
These references are incorporated by reference in their entirety.
Face normalization is often a necessary preprocessing step for face recognition and facial expression analysis. Generally, face appearance images encompass a great deal of variance in position, scale, and pose because of body and/or head motion, and in lighting because of changes in the environment. Thus, it is necessary to compensate or normalize a face for position, pose, scale, and illumination so that the variance due to these causes is minimized.
Furthermore, changes in expression and facial detail result in changes in the face appearance images, and these changes must also be compensated for.
After the face detection and localization stage comes the face normalization stage. Here the eyes, the nose, or the mouth are identified using direct image processing techniques (such as template matching, see below). Assume for now that the line segment between the eyes is known and that the exact location of the nose tip is available. Detecting the locations of these feature points (eyes, nose, and mouth) gives an estimate of the pose of the face. Once the 2D pose or the 3D position and orientation of the face is known, it is possible to reverse the effect of translation and rotation and to synthesize a standardized, frontal view of the individual. Furthermore, the positions of the feature points allow for a rough segmentation of the contour of the face so that distracting background information can be discarded. Once the face is segmented, a color histogram of the face alone can be computed to compensate for lighting changes in the image by transforming the color histogram to some canonical form.
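By way of a non-limiting illustration, the 2D part of this normalization can be sketched as a similarity transform (rotation, uniform scale, and translation) computed from the line segment between the eyes. The function names and the canonical eye coordinates below are assumptions chosen only for the example; 3D pose recovery, segmentation, and lighting compensation are not covered.

```python
import math

def similarity_from_eyes(left_eye, right_eye,
                         target_left=(30.0, 40.0), target_right=(70.0, 40.0)):
    """Return (a, b, tx, ty) such that
         x' = a*x - b*y + tx
         y' = b*x + a*y + ty
    maps the detected eye centers onto the (hypothetical) canonical ones."""
    # Treat 2D points as complex numbers: a similarity transform is z' = m*z + t.
    src_l, src_r = complex(*left_eye), complex(*right_eye)
    dst_l, dst_r = complex(*target_left), complex(*target_right)
    if src_l == src_r:
        raise ValueError("eye centers must be distinct")
    m = (dst_r - dst_l) / (src_r - src_l)  # rotation and uniform scale
    t = dst_l - m * src_l                  # translation
    return m.real, m.imag, t.real, t.imag

def apply_similarity(params, point):
    """Map one (x, y) image point into the normalized coordinate frame."""
    a, b, tx, ty = params
    x, y = point
    return (a * x - b * y + tx, b * x + a * y + ty)

# Usage: eyes detected at (100, 120) and (140, 150) in the input image.
params = similarity_from_eyes((100.0, 120.0), (140.0, 150.0))
scale = math.hypot(params[0], params[1])                 # uniform scale factor
angle = math.degrees(math.atan2(params[1], params[0]))   # rotation in degrees
```

Every pixel of the face region can be mapped with `apply_similarity` (or with the inverse transform, when resampling) so that the eyes of all normalized faces land on the same coordinates.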
If faces can be exactly detected and located in the scene, the techniques for face authentication, face recognition, or facial expression analysis can be readily applied to the detected faces. Face authentication systems verify the identity of particular people in real-time (e.g., in a security monitoring system, location tracking system, etc.), or allow access to some resource to a selected group of enrolled people and deny access to all others (e.g., access to a building, computer, etc.). Multiple images per person are often available for training and real-time identification is, of course, a necessity.
Compared to the problem of face authentication, face recognition/identification is a much more complex problem. Given an image of a human face, a face recognition system compares the face appearance to models or representations of faces in a (possibly) large database of identities (e.g., in a police database of mugshots) and reports the identity of the face if a match exists. These systems typically return a list of the most likely matches in the database. Often only one image is available per person. For forensic applications like mugshot searches, it is usually not necessary for face identification to be done in real-time. For background checks at points of entry or exit such as airports, however, immediate responses are required.
The techniques for face identification can be categorized as either feature-based (geometric) or template-based/appearance-based (photometric), where the latter has proven more successful. Template-based or appearance-based methods use measures of facial similarity based on standard Euclidean error norms (that is, template matching) or subspace-restricted error norms (e.g., weighted eigenspace matching), see U.S. Pat. No. 5,164,992 to Turk and Pentland. The latter technique of “eigenfaces” has in the past decade become the “gold standard” to which other algorithms are often compared.
Facial expressions are one of the most powerful, natural, and immediate means by which human beings communicate their emotions and intentions. The human face expresses emotions faster than people verbalize or even realize their feelings. Many psychologists have studied human emotions and facial expressions and have found that the same expression might have radically different meanings in different cultures. However, it is accepted by 20th century psychologists that six universal expressions (i.e., happiness, sadness, disgust, anger, surprise, and fear) do not vary much across cultures. In addition, Ekman and Friesen have developed a Facial Action Coding System (FACS) to describe facial behavior in terms of its constituent muscle actions. The details of FACS can be found in (Ekman & Friesen).
P. Ekman and W. V. Friesen, Facial Action Coding System: A Technique for the Measurement of Facial Movement. Palo Alto, Calif.: Consulting Psychologists Press, 1978. (Ekman & Friesen)
This reference is incorporated by reference in its entirety.
In the past decade, much progress has been made in building computer systems that understand and use this natural form of human communication for human-computer interaction. Most facial expression analysis systems focus only on the six universal expressions. Recently, some researchers have been working on more subtle facial expression movements based on the Facial Action Coding System of Ekman and Friesen. Facial expression analysis systems have applications in retail environments (happy and unhappy customers), human-computer interaction (e.g., the computer reacts to the user's frame of mind), lie detection, surveillance, and image retrieval.
Facial feature extraction and building a face representation are important aspects of the field of processing images and video that contain faces. Multiscale filters, which operate at multiple levels of resolution, are used to obtain the pre-attentive features (features such as edges and small regions) of objects. Based on these features, different structural face models have been investigated to locate the face and facial features, such as the eyes, nose, and mouth. The structural models are used to characterize the geometric pattern of the facial components. These models, which are texture and feature models, are used to verify the face candidate regions detected by simpler image processing operations. Since the eyeballs (or pupils) are the only features that are salient and have a strong invariance property, the distance between them is often used to normalize face appearances for recognition purposes. Motivated by this fact, with the face detected and the structural information extracted, a precise eye localization algorithm is applied using contour and region information. Such an algorithm detects, ideally with sub-pixel precision, the center and the radius of the eyeballs in the face image. The localized eyes can then be used for an accurate normalization of images, which greatly reduces the number of possible scales that need to be used during the face recognition process. The work by Kanade (Kanade) was the first to present an automatic feature extraction method based on ratios of distances and reported a recognition rate of between 45% and 75% on a database of 20 people.
T. Kanade, “Picture Processing by Computer Complex and Recognition of Human Faces,” PhD Thesis, Kyoto University, 1973. (Kanade)
This reference is incorporated by reference in its entirety.
Different facial features have been used for facial image processing systems, for example, face characteristic points, face components, edges, eigenfaces (U.S. Pat. No. 5,164,992 to Turk and Pentland), histograms, and so on.
Face characteristic points are the locations of face components, for example, the inner corners of the eyebrows, the inner corners of the eyes, the outer corners of the eyes, the center of the nose, and the lip corners.
Edge detection refers to a class of technologies that identify sharp discontinuities in the intensity profile of images. Edge detectors are operators that compute differences between pairs of neighboring pixels. Pixels with high responses to these operators are then identified as edge pixels. Edge maps can be computed in a single scan through the image. Examples of edge detectors are the Gradient- and Laplacian-type edge finders and edge templates such as Sobel.
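As an illustrative sketch only, an edge map based on the Sobel templates can be computed in a single scan as follows. The image is assumed to be a grayscale list of rows, the gradient magnitude is approximated by |gx| + |gy|, border pixels are simply left unmarked, and the threshold value is an assumption that would be tuned per application.

```python
def sobel_edges(image, threshold):
    """Return a binary edge map (rows of 0/1) for a grayscale image
    given as a list of equally long rows of numbers."""
    h, w = len(image), len(image[0])
    gx_k = ((-1, 0, 1), (-2, 0, 2), (-1, 0, 1))    # horizontal differences
    gy_k = ((-1, -2, -1), (0, 0, 0), (1, 2, 1))    # vertical differences
    edges = [[0] * w for _ in range(h)]
    for y in range(1, h - 1):            # border pixels are left at 0
        for x in range(1, w - 1):
            gx = gy = 0
            for dy in range(3):
                for dx in range(3):
                    p = image[y + dy - 1][x + dx - 1]
                    gx += gx_k[dy][dx] * p
                    gy += gy_k[dy][dx] * p
            # A high operator response marks the pixel as an edge pixel.
            if abs(gx) + abs(gy) >= threshold:
                edges[y][x] = 1
    return edges

# Usage: a dark left half next to a bright right half gives a vertical edge.
image = [[0, 0, 0, 100, 100, 100] for _ in range(5)]
edge_map = sobel_edges(image, threshold=200)
```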
Gradient- and Laplacian-type edge finders and edge templates are described more fully in D. Ballard and C. Brown, Computer Vision, Prentice-Hall: N.J., 1982, pages 75-80. (Ballard and Brown a).
A histogram is common terminology for a uni-variate (i.e., one-variable) distribution, or, better said, a probability mass distribution. That is, a histogram accumulates the relative frequencies of values of this variable in a one-dimensional array. Several types of histograms can be constructed: categorical, continuous, difference, and comparative. Details of each type of histogram can be found in M. Swain and D. Ballard, “Color indexing,” International Journal of Computer Vision, Vol. 7, No. 1, pp. 11-32, 1991. This reference is incorporated by reference in its entirety.
To determine a histogram for a set of values measured on a continuous scale, divide the range (the scale) between the highest and lowest value into several bins of equal size. Then, for each value in the set, increment by 1 the bin into which the quantized value falls. (Each quantized value is associated with one of the histogram bins.) The number in each bin of this frequency histogram represents the number of values in the original set that were quantized to that bin.
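The binning procedure just described can be sketched, for illustration only, as follows. The function names are assumptions, and the highest value is assigned to the last bin so that every value falls into exactly one bin.

```python
def equal_width_histogram(values, num_bins):
    """Count how many values fall into each of num_bins equal-size bins
    spanning the range from the lowest to the highest value."""
    if not values or num_bins < 1:
        raise ValueError("need at least one value and one bin")
    lo, hi = min(values), max(values)
    counts = [0] * num_bins
    if hi == lo:                       # degenerate range: one occupied bin
        counts[0] = len(values)
        return counts
    width = (hi - lo) / num_bins
    for v in values:
        # Quantize the value to a bin index; the maximum goes in the last bin.
        idx = min(int((v - lo) / width), num_bins - 1)
        counts[idx] += 1
    return counts

def relative_frequencies(counts):
    """Turn bin counts into relative frequencies that sum to 1."""
    total = sum(counts)
    return [c / total for c in counts]

# Usage: eleven values from 0 to 10 quantized into five bins of width 2.
counts = equal_width_histogram(list(range(11)), 5)   # [2, 2, 2, 2, 3]
freqs = relative_frequencies(counts)
```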
Template matching is a general method for localizing and/or recognizing objects. In template matching, a template image represents the object, which is to be located in one or more target images. This is achieved by matching the template image to all (or many) of the possible locations at which it could appear in the target image. A distance function (typically a simple Euclidean distance) is applied to the template and the image portion covered by the template to measure the similarity of the template and the image at a given location. The matching algorithm then picks the location with the smallest distance as the location of the template image in the target image.
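For illustration only, the basic exhaustive algorithm can be sketched as follows. Images are assumed to be grayscale lists of rows, and the squared Euclidean distance is used because it selects the same location as the Euclidean distance while avoiding the square root.

```python
def match_template(image, template):
    """Slide the template over every possible location in the image and
    return ((x, y), distance) for the location with the smallest squared
    Euclidean distance. (x, y) is the top-left corner of the match."""
    ih, iw = len(image), len(image[0])
    th, tw = len(template), len(template[0])
    best_pos, best_dist = None, float("inf")
    for y in range(ih - th + 1):
        for x in range(iw - tw + 1):
            dist = 0
            for r in range(th):
                for c in range(tw):
                    d = image[y + r][x + c] - template[r][c]
                    dist += d * d
            if dist < best_dist:
                best_pos, best_dist = (x, y), dist
    return best_pos, best_dist

# Usage: a 2x2 pattern placed at column 2, row 1 of an otherwise empty image.
image = [[0] * 5 for _ in range(4)]
image[1][2], image[1][3] = 9, 8
image[2][2], image[2][3] = 7, 6
position, distance = match_template(image, [[9, 8], [7, 6]])   # ((2, 1), 0)
```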
There are several variations to this basic algorithm. A first one is the use of more sophisticated distance functions. This may be necessary for images that have a different overall brightness than the template image or that have varying brightness. Another set of variations attempts to reduce the number of possible locations that are actually matched. One such method is to use image pyramids. Another method is to match only every few pixels and then, for promising match locations, to attempt to match all the pixels in the neighborhood.
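The last variation, matching only every few pixels and then refining around promising locations, might be sketched as follows. This is a heuristic illustration under assumed parameters (`step` for the coarse stride and `keep` for the number of promising locations retained); unlike the exhaustive search, it can miss the best location when the image content changes abruptly between sampled positions.

```python
def ssd(image, template, x, y):
    """Squared Euclidean distance between the template and the image
    portion whose top-left corner is at (x, y)."""
    return sum((image[y + r][x + c] - template[r][c]) ** 2
               for r in range(len(template))
               for c in range(len(template[0])))

def coarse_to_fine_match(image, template, step=2, keep=3):
    """Match on a grid of every `step` pixels, then match all locations in
    the neighborhood of the `keep` most promising coarse locations."""
    max_y = len(image) - len(template)
    max_x = len(image[0]) - len(template[0])
    # Coarse pass: evaluate only every `step`-th location.
    coarse = sorted((ssd(image, template, x, y), x, y)
                    for y in range(0, max_y + 1, step)
                    for x in range(0, max_x + 1, step))
    best = None
    # Fine pass: visit the locations skipped around each promising candidate.
    for _, cx, cy in coarse[:keep]:
        for y in range(max(0, cy - step + 1), min(max_y, cy + step - 1) + 1):
            for x in range(max(0, cx - step + 1), min(max_x, cx + step - 1) + 1):
                d = ssd(image, template, x, y)
                if best is None or d < best[0]:
                    best = (d, x, y)
    return (best[1], best[2]), best[0]

# Usage: a smooth intensity ramp, with the template cut out at (3, 3).
image = [[10 * x + y for x in range(8)] for y in range(8)]
template = [row[3:6] for row in image[3:6]]
position, distance = coarse_to_fine_match(image, template)
```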
Template matching (often also referred to as correlation or normalized correlation) is described fully in D. Ballard and C. Brown, Computer Vision, Prentice-Hall: N.J., 1982, pp. 68-70. (Ballard and Brown b). This reference is incorporated by reference in its entirety.
Classifiers play an important role in the analysis of images and video of human faces. For example, one or more classifiers are used to classify the facial expression based on the extracted face features. To develop a procedure for identifying images or videos as belonging to particular classes or categories (or for any classification or pattern recognition task, for that matter), supervised learning technology can be based on decision trees, on logical rules, or on other mathematical techniques such as linear discriminant methods (including perceptrons, support vector machines, and related variants), nearest neighbor methods, Bayesian inference, neural networks, etc. We generically refer to the output of such supervised learning systems as classifiers.
Most classifiers require a training set consisting of labeled data, that is, representations of previously categorized media items (i.e., face appearances), to enable a computer to induce patterns that allow it to categorize hitherto unseen media items. Generally, there is also a test set, also consisting of labeled data, that is used to evaluate whatever specific categorization procedure is developed. In academic exercises, the test set is usually disjoint from the training set to compensate for the phenomenon of overfitting. In practice, it may be difficult to get large amounts of labeled data of high quality. If the labeled data set is small, the only way to get any useful results at all may be to use all the available data in both the training set and the test set.
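As a simple illustration of keeping the test set disjoint from the training set, labeled items can be shuffled and partitioned as sketched below. The function name, the 25% test fraction, and the fixed random seed are assumptions made only for the example.

```python
import random

def split_labeled_data(items, test_fraction=0.25, seed=0):
    """Shuffle labeled items and split them into disjoint training and
    test sets. Returns (training_set, test_set)."""
    if len(items) < 2:
        raise ValueError("need at least two labeled items to split")
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)   # seeded for repeatability
    n_test = int(round(len(shuffled) * test_fraction))
    n_test = min(max(1, n_test), len(shuffled) - 1)   # keep both sets non-empty
    return shuffled[n_test:], shuffled[:n_test]

# Usage: 20 hypothetical labeled face appearances, identified by index.
labeled = [(i, "neutral" if i % 2 == 0 else "expressive") for i in range(20)]
training_set, test_set = split_labeled_data(labeled)
```

When the labeled data set is too small to split, the text's fallback of reusing all data for both sets corresponds to skipping this step, at the cost of an optimistic evaluation.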
To apply standard approaches to supervised learning, the media segments (face appearances) in both the training set and the test set must be represented in terms of numbers derived from the face appearances, i.e., features. The relationship between features extracted for the purposes of supervised learning and the content of a face image/video has an important impact on the success of the enterprise, so it has to be addressed, but it is not part of supervised learning per se.
From these feature vectors, the computer induces classifiers based on patterns or properties that characterize when a face image/video belongs to a particular category. The term “pattern” is meant to be very general. These patterns or properties may be presented as rules, which may sometimes be easily understood by a human being, or in other, less accessible formats, such as a weight vector and threshold used to partition a vector space with a hyperplane. Exactly what constitutes a pattern or property in a classifier depends on the particular machine learning technology employed. To use a classifier to categorize incoming hitherto unseen media segments, the newly arriving data must not only be put into a format corresponding to the original format of the training data, but it must then undergo a further transformation based on the list of features extracted from the training data in the training phase, so that it finally possesses a representation as a feature vector that permits the presence or absence of the relevant patterns or properties to be determined.
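To make the weight-vector-and-threshold form of a pattern concrete, the following sketch induces a linear classifier with the classical perceptron rule, one of the linear discriminant methods mentioned above. The two features (mouth opening and eyebrow raise, each scaled to the range 0 to 1) and the toy training data are hypothetical and serve only to illustrate the mechanics.

```python
def classify(weights, threshold, features):
    """Return 1 if the feature vector lies on the positive side of the
    hyperplane defined by the weight vector and threshold, else 0."""
    score = sum(w * f for w, f in zip(weights, features))
    return 1 if score > threshold else 0

def train_perceptron(samples, labels, epochs=100, rate=1.0):
    """Induce a weight vector and threshold from labeled feature vectors."""
    weights = [0.0] * len(samples[0])
    threshold = 0.0
    for _ in range(epochs):
        errors = 0
        for x, y in zip(samples, labels):
            err = y - classify(weights, threshold, x)
            if err:
                # Move the hyperplane toward the misclassified sample.
                weights = [w + rate * err * xi for w, xi in zip(weights, x)]
                threshold -= rate * err
                errors += 1
        if errors == 0:        # every training sample is classified correctly
            break
    return weights, threshold

# Hypothetical feature vectors: (mouth opening, eyebrow raise).
samples = [(0.1, 0.0), (0.2, 0.1), (0.0, 0.2),    # label 0: neutral
           (0.9, 0.8), (0.8, 1.0), (1.0, 0.7)]    # label 1: expressive
labels = [0, 0, 0, 1, 1, 1]
weights, threshold = train_perceptron(samples, labels)
predictions = [classify(weights, threshold, x) for x in samples]
```

A newly arriving face appearance would first be converted to the same two-element feature vector and then passed to `classify`.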
Classifying in an automated fashion whether a face has a neutral expression is an important problem. The ability to detect whether a face image is expressionless has, in general, many applications since it eliminates one complicated degree of freedom, the facial expression, from the face image analysis process. The ability of a system to detect a neutral face further directly implies that the system has the capability to detect if there is a dramatic expression on a face.