1. Technical Field
This invention is directed toward a face detection system and process for detecting the presence of faces of people depicted in an input image, and more particularly to such a face detection system and process that also identifies the face pose of each detected face.
2. Background Art
Face detection systems essentially operate by scanning an image for regions having attributes which would indicate that a region contains a person's face. Current systems are very limited in that detection is only possible in regions associated with a frontal view of a person's face. In addition, current detection systems have difficulty in detecting face regions in images having different lighting conditions or faces at different scales than the system was initially designed to handle.
The problem of detecting the faces of people depicted in an image from the appearance of their face has been studied for many years. Face recognition systems and processes essentially operate by comparing some type of training images depicting people's faces (or representations thereof) to an image or representation of a person's face extracted from an input image. In the past, most of these systems required that both the original training images and the input image region be essentially frontal views of the person. This is limiting in that to obtain the input images containing a frontal view of the face of the person being identified, that person had to either be purposefully positioned in front of a camera, or a frontal view had to be found and extracted from a non-staged input image (assuming such a frontal view exists in the image).
More recently there have been attempts to build a face detection and recognition system that works with faces rotated out of plane. For example, one approach for recognizing faces under varying poses is the Active Appearance Model proposed by Cootes et al. [1], which deforms a generic 3-D face model to fit the input image and uses control parameters as a feature fed to a classifier. Another approach is based on transforming an input image into stored prototypical faces and then using direct template matching to recognize the person whose face is depicted in the input image. This method is explored in the papers by Beymer [2], Poggio [3] and Vetter [4].
It is noted that in the preceding paragraphs, as well as in the remainder of this specification, the description refers to various individual publications identified by a numeric designator contained within a pair of brackets. For example, such a reference may be identified by reciting, “reference [1]” or simply “[1]”. A listing of the publications corresponding to each designator can be found at the end of the Detailed Description section.
The present invention is directed toward a face detection system and process that overcomes the aforementioned limitations in prior face detection and recognition systems by making it possible to detect a person's face in input images containing either frontal or non-frontal views of the person's face, regardless of the scale or illumination conditions associated with the face. Thus, a non-staged image, such as a frame from a video camera monitoring a scene, can be searched to detect a region depicting the face of a person, without regard to whether the person is directly facing the camera. Essentially, as long as the person's face is visible in the image being searched, the present face detection system can be used to detect the location of the face in the image. To date there have not been any face detection systems that could detect a person's face in non-frontal views. In addition, the present invention can be used to not only detect a person's face, but also provide pose information. This pose information can be quite useful. For example, knowing which way a person is facing can be useful in user interface and interactive applications where a system would respond differently depending on where a person is looking. Having pose information can also be useful in making more accurate 3D reconstructions from images of the scene. For instance, knowing that a person is facing another person can indicate the first person is talking to the second person. This is useful in such applications as virtual meeting reconstructions.
Because the present face detection system and associated process can be used to detect both frontal and non-frontal views of a person's face, it is termed a pose-adaptive face detection system. For convenience in describing the system and process, the term “pose” will refer to the particular pitch, roll and yaw angles that describe the position of a person's head (where the 0 degree pitch, roll and yaw position corresponds to a person facing the camera with their face centered about the camera's optical axis).
The pose-adaptive face detection system and process must first be trained before it can detect face regions in an input image. This training phase generally involves first capturing images of the faces of a plurality of people. As will be explained later, the captured face images will be used to train a series of Support Vector Machines (SVMs). Each SVM will be dedicated to a particular face pose, or more precisely a pose range. Accordingly, the captured face images should depict people having a variety of face poses. Only those face images depicting a person with a face pose that falls within the particular pose range of an SVM will be used to train that SVM. It is noted that the more diverse the training face images are, the more accurate the detecting capability of the SVM will become. Thus, it is preferred that the face images depict people who are not generally too similar in appearance. The training images can be captured in a variety of ways. One preferred method would involve positioning a subject in front of a video camera and capturing images (i.e., video frames) as the subject moves his or her head in a prescribed manner.
The captured face images are preprocessed to prepare them for input into the appropriate SVM. In general, this will involve normalizing, cropping, categorizing and finally abstracting the face images. Normalizing the training images preferably entails normalizing the scale of the images by resizing the images. It is noted that this action could be skipped if the images are captured at the desired scale, thus eliminating the need for resizing. The desired scale for the face images is approximately the size of the smallest face region expected to be found in the input images that are to be searched. In a tested embodiment of the present invention, an image size of about 20 by 20 pixels was used with success. The image could additionally be normalized with regard to the eye locations within the image. In other words, each image would be adjusted so that the eye locations fell within a prescribed area. These normalization actions are performed so that the training images generally match one another in orientation and size. The images are also preferably cropped to eliminate unneeded portions which could contribute to noise in the upcoming abstraction process. It is noted that the training images could be cropped first and then normalized, if desired. It is also noted that a histogram equalization, or similar procedure, could be employed to reduce the effects of illumination differences in the images that could introduce noise into the detecting process.
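The scale normalization and histogram equalization actions described above can be illustrated with a minimal sketch. It assumes a grayscale image stored as a list of rows of integer intensities in the range 0 to 255; the function names `resize_nearest` and `equalize_histogram`, and the use of nearest-neighbor resampling, are illustrative assumptions and are not taken from the tested embodiment.

```python
def resize_nearest(image, out_h, out_w):
    """Resize a grayscale image (list of rows) by nearest-neighbor sampling.

    Assumption: nearest-neighbor is used only to keep the sketch short;
    the specification does not name a resampling method.
    """
    in_h, in_w = len(image), len(image[0])
    return [
        [image[r * in_h // out_h][c * in_w // out_w] for c in range(out_w)]
        for r in range(out_h)
    ]


def equalize_histogram(image, levels=256):
    """Return a histogram-equalized copy of an integer grayscale image."""
    pixels = [p for row in image for p in row]
    total = len(pixels)

    # Count how often each intensity level occurs.
    hist = [0] * levels
    for p in pixels:
        hist[p] += 1

    # Cumulative distribution of intensities.
    cdf, running = [], 0
    for count in hist:
        running += count
        cdf.append(running)

    cdf_min = next(c for c in cdf if c > 0)
    if total == cdf_min:
        # Constant image: there is no contrast to stretch.
        return [row[:] for row in image]

    # Map each level so the occupied intensities span the full range.
    scale = (levels - 1) / (total - cdf_min)
    lut = [max(0, round((c - cdf_min) * scale)) for c in cdf]
    return [[lut[p] for p in row] for row in image]
```

A training image would typically be cropped, passed through `resize_nearest(image, 20, 20)` to reach the 20 by 20 pixel size mentioned above, and then equalized.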
The next action in the training image preprocessing procedure involves categorizing the normalized and cropped images according to their pose. One preferred way of accomplishing this action is to group the images into a set of prescribed pose ranges. It is noted that the persons in the training images could be depicted with any combination of pitch, roll and yaw angles, as long as at least a portion of their face is visible. In such a case, the normalized and cropped images would be categorized into pose ranges defined by all three directional angles. The size of these pose ranges will depend on the application and the accuracy desired, but can be readily determined and optimized via conventional means.
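One possible way of grouping images into pose ranges is sketched below. The 30 degree bin width, the choice to center bin 0 on the 0 degree position, and the names `pose_range` and `group_by_pose` are assumptions made for illustration; as noted above, the actual range sizes depend on the application.

```python
import math
from collections import defaultdict


def pose_range(pitch, roll, yaw, bin_width=30.0):
    """Map pitch, roll and yaw (degrees) to a tuple of pose-range indices.

    Bin 0 covers [-bin_width/2, +bin_width/2) around the frontal pose.
    The bin width is a hypothetical value, not one given in the text.
    """
    def bin_index(angle):
        return math.floor((angle + bin_width / 2) / bin_width)

    return (bin_index(pitch), bin_index(roll), bin_index(yaw))


def group_by_pose(labeled_images, bin_width=30.0):
    """Group (image, pitch, roll, yaw) records by their pose range."""
    groups = defaultdict(list)
    for image, pitch, roll, yaw in labeled_images:
        groups[pose_range(pitch, roll, yaw, bin_width)].append(image)
    return dict(groups)
```

Each key of the returned dictionary would then correspond to one pose range, and therefore to one SVM.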
The abstracting procedure is essentially a method of representing the images in a simpler form to reduce the processing load associated with the SVM's detection operation. While many abstraction processes might be employed for this purpose (e.g., histogramming, Hausdorff distance, geometric hashing, active blobs, eigenface representations, and others), the preferred method entails the use of wavelet transforms, and particularly the use of three types of non-standard Haar wavelet transforms to represent each normalized, cropped, and categorized training face image, in a manner similar to that discussed in Oren [5]. Oren [5] discusses an image representation which captures the relationship between average intensities of neighboring image regions through the use of a family of basis functions, specifically Haar wavelets, which encode such relationships along different orientations. To this end, three types of 2-dimensional Haar wavelets are employed. These types include basis functions which capture change in intensity along the horizontal direction, the vertical direction and the diagonals (or corners). This Haar wavelet transform process is repeated to produce wavelet coefficients at two different scales, e.g., at 4×4 pixels and 2×2 pixels.
The result of the wavelet transform process is a series of coefficients. For each face pose range, a particular sub-set of these coefficients is selected to form a set of so-called feature vectors, although the same number of coefficients is used to make up each feature vector. It is noted that a different combination of coefficients may be needed to make up a feature vector associated with each pose range group. Thus, each training image is actually represented by a unique set of the computed feature vectors, one for each of the SVMs, tailored for each face pose range. Furthermore, all these feature vectors will be used to train the ensemble neural network.
This tailoring process begins by calculating the mean coefficients of all the training images. To this end, all the training images depicting a person exhibiting a face pose within a particular pose range are selected. The mean of all the horizontal wavelet coefficients associated with a particular pixel location of the selected training images that were computed under the first scale (e.g., 4×4), and the mean of all the horizontal wavelet coefficients associated with the pixel location that were computed under the other scale (e.g., 2×2), are calculated. The normalizing and cropping steps described previously will create training images that have the same size, and so the same number of corresponding pixel locations. This process is then repeated for each pixel location of the images. The process of computing the means of the wavelet coefficients associated with both scales for each pixel location is then repeated for the vertical wavelet coefficients and the diagonal wavelet coefficients. Thus, once all the coefficient means have been computed, there will be two average horizontal coefficients (i.e., one for each scale), as well as two average vertical and two average diagonal coefficients, associated with each pixel location of the training images. Those mean coefficients that have values outside a prescribed coefficient range are then identified. The pixel location and pedigree (i.e., direction and scale) of the identified mean coefficients are designated as the coefficients that will be used to form a feature vector for the particular pose range associated with the selected training images. The foregoing selection process is then repeated for the training images depicting faces exhibiting each of the remaining pose ranges. In this way a specific group of wavelet coefficients, as would be derived from any image undergoing the previously-described abstraction process, are identified for inclusion in a feature vector representing each of the pose ranges.
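The three wavelet types can be sketched as differences of box sums over the four quadrants of a square support. This is a simplified illustration rather than the transform of Oren [5]: the sign conventions, the default shift of half the support size between neighboring positions, the lack of normalization, and the names `box_sum` and `haar_coefficients` are all assumptions.

```python
def box_sum(image, r, c, h, w):
    """Sum the intensities inside an h-by-w box whose top-left corner is (r, c)."""
    return sum(image[i][j] for i in range(r, r + h) for j in range(c, c + w))


def haar_coefficients(image, scale, step=None):
    """Compute horizontal, vertical and diagonal Haar responses.

    `scale` is the side of the square support (e.g. 4 or 2 pixels).
    `step` is the shift between positions; defaulting to half the
    support size is an assumption, not a value from the text.
    """
    half = scale // 2
    step = step or half
    rows, cols = len(image), len(image[0])
    coeffs = {}
    for r in range(0, rows - scale + 1, step):
        for c in range(0, cols - scale + 1, step):
            top_left = box_sum(image, r, c, half, half)
            top_right = box_sum(image, r, c + half, half, half)
            bottom_left = box_sum(image, r + half, c, half, half)
            bottom_right = box_sum(image, r + half, c + half, half, half)
            coeffs[(r, c)] = {
                # Change along the horizontal direction: left minus right.
                "horizontal": (top_left + bottom_left) - (top_right + bottom_right),
                # Change along the vertical direction: top minus bottom.
                "vertical": (top_left + top_right) - (bottom_left + bottom_right),
                # Diagonal (corner) response.
                "diagonal": (top_left + bottom_right) - (top_right + bottom_left),
            }
    return coeffs
```

Calling `haar_coefficients(image, 4)` and `haar_coefficients(image, 2)` on the same normalized image would yield the coefficients at the two scales mentioned above.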
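The selection of coefficients for one pose range can be sketched as follows. The sketch assumes each training image has already been abstracted into a dictionary keyed by a hypothetical (direction, scale, row, column) tuple; the names `select_coefficients` and `feature_vector`, and the bounds `low` and `high` standing in for the prescribed coefficient range, are illustrative.

```python
def select_coefficients(coefficient_maps, low, high):
    """Pick the coefficients whose mean over one pose range is outside [low, high].

    `coefficient_maps` holds one dict per training image in the pose range,
    each mapping (direction, scale, row, col) to a coefficient value.
    """
    count = len(coefficient_maps)
    means = {
        key: sum(m[key] for m in coefficient_maps) / count
        for key in coefficient_maps[0]
    }
    # Sorting gives every feature vector the same, reproducible element order.
    return sorted(key for key, mean in means.items() if mean < low or mean > high)


def feature_vector(coefficient_map, selected_keys):
    """Build the feature vector of one image for one pose range."""
    return [coefficient_map[key] for key in selected_keys]
```

The same `selected_keys` list would be reused at detection time, so that an input image region is represented by the same coefficients as the training images of that pose range.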
The prepared face image representations are used to train a 2-stage classifier which includes a bank of SVMs as an initial pre-classifier layer, and a neural network forming a subsequent decision classifier layer. As indicated previously, the bank of SVMs is composed of a plurality of SVMs each of which is trained to detect faces exhibiting a particular range of poses. To this end, the output of each SVM would indicate whether an input image region is a face having a pose within the SVM's range, or it is not. While this output could be binary (i.e., yes or no), it is preferred that a real-value output be produced. Essentially, the real-value output of each SVM pre-classifier is indicative of the distance of an input feature vector associated with an input image region from a decision hyperplane defined by the face images used to train the SVM. In other words, the output indicates how closely an input feature vector fits into the face class represented by the SVM. Ideally, each SVM would be configured such that an image region that is not a face exhibiting a pose range associated with the SVM, or not depicting a face at all, is indicated by a negative output value. If this were the case, the face detection and pose information could be derived directly from the SVM outputs. However, this ideal is hard to achieve, especially for input regions depicting a face just outside the pose range of an SVM. In reality, such image regions may produce relatively low, but positive output values. Thus, a definitive indication of the pose range could not be made easily. This is where the second stage neural network comes into play. The single neural network forming the second stage of the face detection system architecture acts as a “fusing” neural network that combines or fuses the outputs from each of the first stage SVMs.
Whereas one SVM alone cannot provide a definitive indication of the face pose range of an input face region, collectively, they can when all their outputs are considered via the fusing inherent in a neural network.
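The data flow through the two stages can be illustrated with a deliberately small sketch. It assumes linear SVMs, so that the real-value output is a signed distance-like score, and a single-layer fusing network whose weights are hand-picked for the example; a trained network, any hidden layers, and the names `svm_decision` and `fuse` are not specified by the text and are assumptions.

```python
def svm_decision(weights, bias, features):
    """Real-value output of a linear SVM: positive on the face side of the hyperplane."""
    return sum(w * x for w, x in zip(weights, features)) + bias


def fuse(svm_outputs, layer_weights, layer_biases):
    """Combine all SVM outputs and return the index of the active output node.

    One output node per pose range, plus at least one non-face node.
    The single linear layer is a stand-in for the fusing neural network.
    """
    scores = [
        sum(w * o for w, o in zip(row, svm_outputs)) + b
        for row, b in zip(layer_weights, layer_biases)
    ]
    return max(range(len(scores)), key=scores.__getitem__)


# Hypothetical fusing weights for two pose ranges and one non-face node.
LAYER_WEIGHTS = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
LAYER_BIASES = [0.0, 0.0, 0.0]
```

With these illustrative weights, SVM outputs of 0.3 and 0.8 (both positive, as can happen for a face near the boundary between two pose ranges) activate the node for the second pose range, while two negative outputs activate the non-face node.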
As indicated previously, the system must be trained before detecting faces and identifying face poses in an input image can be attempted. The SVMs are trained individually first, and then the neural network is trained using the outputs of the SVMs. Each SVM is trained by sequentially inputting the feature vectors derived from the training images associated with the same pose range category, i.e., the category to which the SVM is dedicated. As usual, the corresponding elements of each feature vector are input into the same input nodes of the SVM. Interspersed with these face image feature vectors are so-called negative examples that the SVM is instructed are not face images. A negative example is a feature vector created in the same way as the face image feature vectors, except that the image used is not of a face exhibiting a pose within the range being associated with the SVM. Initially, the images associated with the negative example vectors preferably depict “natural” scenes not containing faces. However, a problem can arise in that there is no typical example of the negative class. In other words, the number of different scenes not depicting a face is nearly infinite. To overcome this problem, a “bootstrapping” technique is employed. First, it must be noted that the aforementioned training image feature vectors and negative example vectors are input into the SVM repeatedly until the output of the SVM stabilizes (i.e., does not vary outside a prescribed threshold between training iterations for each corresponding inputted vector). Bootstrapping comes into play by introducing face images that have poses that fall outside the designated range of the SVM being trained once the outputs have stabilized. The feature vectors produced from these images are fed into the SVM without any indication that they are negative examples.
Whenever one of these negative examples derived from an “out-of-range” face image results in an output that indicates the face image is within the pose range associated with the SVM (i.e., a false alarm), the SVM is instructed that the input is a negative example. Such bootstrapping results in a more accurate SVM. The foregoing training procedure is repeated for each of the SVMs.
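The bootstrapping loop can be sketched as repeated retraining with false alarms added to the negative set. To stay dependency-free, the sketch swaps in a simple perceptron as a stand-in for an actual SVM trainer; that substitution, the retraining from scratch in each round, the round limit, and the names `score`, `train_linear` and `bootstrap_train` are all assumptions.

```python
def score(model, x):
    """Real-value output of a linear classifier given as (weights, bias)."""
    weights, bias = model
    return sum(w * v for w, v in zip(weights, x)) + bias


def train_linear(positives, negatives, max_epochs=1000):
    """Perceptron training, used here only as a stand-in for SVM training."""
    weights, bias = [0.0] * len(positives[0]), 0.0
    data = [(x, 1) for x in positives] + [(x, -1) for x in negatives]
    for _ in range(max_epochs):
        mistakes = 0
        for x, label in data:
            if label * score((weights, bias), x) <= 0:
                weights = [w + label * v for w, v in zip(weights, x)]
                bias += label
                mistakes += 1
        if mistakes == 0:
            break  # outputs have stabilized on the training set
    return weights, bias


def bootstrap_train(positives, negatives, out_of_range, max_rounds=10):
    """Add out-of-range faces that trigger false alarms to the negative set."""
    negatives = list(negatives)
    model = train_linear(positives, negatives)
    for _ in range(max_rounds):
        false_alarms = [
            x for x in out_of_range
            if score(model, x) > 0 and x not in negatives
        ]
        if not false_alarms:
            break
        negatives.extend(false_alarms)
        model = train_linear(positives, negatives)
    return model, negatives
```

For example, `bootstrap_train([[2.0], [3.0]], [[-3.0]], [[0.5], [1.0]])` first learns a boundary that wrongly accepts both out-of-range vectors, then adds them as negative examples and retrains until they are rejected.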
Once all the SVMs have been trained, the neural network is brought on-line and the set of feature vectors associated with each respective training image is, in turn, simultaneously input into the appropriate SVM. This is repeated until the outputs of the neural network stabilize. The feature vector sets can be input in any desired sequence. However, it is believed that inputting the vector sets in random pose range order will cause the neural network to stabilize more quickly. Finally, at least one face image feature vector set representing each pose range is, in turn, simultaneously input into the appropriate SVM, and the active output of the neural network is assigned as corresponding to a face being detected in an input image region and having the pose associated with the training image used to create the feature vector causing the output. The remaining neural network outputs (which will number at least one) are assigned as corresponding to a face not being detected in an input image region.
The system is now ready to accept prepared input image regions, and to indicate if the region depicts a face, as well as indicating the pose range exhibited by the face. To this end, the input image being searched is divided into regions. For example, a moving window approach can be taken where a window of a prescribed size is moved across the image, and at prescribed intervals, all the pixels within the window become the next image region to be tested for a face. However, it is not known what size a face depicted in an input image may be, and so the size of the window must be considered. One way of ensuring that a face of any practical size depicted in an input image is captured in the window is to adopt an image pyramid approach. In this approach the window size is selected so as to be the smallest practical. In other words, the window size is chosen to be the size of the smallest detectable face in an input image. This window size should also match the size chosen for the training face images used to train the system. For a tested embodiment of the present face detection system and process, a window size of 20 by 20 pixels was chosen. Of course, many or all of the faces depicted in an input image will likely be larger than the aforementioned window size. Thus, the window would only cover a portion of the bigger faces and detection would be unlikely. This is solved by not only searching the original input image with the search window (in order to find the “smallest” faces), but by also searching a series of reduced-scale versions of the original input image. For example, the original image can be reduced in scale in a stepwise fashion all the way down to the size of the search window itself, if desired. After each reduction in scale, the resulting image would be searched with the search window. In this way, larger faces in the original image would be made smaller and will eventually reach a size that fits into the search window.
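The moving window search over an image pyramid can be sketched as a generator of window positions. The scale step of 1.2 between pyramid levels and the stride of 2 pixels are hypothetical values, since the text speaks only of stepwise reduction and prescribed intervals; the names `pyramid_windows` and `to_original_box` are likewise illustrative.

```python
def pyramid_windows(height, width, window=20, scale_step=1.2, stride=2):
    """Yield (scale, row, col) for each search window at each pyramid level.

    `scale` is the reduction factor of the level relative to the original
    image; row and col are given in the coordinates of the reduced image.
    """
    scale = 1.0
    while True:
        level_h, level_w = int(height / scale), int(width / scale)
        if level_h < window or level_w < window:
            break  # the reduced image no longer holds a full window
        for r in range(0, level_h - window + 1, stride):
            for c in range(0, level_w - window + 1, stride):
                yield scale, r, c
        scale *= scale_step


def to_original_box(scale, row, col, window=20):
    """Map a window at a pyramid level back to (row, col, size) in the original image."""
    return int(row * scale), int(col * scale), int(window * scale)
```

A window found at scale 2.0 therefore corresponds to a face region of roughly 40 by 40 pixels in the original image.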
Each input image region extracted in the foregoing manner from any scale version of the input image is abstracted in a way similar to the training images to produce a set of feature vectors, which are, in turn, simultaneously input into the appropriate SVMs. For each feature vector set input into the system, an output is produced from the neural network having one active node. The active node will indicate first whether the region under consideration depicts a face, and secondly, if a face is present, into what pose range the pose of the face falls.
In an alternate embodiment of the foregoing search procedure, instead of using the input image (or scaled down versions thereof) directly and then abstracting each extracted region, the entire input image (or scaled down versions thereof) could be abstracted first and then each extracted region could be fed directly into the SVMs.