The approaches described in this section are approaches that could be pursued, but are not necessarily approaches that have been previously conceived or pursued. Therefore, unless otherwise indicated, it should not be assumed that any of the approaches described in this section qualify as prior art merely by virtue of their inclusion in this section.
Many existing digital image processing algorithms utilize facial recognition and detection techniques in order to identify human faces in a digital image. Identifying human faces is often a necessary or desired step in various image enhancement and image alteration applications. For example, identifying human faces can be used as a step in algorithms that enhance the separation of a subject in the foreground from the background in order to enhance depth of field or to separate the face or a body from the background altogether. Identifying human faces can also be used as a step in image correction algorithms that are used to identify and correct defects in a digital image. For example, by knowing whether an object is a face or not a face, a location of eyes can be estimated and used to increase the number of true positives identified by a redeye removal algorithm and reduce the number of false positives identified by the redeye removal algorithm.
A well-known fast face-detection algorithm is disclosed in U.S. Patent Application Publication No. 2002/0102024, which is hereby incorporated by reference in its entirety for all purposes. That patent application proposes a classifier chain consisting of a series of sequential feature detectors. According to one implementation, a set of training data includes known face-containing images (which are tightly cropped around the faces therein, such that the faces dominate the images' areas), which have been labeled as such, and known face-omitting images, which also have been labeled as such. For each image in the training data, the values of various features (discussed in further detail below) within that image are observed, such that the same features for each such image are observed; in different images, different values may be observed for the same feature. A machine-learning mechanism processes the training data to learn, automatically, numerical ranges into which the values of features of known face-containing images tend to fall and outside of which the values of the corresponding features of known face-omitting images tend to fall; each different feature may be associated with a different numerical range. The machine-learning mechanism generates the classifier chain based on this processing. Each classifier in the classifier chain corresponds to a separate feature and associated numerical range. Classifiers that are more likely to filter out face-omitting images may be placed earlier in the classifier chain than classifiers that are less likely to do so.
Unlabeled images (not in the training data) are subjected successively to the classifiers in the classifier chain in order to determine whether those images probably contain faces. For a given classifier in the classifier chain, a determination is made as to whether the value observed for that classifier's corresponding feature in the unlabeled image falls within the previously machine-learned numerical range associated with that classifier's corresponding feature. As soon as an unlabeled image (or a selected portion thereof) fails to pass a particular classifier in the classifier chain (due to a value of a feature in the image falling outside of the corresponding classifier's numerical range), it is concluded that the image (or the selected portion thereof) probably does not contain a face. Subsequent classifiers in the chain do not thereafter need to be applied to the image (or the selected portion thereof). In order to increase face-detection speed, face-omitting images are eliminated from consideration as early as possible. In one implementation, an image (or a selected portion thereof) is only determined to be likely to contain a face if that image (or selected portion thereof) passes all of the classifiers in the classifier chain.
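The early-rejection behavior described above can be sketched as follows. This is a minimal illustration, not the implementation of the cited publication: the representation of a classifier as a (feature function, low, high) tuple, the names `passes_chain`, `total_intensity`, and `left_minus_right`, and the numerical ranges are all assumptions made for the example.

```python
def passes_chain(window, chain):
    """Return True only if the window passes every classifier in the chain.

    Each classifier is a hypothetical (feature_fn, low, high) tuple; the
    window is rejected at the first feature value outside [low, high].
    """
    for feature_fn, low, high in chain:
        value = feature_fn(window)
        if not (low <= value <= high):
            return False  # early rejection: later classifiers are never run
    return True


def total_intensity(window):
    """Toy feature: sum of all pixel intensities in the window."""
    return sum(sum(row) for row in window)


def left_minus_right(window):
    """Toy feature: left-half intensity sum minus right-half intensity sum."""
    half = len(window[0]) // 2
    left = sum(sum(row[:half]) for row in window)
    right = sum(sum(row[half:]) for row in window)
    return left - right


# Illustrative ranges only; real ranges would be machine-learned.
chain = [(total_intensity, 10, 100), (left_minus_right, -5, 5)]

candidate = [[5, 5], [5, 5]]   # passes both classifiers
rejected = [[0, 0], [0, 1]]    # fails the first classifier
print(passes_chain(candidate, chain), passes_chain(rejected, chain))
```

Because the loop returns at the first failure, the cost of evaluating a face-omitting window depends on how early in the chain it is rejected, which is why the more selective classifiers are placed first.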
In order to process images extremely rapidly while achieving high facial feature detection rates, automated face-detection techniques may generate and use an “integral image.” Generation and use of an integral image is described in “Robust Real-Time Face Detection” by Paul Viola and Michael J. Jones, in International Journal of Computer Vision 57(2), 137-154 (2004), which is incorporated by reference herein. An integral image is automatically generated based on a source image (e.g., an image captured by a digital camera). The integral image can be computed from the source image using a few operations per pixel. After the integral image has been computed, image features (called “Haar-like” features by Viola and Jones due to those features' conceptual relatedness to Haar Basis functions) within the corresponding source image can be detected rapidly. Based on the values of certain image features in certain regions of the source image, portions of the source image can be determined to probably represent either facial portions or non-facial portions. If a sufficient quantity of various different Haar-like feature values fall within specified ranges for those features (each such feature's value possibly being compared to a different feature-corresponding range), then it can be reasonably concluded that the area of the image in which all of those Haar-like features occur contains a face. Generation and characteristics of an integral image are discussed in greater detail below, but Haar-like features are briefly discussed first.
Viola and Jones propose the use of three different types of features: two-rectangle features, three-rectangle features, and four-rectangle features. Regarding two-rectangle features, some rectangular region of the source image is divided into two adjacent rectangles. These rectangles may be side-by-side or one on top of the other. The value of the two-rectangle feature is equal to the difference between (a) the sum of the pixel intensities within one of the rectangles and (b) the sum of the pixel intensities within the other of the rectangles. Regarding three-rectangle features, some rectangular region of the source image is divided into three rectangles of equal area; again, these may be side-by-side in a row or one above the other in a column. One of the three rectangles—the center rectangle—will be positioned between the other two outer rectangles. The value of the three-rectangle feature is equal to (a) the sum of the pixel intensities within the center rectangle minus (b) the sum of the pixel intensities within the two outer rectangles. Regarding four-rectangle features, some rectangular region of the source image is divided into four rectangular quadrants. The upper-left quadrant and the lower-right quadrant make up one diagonal quadrant pair, while the upper-right quadrant and the lower-left quadrant make up another diagonal quadrant pair. The value of the four-rectangle feature is the difference between (a) the sum of the pixel intensities within one of these diagonal quadrant pairs and (b) the sum of the pixel intensities within the other of these diagonal quadrant pairs.
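As a rough illustration of the three feature types, the sketch below computes each one by summing pixels directly. It assumes side-by-side (horizontal) layouts for the two- and three-rectangle features, and it picks one sign convention where the text leaves the order of subtraction open (left minus right, and the upper-left/lower-right diagonal pair minus the other pair). The function names are hypothetical.

```python
def region_sum(img, top, left, height, width):
    """Sum of pixel intensities in a rectangle, scanning every pixel."""
    return sum(img[r][c]
               for r in range(top, top + height)
               for c in range(left, left + width))


def two_rect_feature(img, top, left, height, width):
    """Left rectangle sum minus right rectangle sum (assumed sign)."""
    half = width // 2
    return (region_sum(img, top, left, height, half)
            - region_sum(img, top, left + half, height, half))


def three_rect_feature(img, top, left, height, width):
    """Center rectangle sum minus the sum of the two outer rectangles."""
    third = width // 3
    center = region_sum(img, top, left + third, height, third)
    outer = (region_sum(img, top, left, height, third)
             + region_sum(img, top, left + 2 * third, height, third))
    return center - outer


def four_rect_feature(img, top, left, height, width):
    """One diagonal quadrant pair minus the other (assumed sign)."""
    hh, hw = height // 2, width // 2
    ul = region_sum(img, top, left, hh, hw)
    ur = region_sum(img, top, left + hw, hh, hw)
    ll = region_sum(img, top + hh, left, hh, hw)
    lr = region_sum(img, top + hh, left + hw, hh, hw)
    return (ul + lr) - (ur + ll)


ramp = [[1, 2, 3, 4, 5, 6],
        [1, 2, 3, 4, 5, 6]]
checker = [[9, 1],
           [1, 9]]
print(two_rect_feature(ramp, 0, 0, 2, 6))     # -18
print(three_rect_feature(ramp, 0, 0, 2, 6))   # -14
print(four_rect_feature(checker, 0, 0, 2, 2)) # 16
```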
Determining the features of the source image typically involves calculating sums of pixel intensity values within many different rectangular regions of the source image. A naive approach would scan all of the pixels of each rectangular region separately. Because such rectangular regions may overlap, this approach would scan certain source image pixels repeatedly, once for each rectangular region in which a given pixel occurs. Fortunately, after the integral image has been generated, such repetitive scanning of source image pixels can largely be avoided.
The rectangular features discussed above can be computed more rapidly using an integral image than using a source image directly. Each pixel of the integral image corresponds to a similarly located (in the same column and row) pixel in the source image, such that the integral image has the same pixel height and width as the source image. However, for each particular pixel in the integral image, the intensity value of that particular pixel is equal to the sum of the intensity values of all of the source image's pixels occurring within the rectangular region that occurs above and to the left of, and including, the particular pixel's corresponding position. For each particular pixel in the integral image, that particular pixel's corresponding rectangular region's upper-left corner is the source image's pixel at the upper-left corner of the source image, and that particular pixel's corresponding rectangular region's lower-right corner is the source image's pixel that is located, in the source image, at the same position in which the particular pixel is located in the integral image. Expressed mathematically,
$$ii(x, y) = \sum_{x' \le x,\; y' \le y} i(x', y'),$$
where ii(x, y) is the integral image, and i(x, y) is the source image.
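The definition above can be evaluated in a single pass over the source image by keeping a running sum for the current row and adding the integral value directly above. The sketch below assumes the image is a list of rows indexed as `src[y][x]`; the function name `integral_image` is hypothetical.

```python
def integral_image(src):
    """Return ii where ii[y][x] is the sum of src[y'][x'] for y' <= y, x' <= x."""
    height = len(src)
    width = len(src[0]) if height else 0
    ii = [[0] * width for _ in range(height)]
    for y in range(height):
        row_sum = 0  # cumulative sum of the current row up to column x
        for x in range(width):
            row_sum += src[y][x]
            above = ii[y - 1][x] if y > 0 else 0
            ii[y][x] = row_sum + above
    return ii


src = [[1, 2, 3],
       [4, 5, 6]]
print(integral_image(src))  # [[1, 3, 6], [5, 12, 21]]
```

Each pixel costs two additions, consistent with the integral image being computable using a few operations per pixel.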
The integral image may be viewed as a two-dimensional array of intensity values in which one dimension's size is equal to the integral image's width, and in which the other dimension's size is equal to the integral image's height. For any rectangular region of the source image, the sum of the pixel intensity values in that rectangular region can be determined mathematically using just a few values from the array representation of the corresponding integral image. For example, for any given rectangular region having an upper-left corner at position A, an upper-right corner at position B, a lower-left corner at position C, and a lower-right corner at position D, the sum of the source image's pixel intensity values in that rectangular region can be computed quickly by adding the values at positions A and D in the array to generate a first sum, adding the values at positions B and C in the array to generate a second sum, and then subtracting the second sum from the first sum. (Strictly, for the region's own top row and left column to be counted, position A lies diagonally above and to the left of the region's upper-left pixel, position B lies directly above the region's upper-right pixel, and position C lies directly to the left of the region's lower-left pixel.)
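The four-lookup computation can be sketched as follows. To keep the lookups for A, B, and C in bounds when a region touches the top or left border of the image, this example stores the integral image with one extra row and column of zeros; that padding is an implementation choice made here, not something the text specifies. The names `padded_integral_image` and `rect_sum` are hypothetical.

```python
def padded_integral_image(src):
    """Integral image with a leading zero row and column.

    ii[r][c] holds the sum of src[y][x] for y < r and x < c.
    """
    height, width = len(src), len(src[0])
    ii = [[0] * (width + 1) for _ in range(height + 1)]
    for y in range(height):
        for x in range(width):
            ii[y + 1][x + 1] = (src[y][x] + ii[y][x + 1]
                                + ii[y + 1][x] - ii[y][x])
    return ii


def rect_sum(ii, top, left, bottom, right):
    """Sum of source pixels in rows top..bottom, columns left..right (inclusive)."""
    a = ii[top][left]              # diagonally above-left of the region
    b = ii[top][right + 1]         # directly above the upper-right pixel
    c = ii[bottom + 1][left]       # directly left of the lower-left pixel
    d = ii[bottom + 1][right + 1]  # at the lower-right pixel itself
    return (a + d) - (b + c)


src = [[1, 2, 3],
       [4, 5, 6],
       [7, 8, 9]]
ii = padded_integral_image(src)
print(rect_sum(ii, 1, 1, 2, 2))  # 5 + 6 + 8 + 9 = 28
```

Once the table is built, every rectangle sum costs four array reads regardless of the rectangle's size, so a two-rectangle feature needs only a handful of lookups rather than a scan of every pixel.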
A face-recognition algorithm may behave differently when applied to the same face under different illumination conditions. When a face is illuminated uniformly, a face-recognition algorithm may detect facial features (e.g., eyes) more accurately than when the same face is illuminated only partially, such as when one side of the face is illuminated by a light source to the side of that face, leaving the other side of the face in relative shadow. Extreme cases such as underexposure (caused by low light or backlighting) and overexposure will decrease contrast on the face. Self-shadows caused by a directional illuminant or shadows introduced by foreign objects are more unpredictable because they can change the appearance of facial features. As is discussed above, when Haar-like features are used to detect faces within a source image, the geometric relationships and contrast information between adjacent rectangular regions are extracted. The Haar-like feature function translates this illumination and geometric information into a numerical value. If the same Haar-like feature function is evaluated under different lighting conditions, the numerical result will be at least slightly different.
FIGS. 2A and 2B illustrate an example of the same face under two different lighting conditions. A left face 202 is uniformly illuminated, having little contrast between the left and right sides of face 202. In contrast, a right face 204 is illuminated from the right side, leaving the left side of face 204 in shadow, thus producing a high and non-uniform contrast between the left and right sides of face 204; right face 204 becomes progressively darker proceeding from the right side toward the left side. Thus, in left face 202, the difference computed between (a) the sum of pixel intensity values of rectangle 206 and (b) the sum of pixel intensity values of rectangle 208 will be relatively small; the illumination difference between rectangles 206 and 208 is mostly due to the fact that rectangle 208 contains an eye while rectangle 206 does not. In contrast, in right face 204, the difference computed between (a) the sum of pixel intensity values of rectangle 210 and (b) the sum of pixel intensity values of rectangle 212 will be relatively larger; the larger difference in illumination between rectangles 210 and 212 is due not only to the fact that rectangle 212 contains an eye while rectangle 210 does not, but also to the fact that rectangle 210 is generally darker than rectangle 212. This may be the reverse of the case with left face 202, in which the presence of the eye within a rectangle caused that rectangle's average pixel luminance to be darker, not lighter, than that of the rectangle adjacent to it. The values of the Haar-like features of the same face, under different lighting conditions, will be different, potentially causing a face-detecting algorithm to produce different results even though the same person's face is being evaluated in both images. Clearly, a face should be detected in both images.
Ideally, under circumstances in which the same face is being evaluated in different lighting conditions, the values of the same Haar-like features (e.g., two-rectangle features positioned over the ocular region of the face, as in FIGS. 2A and 2B), would be similar. In order to minimize the variance in Haar-like feature values that is caused by differing lighting conditions, one corrective approach divides each source image's feature value by the statistical variance of all of the pixel intensity values within the source image's entire facial region (which might, in some cases, include all of the source image's pixels). If this statistical variance is low, as would be the case with a uniformly illuminated, low contrast face such as face 202, then the division will have a relatively minor effect on the feature values. In contrast, if this statistical variance is high, as would be the case with a non-uniformly illuminated, high contrast face such as face 204, then the division will have a relatively major effect on the feature values. The division is performed both during the machine-learning procedure, relative to labeled images, and during the classifier-applying procedure, relative to unlabeled images. The division has a normalizing effect on the feature values, so that feature values in non-uniformly illuminated faces will tend to fall within the same numerical ranges as corresponding feature values in uniformly illuminated faces; without such normalization, the learned numerical ranges would probably be so broad as to reduce greatly their discriminatory ability when applied to unlabeled images.
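The division described above can be sketched as follows. The example computes the population variance of the facial region's pixel intensities as the mean of the squares minus the square of the mean, then divides a feature value by it. The function names are hypothetical, and returning 0.0 for a perfectly flat region (zero variance) is an assumption made here to avoid division by zero; the text does not say how that case is handled.

```python
def intensity_variance(pixels):
    """Population variance of pixel intensities: E[p^2] - (E[p])^2."""
    n = len(pixels)
    mean = sum(pixels) / n
    mean_of_squares = sum(p * p for p in pixels) / n
    return mean_of_squares - mean * mean


def normalize_feature(feature_value, region_pixels):
    """Divide a Haar-like feature value by the facial region's variance."""
    variance = intensity_variance(region_pixels)
    if variance == 0:
        return 0.0  # assumed handling of a flat region
    return feature_value / variance


region = [10, 10, 20, 20]             # flattened facial-region intensities
print(intensity_variance(region))     # 25.0
print(normalize_feature(50, region))  # 2.0
```

The same function would be applied both when learning the numerical ranges from labeled images and when evaluating unlabeled images, so that the compared values are on the same normalized scale.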
Unfortunately, this corrective approach, involving division by statistical variance, suffers from the fact that computing the statistical variance of all of the pixel values within the source image's entire facial region consumes considerable memory and processing time, making it poorly suited to real-time face detection applications. This is the case even when the integral image is used to reduce the quantity of computations needed to calculate the statistical variance. The integral image itself consumes significant memory resources, and its generation consumes significant processing time.