In the modern world, photographs, and particularly digital images, can be acquired from numerous sources, including digital cameras, camcorders, cell phones (phone cams), web cams, and video phones. Consumers use these and other devices to generate personal images of themselves, family, and friends. Thus, with the advent of digital photography, consumers are amassing large collections of digital images and videos. Currently, the length of time spanned by a typical consumer's digital image collection is only a few years. However, the average number of images captured with digital cameras per photographer is still increasing each year. Unfortunately, the organization and retrieval of images and videos from electronic data file collections is already a problem for the typical consumer. The organization and retrieval problem will continue to grow as the length of time spanned by the average digital image and video collection increases.
A user often desires to find images and videos containing a particular person of interest. The user can perform a manual search to find such images and videos; however, this is a slow, laborious process. Even though some commercial software (e.g., Adobe Album) allows users to tag images with labels (metadata) indicating the people in the images so that searches can later be done, the initial labeling process is still very tedious and time consuming.
Digital images can also be tagged using algorithmic methods that target various search and sorting strategies. For example, digital pictures can be searched and sorted based upon event classifications or people recognition and identification. Other reference data, such as time and date, location (including GPS-enabled), object or pet recognition, and capture condition data can be used to support the search, sorting, and tagging processes. Of course, in many cases, people recognition and identification is particularly important.
Face recognition software assumes the existence of a ground-truth labeled set of images (i.e., a set of images with corresponding person identities). Most consumer image collections do not have a similar set of ground truth images. The ground-truth labeled set of images can be based upon reference images, from which quantitative data representing key facial attributes or features can be derived and used as identity markers. In addition, the labeling of faces in images is complex because many consumer images have multiple persons in them. Simply labeling an image with the identities of the people in the image therefore does not indicate which person in the image is associated with which identity. Recognition of people in images (still or video) can also be facilitated using other cues, including eye, skin, or hair color, presence and geometry of eyewear, color and pattern in clothing (apparel), traits of physical motion, and voice characteristics (prosody).
Automatic recognition of individuals in consumer still images, including images from photo-collections, as well as typical consumer video images, is complicated by the unconstrained nature of these images. As consumer images are not captured using the sensibilities of a professional photographer, the framing, pose, and lighting may be less than optimal, which can complicate later identification efforts. Moreover, as consumers are unconstrained in their picture taking events, the settings, appearance, backgrounds and foregrounds, and user activities are very diverse as compared to posed studio photography. Multiple people can often be found in the same frame, and occlusion or partial obscuration of an individual (particularly their face) frequently occurs.
One approach to enable the identification of people using previously captured still images is described in the commonly assigned U.S. patent application Ser. No. 11/755,343, by Lawther et al. Lawther '343 anticipates a person recognition method that works with photo-collections spanning multiple events or sub-events. A set of images is analyzed, with people and faces being located and counted, and then characterized relative to relevant features (face, pose, hair, etc.). An interactive person identifier algorithm is used to identify unique faces in the images. If an image contains a person that the database has no record of, the interactive person identifier displays the identified face with a circle around it in the image. Thus, a user can label the face with the name and any other appropriate types of data. However, if the person has appeared in previous images, data associated with the person can be retrieved for matching, using person classifier algorithms and personal profile data. Such recorded distinctions include person identity, event number, image number, face shape, face points, face/hair color/texture, head image segments, pose angle, 3-D models and associated features. The method of the Lawther '343 application attempts to use facial data collected from multiple images taken during an event, or during multiple time proximate image capture events, to construct a composite model of at least a portion of the particular person's head (face). An image capture event can be a singular occurrence in space and time, or a series of linked events or sub-events that fall within a larger super-event. Lawther '343 then anticipates that the composite model of an individual can be used to identify images of that individual in photos captured during subsequent time proximate capture events.
Lawther '343 further anticipates that if substantial time gaps occur between use of the composite model and subsequent image capture events, that the composite model can be morphed to compensate for changes in facial characteristics.
Commonly assigned U.S. Pat. Nos. 6,606,411 and 6,351,556, both by A. Loui et al., disclose algorithms for clustering image content by temporal events and sub-events. U.S. Pat. No. 6,606,411 teaches that events have consistent color distributions, and therefore, the pictures of an event are likely to have been taken with the same backdrop. For each sub-event, a single color and texture representation is computed for all background areas taken together. The above patents teach how to cluster images and videos in a digital image collection into temporal events and sub-events. The disclosures of the above patents are hereby incorporated by reference in their entirety. The terms “event” and “sub-event” are used in an objective sense to indicate the products of a computer mediated procedure that attempts to match a user's subjective perceptions of specific occurrences (corresponding to events) and divisions of those occurrences (corresponding to sub-events). A collection of images can be classified into one or more events, based on time or date clustering and texture comparison mapping of the images. The plurality of images is separated into the events based on having one or more identified boundaries between events, where the boundaries correspond to the one or more largest time differences. For each event, sub-events (if any) can be determined by comparing the color histogram information of successive images as described in U.S. Pat. No. 6,351,556. This is accomplished by dividing an image into a number of blocks and then computing the color histogram for each of the blocks. A block-based histogram correlation procedure is used as described in U.S. Pat. No. 6,351,556 to detect sub-event boundaries.
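By way of a non-limiting illustration, the time-based separation of a collection into events can be sketched as follows. This is a minimal sketch and not the procedure of the Loui patents: the function name `split_into_events`, the parameter `num_boundaries`, and the sample capture times are assumptions introduced here, and the sketch covers only the placement of boundaries at the largest time differences, not the color histogram comparison used for sub-events.

```python
from datetime import datetime


def split_into_events(capture_times, num_boundaries):
    """Split capture times into events at the largest time differences.

    Hypothetical helper: boundaries are placed at the `num_boundaries`
    largest gaps between consecutive (sorted) capture times.
    """
    times = sorted(capture_times)
    if not times:
        return []
    if len(times) < 2 or num_boundaries <= 0:
        return [times]
    # gaps[i] pairs the time difference between image i and image i + 1
    # with the index i after which a boundary could be placed.
    gaps = [(times[i + 1] - times[i], i) for i in range(len(times) - 1)]
    largest = sorted(gaps, reverse=True)[:num_boundaries]
    cut_after = sorted(index for _, index in largest)
    events, start = [], 0
    for index in cut_after:
        events.append(times[start:index + 1])
        start = index + 1
    events.append(times[start:])
    return events


# Illustrative capture times (made up for this sketch).
capture_times = [
    datetime(2008, 7, 4, 10, 0),
    datetime(2008, 7, 4, 10, 5),
    datetime(2008, 7, 4, 10, 7),
    datetime(2008, 7, 4, 15, 0),
    datetime(2008, 7, 4, 15, 2),
    datetime(2008, 7, 6, 9, 0),
]
events = split_into_events(capture_times, num_boundaries=2)
for number, event in enumerate(events, start=1):
    print(number, [t.isoformat() for t in event])
```

In this sketch the number of boundaries is supplied by the caller; a practical system would more likely derive it from the distribution of the time differences.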
Taken together, the approaches described in the Loui '411 and Loui '556 patents can be used to cluster digital images into relevant photo collections, and the composite face model method of the Lawther '343 application can be used as an aid in recognizing individuals within the digital images of the photo collections. However, the face recognition method of Lawther '343 is vulnerable to misidentification of individuals over time, as their facial characteristics change in ways not properly compensated by morphing of a composite model (or other facial models).
By comparison, commonly assigned U.S. Patent Publication No. US 2006/0245624 A1 (U.S. patent application Ser. No. 11/116,729), by Gallagher et al., entitled: “Using Time in Recognizing Persons in Images”, anticipates a process of photo recognition that utilizes different facial models of an individual for recognition, based upon the age of the individual. In particular, Gallagher '624 anticipates that an appearance model generator generates a series of appearance models for an individual over the course of time, such that a set of appearance models spans a period of an individual's life. For example, an additional appearance model may be generated periodically every year or every five years, depending on the age of the person. A set of appearance models for an individual spanning a period of life can be subsequently used to recognize the individual in pictures from that time span. In particular, an individual recognition classifier uses the image capture time associated with a set of images and the features of an appearance model having an associated time that is associated with a particular person of interest, to produce a person classification describing the likelihood or probability that the detected person is the person of interest. More generally, these appearance models can then be used to identify the individual in prior or subsequent captured consumer still images including that person.
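As a hedged illustration of how a set of time-indexed appearance models might be used, the sketch below selects, for a given person of interest, the appearance model whose associated time is closest to the image capture time. The `AppearanceModel` class, the `select_model` function, and the nearest-in-time selection rule are assumptions made for this sketch; Gallagher '624 describes a classifier that produces a likelihood from capture time and model features, which is not reproduced here.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class AppearanceModel:
    person: str          # identity the model belongs to
    model_date: date     # time associated with the model
    features: tuple      # placeholder for the model's feature data


def select_model(models, person, capture_date):
    """Return the person's model nearest in time to the capture date.

    Hypothetical selection rule; returns None if the person has no model.
    """
    candidates = [m for m in models if m.person == person]
    if not candidates:
        return None
    return min(candidates,
               key=lambda m: abs((m.model_date - capture_date).days))


# Illustrative models generated every five years (made-up data).
models = [
    AppearanceModel("Ann", date(2000, 1, 1), (0.1, 0.2)),
    AppearanceModel("Ann", date(2005, 1, 1), (0.2, 0.3)),
    AppearanceModel("Ann", date(2010, 1, 1), (0.3, 0.4)),
]
chosen = select_model(models, "Ann", date(2006, 6, 1))
print(chosen.model_date)
```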
Notably, Gallagher '624, in a fashion similar to Lawther '343, anticipates that the appearance models of a person will be assembled using collections of time-clustered and labeled (user verified) images that include that person. Thus, while Lawther '343 anticipates adapting to changes in personal appearance over time by morphing the facial (composite) models, Gallagher '624 anticipates pro-actively generating new facial models periodically, according to a schedule based on user age. However, both Lawther '343 and Gallagher '624 use images from photo-collections to build their facial or composite models, as these images become available. Neither of these approaches anticipates pro-actively assessing the need to generate new recognition models for individuals in response to changes in their appearance, nor pro-actively generating the models in response to the recognized need.
It is noted that a variety of complementary or competing facial recognition models have been developed in recent years. A rather complete survey of recognition models is provided by the paper: “Face Recognition: A Literature Survey”, by W. Zhao, R. Chellappa, P. J. Phillips, and A. Rosenfeld, which was published in ACM Computing Surveys, Vol. 35, pp. 399-458, 2003.
The first proposed facial recognition model is the “Pentland” model, which is described in: “Eigenfaces for Recognition”, by M. Turk and A. Pentland, in the Journal of Cognitive Neuroscience, Vol. 3, No. 1, pp. 71-86, 1991. The Pentland model is a 2-D model intended for assessing direct-on facial images. The utility of this model can be limited for consumer pictures, as subjects can be oriented in any direction. This model discards most facial data and keeps data indicative of where the eyes, mouth, and a few other features are. These features are located by texture analysis. This data is distilled down to eigenvectors (direction and extent) related to a set of defined face points (such as eyes, mouth, and nose) that model a face. As the Pentland model requires accurate eye locations for normalization, it is sensitive to pose and lighting variations. Although the Pentland model works, it has been much improved upon by newer models that address its limitations.
The Active Shape Model (ASM) is another facial model useful for recognizing people in images. The ASM, which is a 2-D facial model with faces described by a series of facial feature points, was described in the paper: “Active Shape Models - Their Training and Application”, by T. F. Cootes, C. J. Taylor, D. Cooper, and J. Graham, published in Computer Vision and Image Understanding, Vol. 61, No. 1, pp. 38-59, January 1995. As originally discussed by Cootes et al., the ASM approach can be applied to faces as well as other shapes or objects. For faces, Cootes et al. only anticipated using face points related to the eyes, nose, and mouth. However, in the previously mentioned 2002 paper by Bolin and Chen, the application of ASM for face recognition was enhanced with an expanded collection of facial feature points, and in particular the 82 facial feature point model depicted in FIG. 5a. Localized facial features can be described by distances between specific feature points, by angles formed by lines connecting sets of specific feature points, or by coefficients obtained by projecting the feature points onto principal components that describe the variability in facial appearance. The facial measurements used here are derived from anthropometric measurements of human faces that have been shown to be relevant for judging gender, age, attractiveness, and ethnicity. The accompanying Table 1 and Table 2 describe a series of arc length features and ratios of linear facial features that can be quantified using the 82 facial feature point model shown in FIG. 5a. In these tables, point PC is the point located at the centroid of points 0 and 1 (i.e., the point exactly between the eyes), and the arc-length features are divided by the interocular distance to normalize across different face sizes. A more complete listing of derivable facial features is given in the commonly assigned Das '308 application (equivalent to U.S. Patent Publication No. US 2005/0111737 A1).
TABLE 1
List of Arc Length Features

NAME                 COMPUTATION
Mandibular Arc       Arc (P69, P81)
Supra-Orbital Arc    (P56-P40) + Int (P40, P44) + (P44-P48) + Arc (P48, P52) + (P52-P68)
Upper-Lip Arc        Arc (P23, P27)
Lower-Lip Arc        Arc (P27, P30) + (P30-P23)
TABLE 2
List of Ratio Features

NAME                          NUMERATOR    DENOMINATOR
Eye-to-Nose/Eye-to-Mouth      PC-P2        PC-P32
Eye-to-Mouth/Eye-to-Chin      PC-P32       PC-P75
Head-to-Chin/Eye-to-Mouth     P62-P75      PC-P32
Head-to-Eye/Eye-to-Chin       P62-PC       PC-P75
Head-to-Eye/Eye-to-Mouth      P62-PC       PC-P32
Nose-to-Chin/Eye-to-Chin      P35-P75      PC-P75
Mouth-to-Chin/Eye-to-Chin     P35-P75      PC-P75
Head-to-Nose/Nose-to-Chin     P62-P2       P2-P75
Mouth-to-Chin/Nose-to-Chin    P35-P75      P2-P75
Jaw Width/Face Width          P78-P72      P56-P68
Eye-Spacing/Nose Width        P07-P13      P37-P39
Mouth-to-Chin/Jaw Width       P35-P75      P78-P72
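As a non-limiting illustration, ratio features of the kind listed in Table 2 can be computed from a set of 2-D facial feature point coordinates roughly as sketched below. The function names, the dictionary layout, and the sample coordinates are assumptions made for this sketch (the coordinates are invented and do not come from the 82 point model of FIG. 5a); only three of the Table 2 ratios are included.

```python
import math


def midpoint(a, b):
    """Centroid of two 2-D points."""
    return ((a[0] + b[0]) / 2.0, (a[1] + b[1]) / 2.0)


def distance(a, b):
    """Euclidean distance between two 2-D points."""
    return math.hypot(a[0] - b[0], a[1] - b[1])


# Each entry: feature name -> (numerator point pair, denominator point pair),
# following the NUMERATOR and DENOMINATOR columns of Table 2.
RATIO_FEATURES = {
    "Eye-to-Nose/Eye-to-Mouth": (("PC", "P2"), ("PC", "P32")),
    "Eye-to-Mouth/Eye-to-Chin": (("PC", "P32"), ("PC", "P75")),
    "Jaw Width/Face Width": (("P78", "P72"), ("P56", "P68")),
}


def ratio_features(points):
    """Compute the ratio features for a dict of named feature points."""
    pts = dict(points)
    # Point PC is the centroid of points 0 and 1 (between the eyes).
    pts["PC"] = midpoint(pts["P0"], pts["P1"])
    result = {}
    for name, ((n1, n2), (d1, d2)) in RATIO_FEATURES.items():
        result[name] = distance(pts[n1], pts[n2]) / distance(pts[d1], pts[d2])
    return result


# Invented pixel coordinates, for illustration only.
sample_points = {
    "P0": (-30.0, 0.0), "P1": (30.0, 0.0),
    "P2": (0.0, 40.0), "P32": (0.0, 70.0), "P75": (0.0, 120.0),
    "P78": (-50.0, 80.0), "P72": (50.0, 80.0),
    "P56": (-70.0, 0.0), "P68": (70.0, 0.0),
}
features = ratio_features(sample_points)
for name, value in features.items():
    print(f"{name}: {value:.3f}")
```

Because each feature is a ratio of two distances measured on the same face, it is unaffected by uniform scaling of the image, which is one reason such ratios are convenient as identity markers.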
This expanded active shape model is more robust than the Pentland model, as it can handle some variations in lighting, and pose variations ranging out to 15 degrees pose tilt from normal. Notably, the ASM does not use or model texture based data, such as that related to hair and skin.
As a further progression in recognition models, the active appearance model (AAM) expands upon the ASM approach by complementing the geometry data and analysis with texture data. The texture data, which is high frequency data related to wrinkles, hair, and shadows, can be applied to each facial location. The AAM approach is described in: “Constrained Active Appearance Models”, by T. F. Cootes and C. J. Taylor, published in the 8th International Conference on Computer Vision, Vol. 1, pp. 748-754, IEEE Computer Society Press, July 2001. The AAM approach utilizes more information, and thus is more robust than the ASM approach for identification and recognition. The AAM approach is used in the previously discussed and commonly assigned Gallagher '624 patent application. However, the AAM is only a 2-D model, and is more sensitive to lighting and pose variations than the ASM approach, which limits its use to frontal pictures only.
By comparison, “composite” models 360 represent an advance of facial recognition models to a 3-D geometry that maps both the face and head. The composite model approach was introduced in: “Face Recognition Based On Fitting A 3-D Morphable Model”, by V. Blanz and T. Vetter, which was published in IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 25, pp. 1063-1074, 2003. In general, this concept extends the facial feature point approach of the ASM over multiple poses. As described by Blanz and Vetter, a reference 3-D facial model and facial characteristics of the human face were created by scanning multiple faces with a light beam from a laser scanner. Thereafter, for a new subject, a collection of pictures of the person is acquired at different poses under a range of lighting conditions. Then, a 3-D model of that person can be generated by fitting their facial shape and texture data to the reference 3-D facial model. Using the person-specific 3-D facial model, that individual can be subsequently recognized in a new picture. This approach does provide accurate pose estimation, with ~1° resolution. However, the recognition process is slow with current software, as it takes several minutes to match a single individual.
The previously cited, commonly assigned Lawther '343 application provides an alternate approach to Blanz and Vetter for creating a 3-D facial model (composite model) for photo-analysis. In particular, in Lawther '343, the subjects are not constrained by an image acquisition process involving a sequence of multiple poses and lighting conditions. Rather, the process of Lawther '343 attempts to generate a 3-D composite model 360 for an individual from a collection of existing pictures. The pictures, which can vary by pose or lighting, are analyzed to retrieve the available facial feature points (see FIG. 5a) according to the expanded ASM approach of Bolin and Chen. A composite model 360 of the individual is then generated by mapping the facial feature points from the multiple pictures into a “master” composite model 360. Of course, this approach is sensitive to the potential of missing data in the available collection of pictures. That is, if a collection of pictures of an individual lacks certain poses, this approach cannot compensate, and can at best interpolate, but with reduced accuracy. As a result, an exemplary partially complete composite model, assembled with only frontal and right side images, is lacking key data to support recognition of that individual relative to an image with a left side facial pose.
Of course, the success rate of facial recognition models in image recognition tasks will decrease when the image assessment is applied to back-of-the-head images with little actual facial data. In such instances, an appearance model that accounts for the texture and shape of the hair can be useful. One such approach is described in the paper “Detection and Analysis of Hair”, by Y. Yacoob and L. Davis, published in IEEE Trans. on PAMI, Vol. 28, pp. 1164-1169, 2006.
An exemplary pose estimation modeling approach is described in “Head Pose Determination From One Image Using a Generic Model”, by Shimizu et al., published in the Proceedings IEEE International Conference on Automatic Face and Gesture Recognition, 1998. In this approach, edge curves (e.g., the contours of eyes, lips, and eyebrows) are first defined for the 3-D model. Next, an input image is searched for curves corresponding to those defined in the model. After establishing a correspondence between the edge curves in the model and the input image, the head pose is estimated by iteratively adjusting the 3-D model through a variety of pose angles and determining the adjustment that exhibits the closest curve fit to the input image. The pose angle that exhibits the closest curve fit is determined to be the pose angle of the input image.
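The search over pose angles described above can be illustrated, in a much simplified form, by the following sketch. It is not the method of Shimizu et al.: instead of edge curves it uses a handful of corresponding 3-D model points and 2-D image points, it varies only the yaw angle, and it uses an orthographic projection with a sum of squared distances as the fit measure. All function names and coordinates are assumptions made for this sketch.

```python
import math


def project(points_3d, yaw_deg):
    """Rotate 3-D model points about the vertical axis by yaw_deg and
    project them orthographically onto the image plane (x, y)."""
    yaw = math.radians(yaw_deg)
    c, s = math.cos(yaw), math.sin(yaw)
    return [(x * c + z * s, y) for x, y, z in points_3d]


def fit_error(projected, observed):
    """Sum of squared distances between corresponding 2-D points."""
    return sum((px - ox) ** 2 + (py - oy) ** 2
               for (px, py), (ox, oy) in zip(projected, observed))


def estimate_yaw(model_points, observed_points, candidates=range(-90, 91)):
    """Try each candidate yaw angle (degrees) and return the one whose
    projection fits the observed points most closely."""
    return min(candidates,
               key=lambda a: fit_error(project(model_points, a),
                                       observed_points))


# Invented generic head model points (x, y, z), for illustration only.
model_points = [
    (-30.0, 0.0, 0.0),   # one eye
    (30.0, 0.0, 0.0),    # other eye
    (0.0, 40.0, 25.0),   # nose tip, protruding toward the camera
    (0.0, 70.0, 10.0),   # mouth
]
# Simulate an input image of the model turned by 25 degrees.
observed_points = project(model_points, 25)
estimated = estimate_yaw(model_points, observed_points)
print(estimated)
```

An exhaustive search in one-degree steps is used here only for clarity; an iterative refinement over several pose angles, as in the cited approach, would avoid evaluating every candidate.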
As implied in the previous discussion, the appearance of people tends to change over time, due to aging, behavioral factors (use of cosmetics, tanning, hair style changes), exercise, health factors, or other reasons. As a result, recognition of individuals in photographs or digital images is impeded, as pre-existing facial or composite models become inaccurate. Presently, the ground truth linkage of identity with image data, and particularly facial image data, requires continuing intermittent input from the users. Although approaches such as that of Gallagher '624 may improve person recognition by updating facial models according to a reasonable schedule, dramatic facial changes between scheduled updates to the models can reduce the success rate. Thus, a method for acquiring ongoing images or image derived data of individuals of known identity, and applying this image data to facial or head recognition models, can enable more robust or persistent identification of the individuals in ongoing or subsequent images. Preferably, such a method would utilize, update, and support one or more facial recognition models, including composite models 360.