1. Field of Invention
The present invention relates to an active appearance model (AAM) machine and method. More specifically, it relates to an AAM approach whose training phase creates multiple smaller AAMs, which together can align an input test image more quickly than a single large AAM, and which can align a larger range of input test images than is typical by providing better support for outlier true examples of a class of object.
2. Description of Related Art
In the field of computer vision, it is generally desirable that an image not only be captured, but that a computer be able to identify and label various features within the captured image. Basically, a goal of computer vision is for the computer to “understand” the content of a captured image.
Various approaches to identifying features within a captured image are known. Early approaches centered on the concept of identifying shapes. For example, if a goal was to identify a specific item, such as a wrench or a type of wrench, then a library of the different types of acceptable wrenches (i.e. “true examples” defined as images of “true” wrenches) would be created. The outline shapes of the wrenches within these true examples would be stored, and a search for the acceptable shapes would be conducted on a captured image. This approach of shape searching was successful when one had an exhaustive library of acceptable shapes, the library was not overly large, and the subject of the captured images did not deviate from the predefined true shapes.
For complex searches, however, this approach is not effective. The limitations of this approach become readily apparent when the subject being sought within an image is not static, but is prone to change. For example, a human face has definite characteristics, but does not have an easily definable number of shapes and/or appearances it may adopt. It is to be understood that the term appearance is herein used to refer to color and/or light differences across an object, as well as other surface/texture variances. The difficulty in understanding a human face becomes even more acute when one considers that it is prone to shape distortion and/or change in appearance within the normal course of human life due to changes in emotion, expression, speech, age, etc. It is self-apparent that compiling an exhaustive library of human faces and their many variations is a practical impossibility.
Recent developments in image recognition of objects that change their shape and appearance, such as a human face, are discussed in “Statistical Models of Appearance for Computer Vision”, by T. F. Cootes and C. J. Taylor (hereinafter Cootes et al.), Imaging Science and Biomedical Engineering, University of Manchester, Manchester M13 9PT, U.K. email: t.cootes@man.ac.uk, http://www.isbe.man.ac.uk, Mar. 8, 2004, which is hereby incorporated in its entirety by reference.
Cootes et al. explain that in order for a machine to be able to understand what it “sees”, it must make use of models that describe and label the expected structure being imaged. In the past, model-based vision has been applied successfully to images of man-made objects, but its use has proven more difficult in interpreting images of natural subjects, which tend to be complex and variable. The main problem is the variability of the subject being examined. To be useful, a model needs to be specific, that is, it should represent only true examples of the modeled subject. To identify a variable object, however, the model needs to be general and represent any plausible true example of the class of object it represents.
Recent developments have shown that this apparent contradiction can be handled by statistical models that can capture specific patterns of variability in shape and appearance. It has further been shown that these statistical models can be used directly in image interpretation.
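As an orientation aid, a statistical shape model of the kind described by Cootes et al. is commonly written as a mean shape plus a weighted sum of modes of variation obtained by principal component analysis of the training shapes. The symbols below follow that general convention and are assumptions introduced here for illustration; the limit of three standard deviations is a typical choice rather than a requirement.

```latex
x \approx \bar{x} + P\,b, \qquad |b_i| \le 3\sqrt{\lambda_i}
```

Here x is the vector of landmark coordinates of a shape, \bar{x} is the mean shape of the training set, the columns of P are the retained modes of variation, b is the vector of shape parameters, and \lambda_i is the variance of the i-th parameter over the training set. The model is specific because b is bounded, and general because any b within the bounds yields a plausible shape.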
To facilitate the application of statistical models, subjects to be interpreted are typically separated into classes. This permits the statistical analysis to use prior knowledge of the characteristics of a particular class to facilitate its identification and labeling, and even to overcome confusion caused by structural complexity, noise, or missing data.
Additionally, in order to facilitate further processing of identified and labeled subjects within a captured image, it is beneficial for the identified subject to be transformed into (i.e. be fitted onto) a “model” or “canonical” shape of the class of object being sought. Preferably, this model, or canonical, shape would be of predefined shape and size, and have an inventory of labels identifying characteristic features at predefined locations within the predefined shape. For example, although the human face can vary widely, it can be conformed to a standard shape and size. Once conformed to the standard shape and size, the transformed face can then be further processed to determine its expression, its gaze direction, the individual to whom the face belongs, etc.
A method that uses this type of alignment is the active shape model. With reference to FIG. 1, the active shape model uses a predefined model face 1A and a list of predefined deformation parameters, each having corresponding deformation constraints, so that the model face can be stretched and moved in an attempt to align it with a subject image 2. Equivalently, the list of predefined deformation parameters may be applied to subject image 2, which is then moved and deformed in an attempt to align it with model face 1A. This alternate approach has the added benefit that as subject image 2 is being aligned with model face 1A, it is simultaneously being fitted to the shape and size of model face 1A. Thus, once alignment is complete, the fitted image is already in a preferred state for further processing.
For illustrative purposes, FIG. 1 shows model face 1A being fitted to subject face 2; the example is deliberately exaggerated. It is to be understood that a typical model face 1A would have constraints regarding its permissible deformation points relative to other points within itself. For example, if aligning the model face meant moving its left eye up one inch and moving its right eye down one inch, then the resultant aligned image would likely not be a human face, and thus such a deformation would typically not be permissible.
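One common way to express such deformation constraints, in the spirit of the statistical shape models of Cootes et al., is to limit each deformation parameter to a few standard deviations of its value over the training data. The following is a minimal sketch under that assumption and is not the method of any particular machine; the function name `clamp_shape_params`, the default limit of three standard deviations, and the sample numbers are all hypothetical.

```python
import math

def clamp_shape_params(params, variances, k=3.0):
    """Clamp each deformation parameter to +/- k standard deviations.

    params    -- proposed deformation parameters (hypothetical values)
    variances -- variance of each parameter over the training set
    k         -- assumed limit, expressed in standard deviations
    """
    clamped = []
    for b, var in zip(params, variances):
        limit = k * math.sqrt(var)
        # Pull the parameter back to the nearest permissible value.
        clamped.append(max(-limit, min(limit, b)))
    return clamped

# A made-up proposal whose second and third parameters are implausibly large.
proposal = [0.5, 10.0, -7.0]
variances = [1.0, 4.0, 1.0]
constrained = clamp_shape_params(proposal, variances)
```

Clamping is only one possible policy. An implementation could instead reject an out-of-range proposal outright, which corresponds more directly to treating the deformation as not permissible.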
In the example of FIG. 1, model face 1A is first placed roughly within the proximity of predefined points of interest, and typically placed near the center of subject face 2, as is illustrated in image 3. By comparing the amount of misalignment resulting from moving model face 1A in one direction or another, and the results of adjusting a size multiplier in any of several predefined directions, one can determine how to better align model face 1A, as illustrated in image 4. An objective would be to align, as closely as possible, predefined landmarks such as the pupils, nostrils, mouth corners, etc., as illustrated in image 5. Eventually, after a sufficient number of such landmark points have been aligned, the subject image 2 is warped onto model face 1A, resulting in a fitted image 6 of predefined shape and size with identified and labeled points of interest (such as outlines of eye features, nose features, mouth features, cheek structure, etc.) that can be further processed to achieve specific objectives.
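The translation and size adjustments described above can be illustrated with a closed-form least-squares fit, which is a simplification swapped in here for clarity: it assumes the corresponding landmark positions in the subject image are already known, whereas the procedure of FIG. 1 searches for them iteratively. The sketch below covers only translation, rotation, and uniform scale, not the non-rigid deformation. All function names and landmark coordinates are hypothetical.

```python
import math

def fit_similarity(model_pts, subject_pts):
    """Least-squares scale, rotation and translation mapping model_pts
    onto subject_pts. Both are lists of (x, y) in corresponding order."""
    p = [complex(x, y) for x, y in model_pts]
    q = [complex(x, y) for x, y in subject_pts]
    n = len(p)
    p_mean = sum(p) / n
    q_mean = sum(q) / n
    # Centre both landmark sets so that translation drops out of the fit.
    pc = [z - p_mean for z in p]
    qc = [z - q_mean for z in q]
    # The complex number a encodes scale (magnitude) and rotation (phase).
    num = sum(u.conjugate() * v for u, v in zip(pc, qc))
    den = sum(abs(u) ** 2 for u in pc)
    a = num / den
    t = q_mean - a * p_mean
    return a, t

def apply_similarity(a, t, pts):
    """Apply z -> a*z + t to each (x, y) point."""
    out = []
    for x, y in pts:
        z = a * complex(x, y) + t
        out.append((z.real, z.imag))
    return out

def mean_landmark_error(pts_a, pts_b):
    """Average Euclidean distance between corresponding landmarks."""
    total = sum(math.hypot(ax - bx, ay - by)
                for (ax, ay), (bx, by) in zip(pts_a, pts_b))
    return total / len(pts_a)

# Made-up landmarks: the subject is the model scaled by 2, rotated by
# 90 degrees, and shifted by (5, 3).
model = [(0.0, 0.0), (2.0, 0.0), (1.0, 1.0), (1.0, 2.0)]
subject = [(5.0, 3.0), (5.0, 7.0), (3.0, 5.0), (1.0, 5.0)]
a, t = fit_similarity(model, subject)
aligned = apply_similarity(a, t, model)
error = mean_landmark_error(aligned, subject)
```

In a full search, a residual such as `error` would be re-evaluated after each small adjustment of position, size, and deformation, and adjustments that reduce it would be kept.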
This approach, however, does not take into account changes in appearance, such as changes in shadow, color, or texture. A more holistic, or global, approach that jointly considers the object's shape and appearance is the Active Appearance Model (AAM). Although Cootes et al. appear to focus primarily on the gray-level (or shade) feature of appearance, they do describe a basic principle that AAM searches for the best alignment of a model face (including both model shape parameters and model appearance parameters) onto a subject face while simultaneously minimizing misalignments in shape and appearance. In other words, AAM applies knowledge of the expected shapes of structures, their spatial relationships, and their gray-level appearance (or more generally color value appearance, such as RGB values) to restrict an automated system to plausible interpretations. Ideally, AAM is able to generate realistic images of sought objects. An example would be a model face capable of generating convincing images of any individual, such as by changing the individual's expression. AAM achieves this by formulating interpretation as a matching problem: given an image to interpret, structures are located and labeled by adjusting the model's parameters in such a way that it generates an “imagined image” that is as similar as possible to a plausible variation.
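The matching problem can be stated compactly. The notation below is a sketch in the general style of Cootes et al.; the symbol names are assumptions introduced here and do not appear in the text above.

```latex
\delta g(p) = g_{\mathrm{image}}(p) - g_{\mathrm{model}}(p), \qquad
p^{*} = \arg\min_{p} \, \lVert \delta g(p) \rVert^{2}
```

Here p collects the model's shape, appearance, and pose parameters, g_model(p) is the appearance (for example gray-level or RGB values) generated by the model for those parameters, and g_image(p) is the appearance sampled from the input image under the shape that p implies. The search adjusts p, within the limits the model allows, until the residual \delta g is as small as possible.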
Although AAM is a useful approach, implementation of AAM still poses several challenges. For instance, as long as the AAM machine manages to find a “fit” within its defined parameters, it will assume that the fitted image is a match (i.e. a true example of a plausible variation). However, there is no guarantee that the fitted image is in fact a true example.
In other words, even if an AAM machine appears to have aligned a subject input image with a model image, the resulting aligned image may not be a true representation of the class of object being sought. For example, if the initial position of the model image is too far misaligned from the subject input image, the model image may be aligned incorrectly on the subject input image. The warped output image would then be a distorted, untrue representation of the subject.
Other limitations of an AAM machine relate to the computing complexity required to apply statistical analysis to a training library of true samples, in order to define distinguishing parameters and define the parameters' permissible distortions. By the nature of the applied statistical analysis, the results will permit alignment only with a fraction of the images within the training library. If the class of object being sought is prone to wide variation, it may not be possible to properly align a shape model image or an appearance model image to an input subject image that has characteristics beyond a norm defined by the statistical analysis. This is true even of images within the training library from which the shape model image and appearance model image are constructed. Typically, the constructed model image will be capable of being aligned to only 90% to 95% of the sample images within a training library.
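One way to see how a statistically defined norm can exclude some of the very samples it was built from is to count how many training samples have all of their parameters within the permitted limits. The sketch below assumes that the limits are expressed as a number of standard deviations per parameter, which is only one possible reason an alignment can fail; the function names and the toy numbers are hypothetical and are not measured results.

```python
import math

def within_limits(params, variances, k=3.0):
    """True if every parameter lies within +/- k standard deviations."""
    return all(abs(b) <= k * math.sqrt(var)
               for b, var in zip(params, variances))

def training_coverage(samples, variances, k=3.0):
    """Fraction of training samples the bounded model can represent."""
    covered = sum(1 for s in samples if within_limits(s, variances, k))
    return covered / len(samples)

# Toy parameter vectors for five training samples (made-up values).
# The last sample is an outlier true example in its first parameter.
samples = [
    [0.2, -0.5],
    [-1.0, 0.3],
    [1.5, 1.0],
    [-0.7, -2.0],
    [4.0, 0.1],
]
variances = [1.0, 1.0]
coverage = training_coverage(samples, variances, k=3.0)
```

In this toy case the outlier sample falls outside the limits, so it could not be represented by the bounded model even though it is a true example of the class.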