The described embodiments relate generally to face model fitting, and more particularly to fitting face models to images from video sources.
Model-based image registration and alignment is used to recognize facial images in computer systems. A facial image may, for example, be compared to a database to identify an image. The facial image is often manipulated (aligned) to allow comparison to the database.
One method for registering and aligning a facial image uses active appearance models (AAM). Face alignment using AAM enables facial feature detection, pose rectification, and gaze estimation. Some facial images are received from a video source that includes multiple images. It is desirable for an AAM method to incorporate images from a video source.
Conventional methods for fitting an AAM to video sequences fit the AAM directly to each frame, using the fitting results of the previous frame (i.e., its shape parameters and appearance parameters) as the initialization for the current frame. Fitting the face of an unseen subject may be difficult due to a mismatch between the appearance of the facial images used for training the AAM and that of the video sequence. This difficulty is especially evident when the subject is exposed to varying illumination. The conventional method also registers each frame with respect to the AAM, without registering frame-to-frame across a video sequence.
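The conventional frame-by-frame scheme described above can be sketched as a simple loop. This is only an illustration: `fit_sequence` and the `fit_frame` callable are hypothetical names, and any single-image AAM fitter could be passed in as `fit_frame`.

```python
from typing import Any, Callable, Iterable, List, Tuple

# (shape parameters p, appearance parameters lambda); hypothetical container type
Params = Tuple[List[float], List[float]]


def fit_sequence(
    frames: Iterable[Any],
    fit_frame: Callable[[Any, List[float], List[float]], Params],
    init: Params,
) -> List[Params]:
    """Fit every frame on its own, seeding it with the previous frame's result."""
    results: List[Params] = []
    p, lam = init
    for frame in frames:
        # The only link between frames is the initialization (p, lam).
        p, lam = fit_frame(frame, p, lam)
        results.append((p, lam))
    return results
```

Note that nothing in the loop compares one frame with another; each frame is registered only against the model, which is the limitation noted above.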
The shape model and appearance model of an AAM are trained from a database of representative facial images. The distribution of facial landmarks is modeled as a Gaussian distribution, which is regarded as the shape model. The procedure for training a shape model is as follows. Given a face database, each facial image is manually labeled with a set of 2D landmarks, [x_i, y_i], i=1, 2, . . . , v. The collection of landmarks of one image is treated as one observation from the random process defined by the shape model, s=[x_1, y_1, x_2, y_2, . . . , x_v, y_v]^T. Eigen-analysis is applied to the observation set and the resulting linear shape model represents a shape as,
$$s(p) = s_0 + \sum_{i=1}^{n} p_i s_i \qquad (1)$$
where s_0 is the mean shape, s_i is the i-th shape basis, and p=[p_1, p_2, . . . , p_n] are the shape parameters. By design, the first four shape basis vectors represent global rotation and translation. Together with the other basis vectors, they define a mapping function W(x;p) from the model coordinate system to the coordinates in the image observation, where x is a pixel coordinate defined by the mean shape s_0.
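As a rough illustration of Equation (1), the eigen-analysis can be sketched with NumPy. The function names are hypothetical, the eigen-analysis is carried out with a singular value decomposition of the centered observations, and the sketch makes no attempt to reserve the first four basis vectors for global rotation and translation.

```python
import numpy as np


def train_shape_model(shapes, n_basis):
    """Return (s0, basis) from an (N, 2v) array of labeled shapes.

    Each row is one observation [x1, y1, ..., xv, yv].
    """
    shapes = np.asarray(shapes, dtype=float)
    s0 = shapes.mean(axis=0)  # mean shape
    # Rows of vt are orthonormal eigenvectors of the sample covariance,
    # ordered by decreasing variance.
    _, _, vt = np.linalg.svd(shapes - s0, full_matrices=False)
    return s0, vt[:n_basis]


def synthesize_shape(s0, basis, p):
    """Evaluate s(p) = s0 + sum_i p_i * s_i."""
    return s0 + np.asarray(p, dtype=float) @ basis
```

Setting all shape parameters to zero returns the mean shape, and varying a single parameter moves the landmarks along the corresponding basis vector.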
After the shape model is trained, each facial image is warped into the mean shape using a piecewise affine transformation. These shape-normalized appearances from all training images are fed into an eigen-analysis and the resulting model represents an appearance as,
$$A(x;\lambda) = T(x) + \sum_{i=1}^{m} \lambda_i A_i(x) \qquad (2)$$
where T is the mean appearance, A_i is the i-th appearance basis, and λ=[λ_1, λ_2, . . . , λ_m] are the appearance parameters. FIG. 1 illustrates an example of an AAM trained using a subject from a 3D face database. A linear shape model 102 is shown above an appearance model 104.
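Equation (2) can be illustrated in the same way. The sketch below assumes the shape-normalized appearance has been flattened into a vector of pixel values and that the rows of the basis are orthonormal, as an eigen-analysis produces; the function names are hypothetical.

```python
import numpy as np


def synthesize_appearance(T, A, lam):
    """Evaluate A(x; lambda) = T(x) + sum_i lambda_i * A_i(x).

    T is the mean appearance (n_pixels,), A holds one basis per row (m, n_pixels).
    """
    return T + np.asarray(lam, dtype=float) @ A


def project_appearance(T, A, a):
    """Appearance parameters that best explain appearance a.

    Assumes the rows of A are orthonormal, so the least-squares solution
    reduces to an inner product with each basis.
    """
    return A @ (np.asarray(a, dtype=float) - T)
```

Projecting a synthesized appearance back onto the basis recovers the parameters that generated it.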
An AAM can synthesize facial images with arbitrary shape and appearance within the range expressed by the training population. Thus, the AAM can be used to analyze a facial image by finding the optimal shape and appearance parameters such that the synthesized image is as similar to the image observation as possible. This leads to the cost function used for model fitting,
$$J(p,\lambda) = \sum_{x \in s_0} \left[ I(W(x;p)) - A(x;\lambda) \right]^2 \qquad (3)$$
which is, up to normalization by the number of pixels, the mean-square-error (MSE) between the image warped from the observation, I(W(x;p)), and the synthesized appearance model instance, A(x;λ). Traditionally this minimization problem is solved by iterative gradient-descent methods that estimate Δp and Δλ and add them to p and λ.
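The additive update of p and λ can be illustrated on a one-dimensional toy problem. The sketch below is not an AAM fitter: it assumes a translation-only warp W(x;p) = x + p, a single appearance basis `A1`, linear interpolation of the image, and a Gauss-Newton step as one member of the gradient-descent family. All names are hypothetical.

```python
import numpy as np


def residual(image, u, x, T, A1, p, lam):
    """I(W(x;p)) - A(x;lam) for the translation-only warp W(x;p) = x + p.

    image holds samples of I at coordinates u; x are the model coordinates.
    """
    return np.interp(x + p, u, image) - (T + lam * A1)


def cost(image, u, x, T, A1, p, lam):
    """Sum of squared residuals, as in Equation (3)."""
    return float(np.sum(residual(image, u, x, T, A1, p, lam) ** 2))


def fit(image, u, x, T, A1, p=0.0, lam=0.0, n_iters=20, eps=1e-2):
    """Estimate (delta_p, delta_lam) each iteration and add them to (p, lam)."""
    for _ in range(n_iters):
        r = residual(image, u, x, T, A1, p, lam)
        # dr/dp: central difference of the warped image; dr/dlam: -A1.
        d_image = (np.interp(x + p + eps, u, image)
                   - np.interp(x + p - eps, u, image)) / (2.0 * eps)
        jac = np.stack([d_image, -A1], axis=1)
        # Gauss-Newton step: least-squares solution of jac @ delta = -r.
        delta, *_ = np.linalg.lstsq(jac, -r, rcond=None)
        p, lam = p + delta[0], lam + delta[1]
    return p, lam
```

In this toy setting the image is a shifted copy of a model instance, so a successful fit recovers the shift as p and the basis weight as λ.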
The Simultaneous Inverse Compositional (SIC) method is an image alignment algorithm that minimizes a distance between the warped image observation and the generic AAM during the fitting. The SIC method, however, is applied to video images by fitting each frame of a video independently of the other frames in the video. A method that registers images frame-to-frame across a video sequence is desired.