The ability to find objects, and in particular the shape of objects, in images is important for a large number of applications. These applications include object detection, recognition, classification, verification, and tracking. Objects need to be found in photographs as well as in medical imagery and video. Specific examples of such applications include identifying the locations of facial features for portrait retouching and red-eye removal, locating the boundary of the lungs or the borders of the breast in x-ray images for computer-aided diagnosis, and eye tracking in video for immersive displays.
A useful way to identify the shape of an object in an image is by locating a set of feature points. These points are often designated to indicate the positions of semantically meaningful or readily recognizable locations. Examples include the center of an eye or the tip of a nose, or a series of points that indicate a contiguous border such as the outline of a face.
Early methods for detecting feature points sought to identify each feature point in isolation. One such method is proposed in the paper by Pentland et al., “View-Based and Modular Eigenspaces for Face Recognition,” Proc. IEEE Conf. Computer Vision and Pattern Recognition, pp. 84-91, 1994. In their work, Pentland et al. create a model of the expected appearance at a feature point using a principal components analysis (PCA) of a set of ground truth images. This model describes the space of expected appearances at a feature point by the mean appearance, the primary modes of appearance variation, and the expected range along these modes. Feature locations are found by investigating various image positions and selecting the one with the lowest distance to feature space (i.e., the minimal error between the appearance at the position and the closest appearance realizable using the model).
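The distance-to-feature-space search described above can be illustrated with a minimal sketch. It assumes the PCA has already been performed, so that the mean appearance and a set of orthonormal modes are available as plain vectors; the function names, the toy four-pixel patches, and the candidate positions are hypothetical and are not taken from Pentland et al. The sketch also omits the expected range along each mode and measures only the reconstruction error.

```python
def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def distance_from_feature_space(patch, mean, modes):
    """Squared error between a patch and its closest reconstruction
    from the appearance model (mean plus orthonormal modes)."""
    centered = [p - m for p, m in zip(patch, mean)]
    # Coordinates of the patch along each mode of variation.
    coeffs = [dot(mode, centered) for mode in modes]
    # Closest appearance the model can realize (relative to the mean).
    reconstruction = [
        sum(c * mode[i] for c, mode in zip(coeffs, modes))
        for i in range(len(centered))
    ]
    return sum((c - r) ** 2 for c, r in zip(centered, reconstruction))


def best_position(candidates, mean, modes):
    """Return the candidate position whose patch is closest to feature space."""
    return min(
        candidates,
        key=lambda pos: distance_from_feature_space(candidates[pos], mean, modes),
    )


# Toy model (assumed values): flat mean patch, one mode that changes brightness.
mean = [1.0, 1.0, 1.0, 1.0]
modes = [[0.5, 0.5, 0.5, 0.5]]  # unit length

# Hypothetical patches sampled at two image positions.
candidates = {
    (10, 12): [3.0, 3.0, 3.0, 3.0],  # brighter but otherwise as expected
    (11, 12): [1.0, 2.0, 1.0, 0.0],  # structure the model cannot explain
}
best = best_position(candidates, mean, modes)
```

In this toy case the uniformly brighter patch lies entirely within the span of the model and is selected, while the second patch leaves a residual that the single mode cannot account for.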
The location of each feature point can provide useful information about the positions of the other feature points. Finding each feature point individually fails to take advantage of this and generally leads to less reliable results. Modern methods for finding objects therefore incorporate a model of the shape of the object. This model can be used to constrain the results for individual feature points so that they conform to the expected shape of the entire object.
A popular method that employs such a shape model is described in Cootes et al., “Active Shape Models—Their Training and Application,” Computer Vision and Image Understanding, Vol. 61, No. 1, pp. 38-59, 1995. In the active shape model technique, the positions of feature points are manually annotated on a set of ground truth images of an object. These feature locations are analyzed using PCA to develop a model of the shape. This model indicates the plausible relative positions of the feature points and the variability of these positions as an interdependent set. At each feature point an independent model of the local appearance around the point is also created. In order to automatically find an object in an image, a search is performed for each feature point to find the position that best matches the expected local appearance of that feature. The global shape model is then used to constrain the results of the local searches. This process repeats until the shape converges upon a stable result.
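The step in which the global shape model constrains the results of the local searches can be sketched as follows. This is an illustration under stated assumptions rather than the procedure of Cootes et al.: the shape is a flattened list of point coordinates already aligned to the model frame, the PCA modes are orthonormal, and each shape coefficient is clamped to plus or minus `k` standard deviations, where the default `k = 3` is an assumed limit. The name `constrain_shape` and the two-point example are hypothetical.

```python
import math


def constrain_shape(shape, mean_shape, modes, eigenvalues, k=3.0):
    """Project a shape onto the PCA shape model and clamp each coefficient
    to +/- k standard deviations (k is an assumed limit)."""
    centered = [s - m for s, m in zip(shape, mean_shape)]
    coeffs = []
    for mode, variance in zip(modes, eigenvalues):
        b = sum(p * d for p, d in zip(mode, centered))  # coefficient on this mode
        limit = k * math.sqrt(variance)
        coeffs.append(max(-limit, min(limit, b)))
    # Rebuild the nearest plausible shape from the clamped coefficients.
    return [
        m + sum(b * mode[i] for b, mode in zip(coeffs, modes))
        for i, m in enumerate(mean_shape)
    ]


# Toy model (assumed values): two points stored as (x1, y1, x2, y2);
# the single mode moves the points apart horizontally.
s = math.sqrt(0.5)
mean_shape = [0.0, 0.0, 10.0, 0.0]
modes = [[-s, 0.0, s, 0.0]]
eigenvalues = [1.0]

# Hypothetical output of the independent local searches: the points are
# too far apart and carry vertical offsets the model does not allow.
found = [-5.0, 2.0, 15.0, -2.0]
constrained = constrain_shape(found, mean_shape, modes, eigenvalues)
```

In a full search, this constraint step would alternate with the local appearance searches until the shape stops changing; only the constraint step is shown here.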
A number of other techniques have been suggested for finding objects using local appearance matching and shape model constraints. The use of deformable templates was suggested in the paper by Yuille et al., “Feature Extraction from Faces using Deformable Templates,” IEEE Conf on Computer Vis. and Pat. Recog., pp. 104-109, 1989. Deformable templates use a parameterized shape model and an energy minimization technique to find the best match of the shape model to the local appearance of the image. U.S. Pat. No. 6,222,939 (Wiskott et al.) suggests the use of labeled bunch graphs for object detection. A labeled bunch graph models the local appearance at feature points using the response of Gabor wavelets and uses spring-like connections between the feature points to enforce an elastic shape constraint.
Methods have also been proposed to find the shape of objects using the global appearance of objects. The methods previously described use independent models of the local appearance at each feature point in order to perform matching at those points. However, methods based on the global appearance of an object use a model of the appearance across the entire object in order to simultaneously infer the locations of all feature points.
A popular method based on the global appearance of objects is described in Cootes et al., “Active Appearance Models,” Proc. European Conf. on Computer Vision 1998, H. Burkhardt and B. Neumann Eds., Vol. 2, pp. 484-498, 1998. As in the Active Shape Model technique, feature points are manually annotated on a set of ground truth images of an object. PCA is performed on the locations of these points to develop a compact parameterized shape model. The ground truth images are then warped to the average shape and the appearance across the entire object is analyzed using PCA. This generates a parameterized model of the global appearance of the object that is largely independent of shape. By varying the model parameters and using multivariate linear regression, the algorithm learns how to adjust the parameters of the models to match an object based upon the residual error. In order to find an object in an image, this matching process is repeated until convergence, after which the parameters of the shape model can be used to infer the locations of the feature points. This method is used for object classification, verification, and synthesis in WO Patent No. 01/35326 A1.
Various other techniques have also been proposed for finding feature points based on the global appearance of objects. U.S. Pat. No. 5,774,129 (Poggio et al.) describes a method that uses interleaved shape and texture matching. A shape-normalized appearance model is constructed as in the Active Appearance Model technique. Objects are found in an image by using optic flow to determine the shape transformation between the object and a prototype with average shape and appearance. The object is then warped to the average shape and its appearance is constrained to the limits of the appearance model. The constrained appearance then forms the new target for the optic flow alignment and the process repeats. After the process converges, the shape transformation can be used to infer the positions of feature points. U.S. Pat. No. 6,188,776 (Covell et al.) proposes the use of a coupled affine manifold model. Given an aligned object, this model enables the positions of the feature points to be directly inferred. An appearance-only model is suggested to initially align the object.
Methods that seek to find feature points using independent local models of appearance fail to take advantage of the coherent appearance at the feature points. For instance, within a given face there is a consistent hair and skin color that can be shared across numerous feature points. The appearance at a given feature point can be a strong indication of the correctness of the match at surrounding feature points. Methods that find feature points using models of appearance that are global across the entire object are able to take advantage of this coherence; however, global appearance models weight all positions within the object equally. Equal weighting ignores the fact that some areas of an object have higher information content about the shape of the object than do others. For instance, the edges around the eyes and the border of the face convey more shape information than do the uniform areas on the cheeks and forehead. Methods that are based on global appearance preclude the sort of engineering decisions that are inherent in the local appearance methods. In the local appearance methods, the algorithm designer must decide which areas of the object have the highest information content and place feature points at those positions in order to obtain an accurate result. What is needed is a method that both exploits the coherent appearance across an object and still enables special emphasis to be placed on selected positions on the object.