This invention relates generally to image labeling, and more particularly, to a system and method for implementing automatic landmark labeling for a predetermined object class.
Image labeling for training data is an essential step in many learning-based vision tasks. There are at least two types of prior knowledge represented by image labeling. One is semantic knowledge, such as human IDs for face recognition, or an object's name for content-based image retrieval. The other is geometric/landmark knowledge. In learning-based object detection, for example, the position of an object (face/pedestrian/car) needs to be labeled in all training images. Similarly, for supervised face alignment, each training image must be labeled with a set of landmarks that describe the shape of the face.
Geometric/landmark knowledge labeling is typically carried out manually. Practical applications, such as object detection, often require thousands of labeled images to achieve sufficient generalization capability. Manual labeling, however, is labor-intensive and time-consuming. Furthermore, image labeling is an error-prone process due to labeler error, imperfect description of the objectives, and inconsistencies among different labelers.
Some notable early work on unsupervised alignment refers to the process as congealing. The underlying idea is to minimize an entropy-based cost function by estimating the warping parameters of an ensemble of images. More recently, a least squares congealing (LSC) algorithm has been proposed which uses L2 constraints to estimate each warping parameter. These approaches estimate affine warping parameters for each image. The embodiments described herein, by contrast, estimate non-rigid shape deformation described by a large set of landmarks, rather than the relatively simple global affine transformation.
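The congealing idea described above can be illustrated with a minimal sketch. This is a toy under stated assumptions and not the LSC algorithm from the literature: the "images" are 1-D lists of numbers, the warp is a circular integer shift rather than an affine transformation, and each warp parameter is found by exhaustive search rather than by a gradient-based update. The names `shift`, `ssd`, and `congeal` are hypothetical and chosen only for illustration.

```python
def shift(signal, k):
    """Circularly shift a 1-D signal right by k samples (a toy stand-in for a warp)."""
    n = len(signal)
    k %= n
    return signal[-k:] + signal[:-k] if k else list(signal)


def ssd(a, b):
    """Sum of squared differences (an L2 cost) between two equal-length signals."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def congeal(signals, max_shift=3, max_iters=10):
    """Estimate one integer shift per signal so the ensemble is mutually aligned.

    Each pass re-estimates the shift of one signal at a time by minimizing
    its total L2 distance to all other signals as currently warped.
    """
    shifts = [0] * len(signals)
    for _ in range(max_iters):
        changed = False
        for i, sig in enumerate(signals):
            # The rest of the ensemble, warped with the current estimates.
            others = [shift(s, shifts[j]) for j, s in enumerate(signals) if j != i]

            def cost(k):
                warped = shift(sig, k)
                return sum(ssd(warped, o) for o in others)

            # Exhaustive search over candidate shifts; ties favor the smaller shift.
            best = min(range(-max_shift, max_shift + 1),
                       key=lambda k: (cost(k), abs(k)))
            if best != shifts[i]:
                shifts[i] = best
                changed = True
        if not changed:
            break  # No parameter changed in a full pass, so stop.
    return shifts


if __name__ == "__main__":
    base = [0, 0, 1, 3, 1, 0, 0, 0]
    ensemble = [shift(base, 0), shift(base, 1), shift(base, -1)]
    estimated = congeal(ensemble)
    for sig, k in zip(ensemble, estimated):
        print(k, shift(sig, k))
```

Note that a single global parameter per signal, as in this sketch, can only line the signals up as rigid wholes. It cannot express the per-landmark, non-rigid deformation that the embodiments described herein estimate.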
Additional work on unsupervised image alignment has incorporated more general deformation models, for example by including a free-form B-spline deformation model, though without the use of a well-defined set of landmarks. Other approaches that have been developed include bootstrapping algorithms that compute image correspondences and learn a linear model based on optical flow, and iterative Active Appearance Model (AAM) learning and fitting to estimate the locations of mesh vertices, with results reported on images of the same person's face. Further work formulates AAM learning as an expectation-maximization (EM) algorithm and extends it to learning parts-based models for flexible objects. Other known techniques include 1) the use of a group-wise objective function to compute non-rigid registration, 2) improvements in manual facial landmark labeling based on parameterized kernel PCA, 3) a minimum description length (MDL)-based cost function for estimating the correspondences for a set of control points, and 4) alignment by tracking the image sequence with an adaptive template.
Generally, unsupervised learning methods cannot be relied upon to locate landmarks on physically meaningful features of an object, such as the mouth corners, eye corners, or nose tip of a face. Supervised facial alignment, on the other hand, undesirably requires a large number of labeled training images to train a statistical model so that it can generalize and fit unseen images well.
It would be desirable to provide a system and method that automatically provides landmark labeling for a large set of images in a fashion that alleviates the foregoing problems.