The ability to accurately detect features of deformable models is important for a wide range of image processing algorithms and applications. A widely used approach is to use a statistical shape model to regularise the output of independent feature detectors, each trained to locate one model point within two-dimensional (2D) or three-dimensional (3D) image data. One common example of this approach is based on Active Shape Models (ASMs) [1], in which a shape model is fitted to the results of searching around each model point with a suitably trained detector. Active Appearance Models (AAMs) [8] match combined models of shape and texture using an efficient parameter update scheme. Pictorial Structures [3] introduced an efficient method of matching part-based models to images, in which shape is encoded in the geometric relationships between pairs of parts.
Constrained Local Models (CLMs) [4, 5] build on a framework in which a response image is computed for each model point, estimating the quality of fit of that point at each position in the target image. A shape model can then be matched to the data by selecting the overall best combination of points. Note that a response image represents the spatial distribution of the estimated quality of fit across a portion of the original image space.
Belhumeur et al. [9] have shown impressive facial feature detection results using sliding window detectors (SVM classifiers trained on SIFT features) combined with a RANSAC approach to select good combinations of feature points.
The task of a feature detector in such an approach is to compute a (pseudo) probability that a target point from a model occurs at a particular position in an acquired 2D or 3D image. This can be expressed as p(x|I), namely the probability that a given target point from the deformable model is located at position x, given the acquired image information I. (Where a technique returns a quality of fit measure, C, we assume that this can be converted into a pseudo-probability using a suitable transformation.)
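The text does not specify which transformation converts a quality of fit measure C into a pseudo-probability. As a minimal sketch only, the following assumes that higher C means a better fit and that an exponential (softmax-style) normalisation over the response image is acceptable. The function name and the `temperature` parameter are hypothetical and are not taken from the cited work.

```python
import math


def fit_to_pseudo_probability(response, temperature=1.0):
    """Map a 2D grid of quality-of-fit scores C to values that sum to 1.

    Assumptions (not from the source): higher C means a better fit, and an
    exponential normalisation is a suitable transformation.
    """
    flat = [c for row in response for c in row]
    c_max = max(flat)  # subtract the maximum for numerical stability
    weights = [
        [math.exp((c - c_max) / temperature) for c in row] for row in response
    ]
    total = sum(w for row in weights for w in row)
    return [[w / total for w in row] for row in weights]
```

Called on a small response image such as `[[0.0, 1.0], [2.0, 0.0]]`, the result is a grid of the same shape whose entries sum to 1, with the largest value at the position of the best fit.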
In some approaches, for example an ASM, local peaks in the function p(x|I) are taken as candidate positions. In others, for example CLMs and Pictorial Structures, the probabilities for each point are combined with the shape model information to find the best overall match. In this latter approach, the set of local probabilities for the positions of the respective target points in the image is used together with the statistical shape model to determine the mapping of the deformable model onto the image, including a suitable deformation of the model, that has the highest overall probability. In other words, this approach considers how likely it is that (a) a given portion of the image corresponds to a given target point (based on the feature detector), and (b) the shape model as a whole can be deformed into a particular configuration of target points.
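The combination of (a) and (b) can be illustrated with a toy example. The sketch below is not the matching procedure of any of the cited methods: it searches exhaustively over a handful of candidate positions per point and scores each combination by the sum of the detector log-probabilities plus a shape log-prior. The function names, the two-point Gaussian prior, and its parameters `expected_dx` and `sigma` are all hypothetical, and the candidate probabilities are assumed to be strictly positive.

```python
import itertools
import math


def best_configuration(candidates, shape_log_prior):
    """Pick the candidate combination with the highest combined log score.

    candidates: one list per model point of (position, probability) pairs,
        with every probability > 0.
    shape_log_prior: function mapping a tuple of positions to a log prior.
    """
    best_score, best_positions = -math.inf, None
    for combo in itertools.product(*candidates):
        positions = tuple(pos for pos, _ in combo)
        # (a) detector evidence for each point + (b) plausibility of the shape
        score = sum(math.log(p) for _, p in combo) + shape_log_prior(positions)
        if score > best_score:
            best_score, best_positions = score, positions
    return best_positions


def toy_shape_log_prior(positions, expected_dx=10.0, sigma=2.0):
    """Hypothetical two-point prior: point 2 lies about expected_dx right of point 1."""
    (x0, y0), (x1, y1) = positions
    dx, dy = x1 - x0, y1 - y0
    return -((dx - expected_dx) ** 2 + dy ** 2) / (2 * sigma ** 2)


candidates = [
    [((0, 0), 0.6), ((5, 0), 0.4)],    # candidates for point 1
    [((10, 0), 0.4), ((30, 0), 0.6)],  # candidates for point 2
]
match = best_configuration(candidates, toy_shape_log_prior)
```

In this toy setup the detector for the second point slightly prefers (30, 0), but the shape prior makes the pair (0, 0) and (10, 0) the best overall match, which is the kind of regularisation the paragraph above describes.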
A wide variety of feature detectors have been used in such frameworks. These can be broadly classified into three types:
Generative, in which generative models are used, so that p(x|I) ∝ p(I|x).
Discriminative, in which classifiers are trained to estimate p(x|I) directly.
Regression-Voting, in which p(x|I) is estimated by accumulating votes for the position x of the point, given information in nearby regions.
Although there has been a great deal of work matching deformable models using the first two approaches, the Regression-Voting approach has only recently begun to be explored in this context.
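As a minimal sketch of the Regression-Voting idea, the following accumulates votes into a grid that plays the role of an unnormalised estimate of p(x|I). It assumes that some regressor (not shown) has already predicted, for each sampled patch, an offset from the patch position to the target point. The function names and the vote format are hypothetical.

```python
def accumulate_votes(votes, width, height):
    """Accumulate regression votes into a height x width grid.

    votes: iterable of ((px, py), (dx, dy), weight), where (px, py) is the
        patch position, (dx, dy) the predicted offset to the target point,
        and weight the strength of the vote.
    """
    grid = [[0.0] * width for _ in range(height)]
    for (px, py), (dx, dy), weight in votes:
        x, y = round(px + dx), round(py + dy)
        if 0 <= x < width and 0 <= y < height:  # ignore votes outside the image
            grid[y][x] += weight
    return grid


def peak(grid):
    """Return the (x, y) cell holding the largest accumulated vote."""
    cells = ((x, y) for y, row in enumerate(grid) for x in range(len(row)))
    return max(cells, key=lambda c: grid[c[1]][c[0]])


votes = [
    ((2, 2), (3, 1), 1.0),    # votes for (5, 3)
    ((6, 4), (-1, -1), 1.0),  # votes for (5, 3)
    ((5, 0), (0, 3), 1.0),    # votes for (5, 3)
    ((0, 0), (1, 1), 1.0),    # votes for (1, 1)
    ((9, 9), (5, 5), 1.0),    # falls outside the grid and is discarded
]
grid = accumulate_votes(votes, width=10, height=10)
```

Here three of the five patches agree on (5, 3), so `peak(grid)` selects that cell. A practical system would typically smooth the grid or weight the votes by the regressor's confidence, which this sketch omits.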
Regression-based matching: One of the earliest examples of regression-based matching techniques was the work of Covell [10], who used linear regression to predict the positions of points on the face. The original AAM [11] algorithm used linear regression to predict the updates to model parameters. Non-linear extensions include the use of Boosted Regression [12, 13] and Random Forest Regression [14]. The Shape Regression Machine [15] uses boosted regression to predict shape model parameters directly from the image (rather than the iterative approach used in AAMs). Zimmerman and Matas [16] used sets of linear predictors to estimate positions locally, an approach used for facial feature tracking by Ong and Bowden [17]. Dollár et al. [18] use sequences of Random Fern predictors to estimate the pose of an object or part.
Regression-based voting: Since the introduction of the Generalised Hough Transform [19], voting-based methods have been shown to be effective for locating shapes in images, and there have been many variants of this approach. For instance, the Implicit Shape Model [20] uses local patches located on an object to vote for the object position, and Poselets [21] match patches to detect human body parts. Hough Forests [6] use Random Forest regression from multiple sub-regions to vote for the position of an object. This includes an innovative training approach, in which regression and classification training are interleaved to deal with arbitrary backgrounds and where only votes believed to be coming from regions inside the object are counted. This work has shown that objects can be effectively located in an image by pooling votes from Random Forest regressors. Valstar et al. [7] showed that facial feature points can be accurately located using kernel SVM based regressors to vote for each point position combined with pair-wise constraints on feature positions. Girshick et al. [22] showed that Random Forests can be used to vote for the position of joint centres when matching a human body model to a depth image. Criminisi et al. [26] use Random Forest regression to vote for the positions of the sides of bounding boxes around organs in CT images. Dantone et al. [23] have used conditional random forests to find facial features.