Field of the Invention
The present invention relates in general to the fields of image processing, computer vision and pattern recognition, in particular to an object shape aligning apparatus, an object processing apparatus and methods thereof.
Description of the Related Art
In the fields of image processing, computer vision and pattern recognition, automatically and precisely aligning an object shape described by a set of feature points (or detecting feature points) is a critical task. It can be widely used, for example, for face recognition, pose recognition, expression analysis, 3D face modelling, face cartoon animation and the like.
Current object shape aligning methods employ either a model-based approach (such as the Active Shape Model (ASM) and the Active Appearance Model (AAM)) or a regression-based approach (such as the Explicit Shape Regression (ESR) and the Supervised Descent Method (SDM)).
Since object shape alignment is naturally a regression problem, regression-based approaches have achieved great progress in recent years. Regression-based approaches usually start by initializing an object shape, and then update the initial object shape to approach the ground truth. Differences between various regression-based approaches mainly lie in the feature extraction step and the regression shape increment prediction step.
Take the SDM as an example. This method estimates the shape increment by minimizing a Non-linear Least Squares (NLS) function. During training, the SDM learns a sequence of descent directions that minimize the mean of NLS functions sampled at different points; during aligning, the SDM minimizes the NLS objective by using the learned descent directions, without computing either the Jacobian or the Hessian.
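The training idea described above may be sketched as follows. This is a minimal sketch and not the SDM implementation itself: it assumes that each descent direction is a linear map fitted by ridge-regularized least squares over training pairs of feature vectors and ground-truth shape increments. The function name `learn_descent_direction`, the regularization weight `lam` and the array layout are hypothetical.

```python
import numpy as np

def learn_descent_direction(features, increments, lam=1e-3):
    """Fit R so that features @ R approximates increments (ridge least squares).

    features:   (num_samples, D) array, one feature vector per training sample
    increments: (num_samples, 2N) array, ground-truth shape minus current shape
    lam:        hypothetical regularization weight (an assumption, not from the text)
    """
    features = np.asarray(features, dtype=float)
    increments = np.asarray(increments, dtype=float)
    gram = features.T @ features + lam * np.eye(features.shape[1])
    # Solve (F^T F + lam * I) R = F^T dS rather than forming an explicit inverse.
    return np.linalg.solve(gram, features.T @ increments)
```

Under these assumptions, no Jacobian or Hessian is evaluated at aligning time; the learned matrix plays the role of the descent direction.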
FIG. 1 schematically shows a flowchart of the SDM. Step 10 belongs to the training procedure, and steps 20 to 40 belong to the aligning procedure.
As shown in FIG. 1, first, at step 10, an object shape regression model, which comprises one regression function (or regressor), is acquired from a plurality of training samples.
Then, at step 20, an initial object shape for an object image is set.
Next, at step 30, one feature vector with respect to a plurality of feature points of the initial object shape is calculated.
More specifically, for example, Scale Invariant Feature Transform (SIFT) features are extracted from local image patches around the plurality of feature points to achieve a robust representation against illumination, and then the extracted SIFT features of the plurality of feature points are assembled into the one feature vector with respect to the plurality of feature points. FIG. 11 schematically shows extracted SIFT feature descriptors (i.e., structural illustration of extracted SIFT features) for three feature points (i.e., the outer eye corners of both eyes and the left mouth corner, which are located at the centers of respective local image patches). In FIG. 11, for example, SIFT features are extracted from an image patch of 4×4 grid around each feature point, and the dimensionality of the extracted SIFT features in each grid is 8. FIG. 12 schematically explains how to get the SIFT feature descriptors with respect to the encircled region of FIG. 11. In FIG. 12, each grid comprises 4×4 pixels for example, and in each pixel, an image gradient can be obtained and is shown as a vector (an arrow with a certain length and pointing to a certain direction). For each grid, a SIFT feature descriptor with a dimensionality of 8 can be obtained from the image gradients therein. FIG. 13 gives an enlarged view of obtained SIFT feature descriptors within the encircled region of FIG. 11, which correspond to the image gradients in FIG. 12. It can be seen from the above that, for each feature point, the dimensionality of the extracted SIFT features can be as high as 4×4×8=128, and thus for the one feature vector with respect to the plurality of feature points, its dimensionality can be as high as 128×(the number of feature points). This means that, in the SDM, the obtained feature vector comprises very rich features, yet has a very high dimensionality.
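The dimensionality argument above may be illustrated with a simplified descriptor. The sketch below is not a full SIFT implementation: it omits Gaussian weighting, interpolation between bins, normalization and orientation alignment, and only builds one 8-bin gradient-orientation histogram for each grid of a 4×4 layout, with each grid covering 4×4 pixels. The function names `patch_descriptor` and `assemble_feature_vector` are hypothetical.

```python
import numpy as np

def patch_descriptor(patch, cells=4, bins=8):
    """Simplified SIFT-like descriptor: one orientation histogram per grid cell."""
    patch = np.asarray(patch, dtype=float)
    gy, gx = np.gradient(patch)                 # image gradient at every pixel
    magnitude = np.hypot(gx, gy)                # arrow length
    angle = np.arctan2(gy, gx) % (2 * np.pi)    # arrow direction in [0, 2*pi)
    bin_index = np.minimum((angle / (2 * np.pi) * bins).astype(int), bins - 1)
    cell = patch.shape[0] // cells              # pixels per grid cell side
    descriptor = np.zeros((cells, cells, bins))
    for r in range(cells):
        for c in range(cells):
            rows = slice(r * cell, (r + 1) * cell)
            cols = slice(c * cell, (c + 1) * cell)
            # Magnitude-weighted histogram of gradient directions in this cell.
            descriptor[r, c] = np.bincount(
                bin_index[rows, cols].ravel(),
                weights=magnitude[rows, cols].ravel(),
                minlength=bins,
            )
    return descriptor.ravel()                   # 4 * 4 * 8 = 128 values

def assemble_feature_vector(image, points, half=8):
    """Concatenate descriptors of 16x16 patches centered on each (x, y) point.

    Points are assumed to lie at least `half` pixels away from the image border.
    """
    parts = []
    for x, y in points:
        x, y = int(round(x)), int(round(y))
        parts.append(patch_descriptor(image[y - half:y + half, x - half:x + half]))
    return np.concatenate(parts)
```

With three feature points, as in FIG. 11, the assembled vector in this sketch has 3×128=384 dimensions.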
Finally, at step 40, for a plurality of coordinates of the feature points of the initial object shape, coordinate increments are predicted based on the obtained one feature vector and the one regression function.
For example, the SDM predicts the coordinate increments of the plurality of coordinates by projecting the one feature vector onto the learned one regression function (i.e., the learned descent directions). This may be represented by the following Expression (1):
ΔS = F * R^t    (1)
where ΔS represents the coordinate increments of the plurality of coordinates, F represents the obtained one feature vector with respect to the plurality of feature points, R^t represents the learned one regression function for a certain aligning process (i.e., the t-th aligning process), and the symbol “*” represents the projection or interaction (such as multiplication, dot product, or the like) of both sides. FIG. 14 gives a structural illustration of Expression (1). It is to be noted that, though F represents the assembled one feature vector with respect to the plurality of feature points, in FIG. 14, for simplicity, only the SIFT feature descriptors for 4 grids of 1 feature point are illustrated. It can be seen from the above that the SDM employs one high dimensional feature vector comprising a plurality of features (i.e., a dense feature set) and one united regression function for the whole object shape to predict the coordinate increments of a plurality of coordinates.
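Expression (1) may be illustrated with a toy numerical example. The sketch below reads the symbol “*” as a vector-matrix product, which is only one of the interpretations listed above, and assumes that the increments are ordered as (x1, y1, x2, y2, ...). The names `predict_increment` and `update_shape` and all numerical values are hypothetical.

```python
import numpy as np

def predict_increment(feature_vector, regressor):
    """Expression (1) read as a vector-matrix product: dS = F @ R^t."""
    return np.asarray(feature_vector, dtype=float) @ np.asarray(regressor, dtype=float)

def update_shape(shape, increment):
    """Add the predicted increments to the (N, 2) array of point coordinates."""
    shape = np.asarray(shape, dtype=float)
    return shape + increment.reshape(shape.shape)

# Toy example: 2 feature points (4 coordinates), 3-dimensional feature vector.
F = np.array([1.0, 0.0, 2.0])
R_t = np.array([[1.0, 0.0, 0.0, 0.0],
                [0.0, 1.0, 0.0, 0.0],
                [0.0, 0.0, 0.5, 0.5]])
S0 = np.array([[10.0, 20.0], [30.0, 40.0]])
S1 = update_shape(S0, predict_increment(F, R_t))
```

In a realistic setting F would have 128×(the number of feature points) entries, so R^t would be a correspondingly large dense matrix.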
Optionally, the aligning process in FIG. 1 can be repeated several times (e.g., T times) so as to approach the ground truth of the object shape step by step (this is why the one regression function in Expression (1) has a superscript “t”). In other words, T cascaded regressors can be employed during aligning. FIG. 2 gives a schematic flowchart of a cascaded SDM. Its main steps are essentially the same as those of FIG. 1, and thus description thereof is omitted here.
However, the SDM has several limitations.
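The repeated aligning process may be sketched as a simple loop over T learned regressors. This is a minimal sketch under the assumption that each stage applies a linear regressor to a freshly extracted feature vector; the function name `cascaded_align` and the `extract_features` callback are hypothetical placeholders.

```python
import numpy as np

def cascaded_align(image, initial_shape, regressors, extract_features):
    """Refine an (N, 2) shape with a sequence of T learned regressors.

    regressors:       list of T arrays, each of shape (D, 2N)
    extract_features: callable (image, shape) -> feature vector of length D
    """
    shape = np.asarray(initial_shape, dtype=float).copy()
    for regressor in regressors:               # stage t uses its own R^t
        features = extract_features(image, shape)
        increment = features @ regressor       # coordinate increments for stage t
        shape = shape + increment.reshape(shape.shape)
    return shape
```

Features are re-extracted around the updated feature points at every stage, which is why each stage needs its own regressor.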
First, since coordinates of the feature points on an object shape are often highly correlated, extracted features often have two or more highly correlated dimensions (known as multicollinearity). This makes it difficult to create an efficient regressor when the number of feature points increases (e.g., to more than 50), and thus makes the model training procedure unstable.
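The effect of multicollinearity on training stability may be illustrated numerically. The sketch below uses synthetic data, not real image features, and uses the condition number of the feature matrix as a proxy for how ill-posed a least-squares fit becomes; the variable names and the noise level are hypothetical.

```python
import numpy as np

def condition_number(design_matrix):
    """Ratio of the largest to the smallest singular value of the feature matrix."""
    return np.linalg.cond(np.asarray(design_matrix, dtype=float))

rng = np.random.default_rng(0)
independent = rng.standard_normal((200, 5))   # 200 samples, 5 unrelated dimensions
correlated = independent.copy()
# Make the last feature dimension almost a copy of the previous one.
correlated[:, 4] = correlated[:, 3] + 1e-4 * rng.standard_normal(200)

cond_independent = condition_number(independent)
cond_correlated = condition_number(correlated)
```

A much larger condition number for the correlated matrix means that small perturbations in the training data can produce large changes in the fitted regressor.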
Second, such a method extracts rich features such as SIFT around each feature point and directly uses the features with thousands of dimensions (containing both useful and useless features) for the sake of better prediction performance. This high dimensional feature vector is highly redundant for the aligning process, and thus makes the model size or dictionary size too large.
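The model size concern may be made concrete with simple arithmetic. The sketch below assumes a dense linear regressor, two coordinates per feature point and no bias term; the function name `linear_regressor_size` and the choice of 68 feature points are illustrative assumptions only.

```python
def linear_regressor_size(num_points, descriptor_dim=128, coords_per_point=2, stages=1):
    """Number of weights in dense linear regressors mapping features to increments."""
    feature_dim = descriptor_dim * num_points   # 128 x (number of feature points)
    output_dim = coords_per_point * num_points  # one increment per coordinate
    return feature_dim * output_dim * stages

weights_per_stage = linear_regressor_size(68)
weights_five_stages = linear_regressor_size(68, stages=5)
```

Under these assumptions, 68 feature points give an 8704-dimensional feature vector and about 1.18 million weights per stage.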
Third, due to the high dimensionality of the feature vector, such a method needs a vast number of training samples during training to avoid the over-fitting problem.
Therefore, it is desired that a new object shape aligning apparatus, a new object processing apparatus and methods thereof, which are capable of dealing with at least one of the above problems, can be provided.