One of the major benefits of the increase in computational power has been a steady rise in the number of computer vision applications. Computer vision problems formerly impossible to solve in any reasonable amount of time have become more and more feasible.
Efficiently detecting and classifying objects in an image or video sequence is one of the main challenges of computer vision. Detection consists of giving a one-bit answer to the question “Is object/category x in the image?”.
Several machine learning approaches have been applied to this problem, demonstrating significant improvements in object detection accuracy and speed.
In addition, establishing the mere presence or absence of an object is most often not enough: one also wishes to know its exact location in the image, or even to independently detect and localize the parts of which the object is composed.
As disclosed by P. Dollar et al. (“Cascaded Pose Regression”, IEEE Computer Vision and Pattern Recognition 2010, pp. 1078-1085), in its simplest form localization consists of identifying the smallest rectangular region of the image that contains the searched object. More generally, one wishes to recover the object's “shape”, and more precisely an accurate orientation (orientation also being referred to as the “pose”). Indeed, a change of orientation/pose or of viewpoint leads to a completely different appearance of an object.
Shape refers to the geometric configuration of articulated objects (and of the parts of which they are composed), for example the configuration of the limbs of a human body or the layout of a vehicle. More broadly, shape is any set of systematic and parameterizable changes in the appearance of the object.
For this purpose, landmark estimation methods have been developed.
One such landmark estimation method is the cascaded pose regression (CPR) technique disclosed by P. Dollar, as cited above, also called shape estimation (the term “shape” referring here to the set of landmark locations characterizing the geometry of the face). It is illustrated by FIG. 1 (disclosed by P. Dollar, as cited above), wherein each row 11, 12, 13 shows a test case culled from a different data set.
More precisely, the cascaded pose regression (CPR) is formed by a series of T successive regressors R1, . . . , RT that start from a raw initial shape guess S0 (111) and progressively refine the estimate, outputting the final shape estimate ST (112). The shape S is represented as a series of P part locations Sp = [xp, yp], p ∈ 1, . . . , P. When CPR is applied to facial landmark detection, these parts correspond to facial landmarks. At each iteration, a regressor Rt takes as input a set of features computed on the face area in the current image and produces an update δS, which is then combined with the previous iteration's estimate St-1 to form a new shape.
During learning, each regressor Rt is trained to minimize the difference between the true shape and the shape estimate of the previous iteration St-1. The features available at the input of Rt depend on the current shape estimate and therefore change at every iteration of the algorithm. Such features are known as pose-indexed or shape-indexed features. The key to the CPR technique lies in computing robust shape-indexed features and in training regressors able to progressively reduce the estimation error at each iteration.
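The CPR inference loop described above can be sketched as follows. This is an illustrative sketch only: the function and variable names are not from the cited paper, and the demonstration replaces trained regressors and real shape-indexed features with toy stand-ins that simply shrink the residual toward a fixed target shape.

```python
import numpy as np

def cascaded_pose_regression(image, regressors, s0, compute_features):
    """Run a trained CPR cascade: start from an initial shape guess s0
    and apply the T regressors in sequence, each refining the estimate."""
    s = s0.copy()
    for regressor in regressors:
        # Features are indexed by the *current* shape estimate,
        # so they change at every iteration of the cascade.
        feats = compute_features(image, s)
        delta_s = regressor(feats)   # regressor outputs a shape update
        s = s + delta_s              # combine with the previous estimate
    return s

# Toy demonstration (not a trained model): P = 2 part locations,
# T = 8 stages, each stage halving the residual toward a fixed target.
target = np.array([[10.0, 20.0], [30.0, 40.0]])
def toy_features(image, s):
    return target - s                       # residual stands in for features
toy_regressors = [lambda f: 0.5 * f] * 8    # each "regressor" halves the residual
s_final = cascaded_pose_regression(None, toy_regressors,
                                   np.zeros((2, 2)), toy_features)
```

With these toy stages the estimate converges geometrically toward the target shape, mirroring how each trained regressor Rt is meant to reduce the remaining estimation error.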
The robust cascaded pose regression (RCPR) is an algorithm derived from CPR that deals with occlusions, as disclosed by one of the inventors, X. P. Burgos-Artizzu et al. (“Robust face landmark estimation under occlusion”), IEEE International Conference on Computer Vision, Sydney, 2013. This method requires ground-truth occlusion annotations in the training set. Thus, instead of defining a part location by only its x and y coordinates, a visibility parameter is added, which can be learned at the same time as the part locations.
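The augmented shape representation used by RCPR can be sketched as follows; the array layout and helper name are illustrative assumptions, not the cited paper's code.

```python
import numpy as np

# In RCPR each part carries (x, y, v), where v estimates visibility
# (1 = visible, 0 = occluded). Ground-truth occlusion labels in the
# training set allow v to be regressed jointly with the coordinates.
shape = np.array([
    #  x      y     v
    [120.0,  85.0, 1.0],   # e.g. a visible landmark
    [150.0,  86.0, 1.0],
    [135.0, 110.0, 0.0],   # e.g. an occluded landmark
])

def visible_parts(shape, threshold=0.5):
    """Return only the (x, y) coordinates of parts judged visible."""
    return shape[shape[:, 2] >= threshold, :2]
```

Downstream stages can then weight or discard occluded landmarks rather than trusting their coordinates blindly.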
Usually, such landmark estimation methods are efficient when they are applied to a limited range of orientations/poses of the object around a reference “neutral” pose.
To succeed in additionally estimating the orientation/pose with accuracy, several approaches can be applied.
On the one hand, a first approach consists in applying two successive steps: a first step estimates the orientation/pose of a given test image; a second step then computes the shape using a landmark estimation model, obtained with one of the landmark estimation methods described above, said model being learned during a training phase performed only on training images presenting an orientation/pose similar to that of the given test image.
On the other hand, a second approach consists in obtaining, during a training phase, a different landmark estimation model for each orientation/pose, using an appropriate set of training images for each model, then testing all the resulting landmark estimation models on a given test image and selecting the best performing one on the basis of some automatic or semi-automatic measure.
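The second approach can be sketched as follows; all names are illustrative, and the toy models predict a scalar in place of a full shape. The sketch makes the cost structure visible: one full cascade must be run per pose bin, so test-time cost grows linearly with the number of poses covered.

```python
def best_model_shape(image, models, fit_score):
    """Run every per-pose landmark model on the image and keep the
    shape with the highest fitness score (the selection measure)."""
    best_shape, best_score = None, float("-inf")
    for model in models:              # one full cascade per pose bin
        shape = model(image)
        score = fit_score(image, shape)
        if score > best_score:
            best_score, best_shape = score, shape
    return best_shape

# Toy demonstration: three "models" each predicting a scalar "shape",
# with a fitness measure preferring values close to 2.0.
toy_models = [lambda img: 1.0, lambda img: 2.0, lambda img: 3.0]
picked = best_model_shape(None, toy_models,
                          lambda img, s: -abs(s - 2.0))
```

This linear scan over models is precisely what makes the approach expensive, as noted below.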
The drawback of both approaches is that they incur a very high processing cost and are time-consuming, which is unrealistic for real-time applications.
Thus, there remains a significant need for automatically determining both the shape and the pose/orientation of an object in an image while reducing the processing time and costs.