One of the major benefits of the increase in computational power has been a steady rise in the number of computer vision applications. Computer vision problems formerly impossible to solve in any reasonable amount of time have become more and more feasible.
Efficiently detecting and classifying objects in an image or video sequence is one of the main challenges of computer vision. Detection consists of giving a one-bit answer to the question “Is object/category x in the image?”.
Several machine-learning approaches have been applied to this problem, demonstrating significant improvements in object detection accuracy and speed.
In addition, merely establishing the presence or absence of an object is often not enough: one also wishes to know its exact location in the image, or even to detect and localize independently the parts of which the object is composed.
As disclosed by P. Dollar et al., “Cascaded Pose Regression”, IEEE Computer Vision and Pattern Recognition 2010, pp. 1078-1085, localization in its simplest form consists of identifying the smallest rectangular region of the image that contains the searched object; more generally, one wishes to recover the object's “shape”.
Shape refers to the geometric configuration of articulated objects (and the parts of which they are composed), for example the configuration of the limbs on a human body or the layout of a vehicle. More broadly, shape is any set of systematic and parameterizable changes in the appearance of the object.
For this purpose, landmark estimation methods have been developed. Among such methods, the cascaded pose regression (CPR) technique disclosed by P. Dollar et al., as cited above, is used for facial landmark detection, also called shape estimation (where the term “shape” here refers to the set of landmark locations characterizing the geometry of the face).
More precisely, the cascaded pose regression (CPR) is formed by a cascade of T regressors R1 . . . RT that start from a raw initial shape guess S0 and progressively refine the estimation, outputting a final shape estimate ST. A shape S is represented as a series of P part locations Sp=[xp,yp], p ∈ {1 . . . P}. Typically these parts correspond to facial landmarks. At each iteration t, a regressor Rt produces an update δS, which is then combined with the previous iteration's estimate St-1 to form a new shape estimate St.
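The cascade described above can be illustrated with a minimal sketch. It assumes that the update δS is combined with the previous estimate by simple addition of coordinates, which is only one possible way of combining them. The names apply_cascade and make_toy_regressor are hypothetical and introduced purely for illustration; the toy regressor does not look at the image at all and merely stands in for a trained regressor.

```python
from typing import Callable, List, Tuple

Point = Tuple[float, float]                   # one part location [x_p, y_p]
Shape = List[Point]                           # P part locations
Regressor = Callable[[object, Shape], Shape]  # hypothetical: (image, S_{t-1}) -> delta S


def apply_cascade(image, initial_shape: Shape, regressors: List[Regressor]) -> Shape:
    """Refine an initial shape guess S_0 through regressors R_1 ... R_T."""
    shape = list(initial_shape)
    for regressor in regressors:
        # Each regressor sees the image and the current estimate S_{t-1}.
        delta = regressor(image, shape)
        # Assumption: the update is combined with the estimate by addition.
        shape = [(x + dx, y + dy) for (x, y), (dx, dy) in zip(shape, delta)]
    return shape


def make_toy_regressor(target: Shape, step: float = 0.5) -> Regressor:
    """Toy stand-in for a trained regressor: moves a fraction of the way to a known target."""
    def regressor(image, shape: Shape) -> Shape:
        return [((tx - x) * step, (ty - y) * step)
                for (x, y), (tx, ty) in zip(shape, target)]
    return regressor
```

With three such toy regressors, a shape initialized at the origin moves half of the remaining distance towards the target at each stage, which mimics the progressive reduction of the estimation error along the cascade.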
During learning, each regressor Rt is trained to minimize the difference between the true shape and the shape estimate of the previous iteration St-1. The available features depend on the current shape estimate and therefore change at every iteration of the algorithm; such features are known as pose-indexed or shape-indexed features. The key to the CPR technique lies in computing robust shape-indexed features and in training regressors able to progressively reduce the estimation error at each iteration.
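To make the notion of shape-indexed features more concrete, the following sketch samples pixel intensities at positions defined relative to the current part locations, so that the same sampling pattern yields different features when the shape estimate changes. The functions shape_indexed_features and regression_target, the offset format and the nearest-pixel sampling are all assumptions made for illustration; they are not the feature design of the cited CPR work.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float]
Shape = List[Point]
Image = Sequence[Sequence[float]]  # image[row][col], assumed grayscale
Offset = Tuple[int, float, float]  # hypothetical: (part index, dx, dy)


def shape_indexed_features(image: Image, shape: Shape,
                           offsets: Sequence[Offset]) -> List[float]:
    """Sample intensities at positions expressed relative to the current shape."""
    height, width = len(image), len(image[0])
    features = []
    for part, dx, dy in offsets:
        x, y = shape[part]
        # Nearest-pixel sampling, clamped to the image borders.
        col = min(max(int(round(x + dx)), 0), width - 1)
        row = min(max(int(round(y + dy)), 0), height - 1)
        features.append(image[row][col])
    return features


def regression_target(true_shape: Shape, previous_shape: Shape) -> Shape:
    """Residual between the true shape and S_{t-1}, usable as a training target."""
    return [(tx - x, ty - y)
            for (tx, ty), (x, y) in zip(true_shape, previous_shape)]
```

Because the sampling positions follow the shape estimate, the features have to be recomputed at every stage of the cascade, both during training and at test time.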
The robust cascaded pose regression (RCPR) is an algorithm derived from CPR that deals with occlusions, as disclosed by one of the inventors, X. P. Burgos-Artizzu et al., “Robust face landmark estimation under occlusion”, IEEE International Conference on Computer Vision, Sydney 2013. This method requires ground truth annotations for occlusion in the training set. Thus, instead of defining a part location by only its x and y coordinates, a visibility parameter is added, which can be learned at the same time as the part locations.
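The extended part representation can be sketched as a small data structure. The class name Part, the boolean encoding of the visibility parameter and the helper visible_parts are assumptions made for illustration only; the text above merely states that a visibility parameter is added to the x and y coordinates.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Part:
    """One part location extended with a visibility parameter."""
    x: float
    y: float
    visible: bool = True  # assumption: visibility encoded as a boolean flag


def visible_parts(shape: List[Part]) -> List[Part]:
    """Return only the parts annotated (or estimated) as visible."""
    return [part for part in shape if part.visible]
```

A training annotation for a partially occluded face would then list every landmark, with the occluded ones flagged as not visible rather than omitted.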
However, the CPR technique, and even the RCPR technique, do not always succeed in correctly estimating the object's shape, especially when dealing with faces that are very challenging in terms of pose and occlusions.
Currently, such object shape estimation failures need to be detected manually by an operator, which is a tedious and time-consuming process.
Thus, there remains a significant need for automatically classifying the results provided by automatic shape estimation methods into good or bad results.