One of the major benefits of the increase in computational power has been a steady rise in the number of computer vision applications. Computer vision problems formerly impossible to solve in any reasonable amount of time have become more and more feasible.
Efficiently detecting and classifying objects in an image or video sequence is one of the main challenges of computer vision. Detection consists of giving a one-bit answer to the question “Is object/category x in the image?”.
Several machine learning approaches have been applied to this problem, demonstrating significant improvements in object detection accuracy and speed.
In addition, merely establishing the presence or absence of objects is often not enough: one also wants to know their exact location in the image, or even to detect and localize independently the parts of which the objects are composed.
As disclosed by P. Dollar et al. (“Cascaded Pose Regression”), IEEE Computer Vision and Pattern Recognition 2010, pp. 1078-1085, localization in its simplest form consists of identifying the smallest rectangular region of the image that contains the searched object; more generally, one wishes to recover the object's “shape”.
Shape refers to the geometric configuration of articulated objects (and of the parts of which they are composed), for example the configuration of the limbs of a human body or the layout of a vehicle. More broadly, shape is any set of systematic and parameterizable changes in the appearance of the object.
To this end, landmark estimation methods have been developed; they require the object to have first been correctly detected in the current image under test.
Among landmark estimation methods, the cascaded pose regression (CPR) technique disclosed by P. Dollar, as cited above, is used for facial landmark detection, also called shape estimation (where the term “shape” refers here to the set of landmark locations characterizing the geometry of the face). This is illustrated by FIG. 1 (disclosed by P. Dollar, as cited above), wherein each row 11, 12, 13 shows a test case culled from three different data sets.
More precisely, the cascaded pose regression (CPR) is formed by a series of T successive regressors R1 . . . RT that start from a raw initial shape guess S0 (111) and progressively refine the estimate, outputting a final shape estimate ST (112). A shape S is represented as a series of P part locations Sp=[xp,yp], p∈1 . . . P. Typically these parts correspond to facial landmarks. At each iteration t, a regressor Rt produces an update δS, which is then combined with the previous iteration's estimate St-1 to form a new shape St.
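The cascade described above can be sketched as follows. This is a minimal illustration and not the implementation of the cited work: the names `run_cascade` and `make_toy_regressor` are hypothetical, the update δS is assumed to be combined by simple addition, and the toy regressor is handed the target shape directly (a trained regressor would instead predict its update from image features).

```python
from typing import Callable, List, Tuple

Shape = List[Tuple[float, float]]              # P part locations [x_p, y_p]
Regressor = Callable[[object, Shape], Shape]   # (image, S_{t-1}) -> update dS


def run_cascade(image, initial_shape: Shape, regressors: List[Regressor]) -> Shape:
    """Refine an initial shape guess S_0 through T successive regressors."""
    shape = list(initial_shape)
    for regress in regressors:
        delta = regress(image, shape)          # update dS from the current estimate
        # Assumption: the update is combined with S_{t-1} by addition.
        shape = [(x + dx, y + dy) for (x, y), (dx, dy) in zip(shape, delta)]
    return shape


def make_toy_regressor(target: Shape, step: float = 0.5) -> Regressor:
    """Toy stand-in for a trained regressor: moves each part a fraction
    `step` of the way towards a known target and ignores the image."""
    def regress(image, shape: Shape) -> Shape:
        return [((tx - x) * step, (ty - y) * step)
                for (x, y), (tx, ty) in zip(shape, target)]
    return regress


if __name__ == "__main__":
    target = [(8.0, 4.0)]
    stages = [make_toy_regressor(target) for _ in range(3)]
    print(run_cascade(None, [(0.0, 0.0)], stages))
```

In this toy setting each stage halves the remaining error, which mirrors the progressive refinement from S0 to ST.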
During learning, each regressor Rt is trained to minimize the difference between the true shape and the shape estimate St-1 of the previous iteration. The available features depend on the current shape estimate and therefore change at every iteration of the algorithm; such features are known as pose-indexed or shape-indexed features. The key to the CPR technique lies in computing robust shape-indexed features and in training regressors able to progressively reduce the estimation error at each iteration.
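To make the notion of shape-indexed features more concrete, the sketch below samples pixel intensities at offsets defined relative to the current part locations, so that the same offsets yield different features as the shape estimate moves. It also computes the residual between the true shape and the previous estimate, which is the quantity a regressor would be trained to predict. The function names, the offset encoding `(part index, dx, dy)` and the border clamping are illustrative assumptions, not the feature design of the cited work.

```python
from typing import List, Sequence, Tuple

Shape = List[Tuple[float, float]]       # P part locations [x_p, y_p]
Offset = Tuple[int, int, int]           # (part index p, dx, dy), hypothetical encoding


def shape_indexed_features(image: Sequence[Sequence[float]],
                           shape: Shape,
                           offsets: List[Offset]) -> List[float]:
    """Sample pixel intensities at positions defined relative to the
    current shape estimate; coordinates are clamped to the image border."""
    height, width = len(image), len(image[0])
    features = []
    for p, dx, dy in offsets:
        x = min(max(int(round(shape[p][0] + dx)), 0), width - 1)
        y = min(max(int(round(shape[p][1] + dy)), 0), height - 1)
        features.append(image[y][x])
    return features


def regression_targets(true_shape: Shape, prev_shape: Shape) -> Shape:
    """Residual between the true shape and the previous estimate S_{t-1}."""
    return [(tx - x, ty - y) for (tx, ty), (x, y) in zip(true_shape, prev_shape)]
```

Because the sampling positions follow the shape estimate, the features must be recomputed at every iteration of the cascade.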
The robust cascaded pose regression (RCPR) is an algorithm derived from CPR that deals with occlusions, as disclosed by one of the inventors, X. P. Burgos-Artizzu et al. (“Robust face landmark estimation under occlusion”), IEEE International Conference on Computer Vision, Sydney 2013. This method requires ground truth annotations for occlusion in the training set. Instead of defining a part location by its x and y coordinates only, a visibility parameter is added, which can be learned at the same time as the part locations. However, the CPR technique, and even the RCPR technique, requires that an object has previously been correctly detected and located in the current image under test.
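The extended part representation used by RCPR, that is, coordinates plus a visibility parameter, can be pictured with the small data structure below. The class name `Part`, the field name `visibility`, the value convention (1.0 for visible, 0.0 for occluded) and the 0.5 threshold are assumptions made for illustration only.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Part:
    x: float
    y: float
    # Assumed convention: 1.0 = fully visible, 0.0 = occluded.
    visibility: float = 1.0


def visible_parts(shape: List[Part], threshold: float = 0.5) -> List[int]:
    """Return the indices of the parts considered visible."""
    return [i for i, part in enumerate(shape) if part.visibility >= threshold]


if __name__ == "__main__":
    shape = [Part(10.0, 20.0), Part(30.0, 20.0, 0.1), Part(20.0, 35.0, 0.9)]
    print(visible_parts(shape))
```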
In other words, such detection establishes whether an object is present in the image and, if so, provides the location of that object in the image.
Shape estimation is then performed. Thus, according to the prior art, detecting an object and determining its shape requires two successive steps, implemented one after the other and each relying on a different approach (i.e. one per step).
Such a prior-art implementation thus has the drawback of slowing down the entire process of determining the shape of an object, and it is complex to implement, since two different approaches, each with its own parameters, have to be taken into account.
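The two-step structure of the prior art can be summarized by the sketch below, in which a detector first decides whether and where the object is present, and a separate shape estimator then runs on the detected region. Both components are toy stand-ins with hypothetical names: the detector returns the bounding box of the non-zero pixels, and the estimator places a single landmark at the box centre.

```python
from typing import Callable, List, Optional, Tuple

Box = Tuple[int, int, int, int]        # (x_min, y_min, x_max, y_max), inclusive
Shape = List[Tuple[float, float]]      # part locations [x_p, y_p]


def toy_detector(image) -> Optional[Box]:
    """Step 1 (toy): bounding box of the non-zero pixels, or None if absent."""
    xs = [x for row in image for x, value in enumerate(row) if value]
    ys = [y for y, row in enumerate(image) if any(row)]
    if not xs:
        return None
    return min(xs), min(ys), max(xs), max(ys)


def toy_shape_estimator(image, box: Box) -> Shape:
    """Step 2 (toy): a single landmark placed at the centre of the box."""
    x0, y0, x1, y1 = box
    return [((x0 + x1) / 2.0, (y0 + y1) / 2.0)]


def detect_then_estimate(image,
                         detector: Callable = toy_detector,
                         estimator: Callable = toy_shape_estimator) -> Optional[Shape]:
    """Run the two steps one after the other; step 2 depends on step 1."""
    box = detector(image)
    if box is None:                    # object absent: no shape can be estimated
        return None
    return estimator(image, box)
```

The shape estimator cannot run until the detector has succeeded, and each component carries its own logic and parameters.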
Thus, there remains a significant need for improving the shape estimation results, while reducing the processing time.