Processing for recognizing an object shown in an image, or a state of the object, is referred to as a “recognition task”. The recognition task includes, for example, processing of estimating an orientation of an individual (hereinafter, “posture”) from an image of the individual (for example, a face of a person), and processing of identifying the individual. An example of the recognition task will be described using posture estimation of an object. First, three-dimensional positions of feature points on a three-dimensional shape model of the object are stored in advance. Generally, in a system in which a recognition target individual is newly registered after the system is activated, the feature point positions are shared between all individuals. After the feature points are stored, the positions of the feature points are detected from an image of the recognition target (in the present example, an image showing the object whose posture is to be estimated), and each detected position is associated with the three-dimensional position of the corresponding feature point stored in advance. The posture of the object is then estimated based on the association between the positions of the feature points in the recognition target image and the positions of the feature points on the three-dimensional shape model. A method of estimating a posture of an object based on such an association is known as a solution of the perspective-n-point problem.
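The posture estimation described above can be illustrated with a minimal sketch. It is not a general perspective-n-point solver: it assumes a pinhole camera with a known focal length, a known distance to the object, and a posture limited to a single rotation angle (yaw) about the vertical axis, and it simply searches candidate angles for the one whose projected model points best match the detected image points. The model points, `FOCAL`, `DEPTH`, and all function names are hypothetical values chosen for illustration.

```python
import math

# Hypothetical 3D feature points on a shape model (model coordinates).
MODEL_POINTS = [
    (-1.0, 0.5, 0.0),
    (1.0, 0.5, 0.0),
    (0.0, 0.0, 0.8),
    (-0.6, -0.8, 0.1),
    (0.6, -0.8, 0.1),
]

FOCAL = 800.0  # assumed focal length in pixels
DEPTH = 10.0   # assumed distance from the camera to the model origin


def project(point, yaw):
    """Rotate a model point about the vertical (y) axis, then apply a pinhole projection."""
    x, y, z = point
    c, s = math.cos(yaw), math.sin(yaw)
    xr = c * x + s * z
    zr = -s * x + c * z + DEPTH
    return (FOCAL * xr / zr, FOCAL * y / zr)


def reprojection_error(image_points, yaw):
    """Sum of squared distances between detected and projected feature points."""
    err = 0.0
    for model_pt, (u, v) in zip(MODEL_POINTS, image_points):
        pu, pv = project(model_pt, yaw)
        err += (pu - u) ** 2 + (pv - v) ** 2
    return err


def estimate_yaw(image_points, step_deg=1):
    """Coarse grid search over yaw from -90 to +90 degrees."""
    candidates = [math.radians(d) for d in range(-90, 91, step_deg)]
    return min(candidates, key=lambda yaw: reprojection_error(image_points, yaw))


# Simulate detected 2D feature points for an object turned by 25 degrees.
true_yaw = math.radians(25)
observed = [project(p, true_yaw) for p in MODEL_POINTS]

estimated = estimate_yaw(observed)
print(round(math.degrees(estimated)))  # prints 25
```

A practical system would estimate a full rotation and translation, typically with an iterative or closed-form solver, but the principle is the same: the 2D-3D associations constrain the posture through the projection model.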
In the above processing, the feature points on the three-dimensional shape model stored in advance are set manually, comprehensively taking into account both that each feature point is a part of the recognition target whose position is easily specified in an image, and that the position of this part is important in the recognition task. Manual setting of feature points will be described in more detail. A recognition algorithm that executes a task such as posture estimation or individual identification can generally improve recognition performance as the number of feature points used increases. However, when a great number of feature points are used, the computation amount of the recognition algorithm increases. Further, the computation amount for extracting the feature points from a recognition target image also increases. Hence, narrowing down the number of feature points is practically important. In order to improve the recognition performance in the recognition task using a small number of feature points, it is necessary to determine feature points satisfying the following conditions. The first condition requires that the feature points are important in the recognition task (in other words, that their influence on the accuracy of the recognition algorithm is significant). The second condition requires that the feature points can be accurately extracted from an image. Generally, feature points satisfying both the first condition and the second condition are manually determined from points on the three-dimensional shape model.
Non-Patent Literature 1 discloses, for example, a method of generating feature points based on entropy as a method of automatically determining feature points on the three-dimensional model. However, this technique cannot narrow the feature points down to a small number that is useful for the recognition task while taking both the first condition and the second condition into account. Therefore, the definition of the feature points used for the recognition task is determined manually.
Further, feature points are extracted from a recognition target image by clipping portions from the recognition target image, comparing each portion with a decision pattern learned in advance, and taking the position judged most likely to be a feature point as the feature point position.
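The clip-and-compare extraction can be sketched as an exhaustive window search. This is only an illustration: the sum of squared differences is used here as an assumed stand-in for comparison with a learned decision pattern, and the toy image, the pattern, and the function names are hypothetical.

```python
def clip(image, top, left, size):
    """Clip a size x size portion whose top-left corner is (top, left)."""
    return [row[left:left + size] for row in image[top:top + size]]


def dissimilarity(patch, pattern):
    """Sum of squared differences; a smaller value means a closer match."""
    return sum(
        (p - q) ** 2
        for patch_row, pattern_row in zip(patch, pattern)
        for p, q in zip(patch_row, pattern_row)
    )


def find_feature_point(image, pattern):
    """Compare every clipped portion with the pattern and keep the best position."""
    size = len(pattern)
    best_pos, best_score = None, float("inf")
    for top in range(len(image) - size + 1):
        for left in range(len(image[0]) - size + 1):
            score = dissimilarity(clip(image, top, left, size), pattern)
            if score < best_score:
                best_pos, best_score = (top, left), score
    return best_pos, best_score


# Toy 6x6 grayscale image containing a corner-like structure.
IMAGE = [
    [0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0],
    [0, 0, 9, 9, 9, 0],
    [0, 0, 9, 1, 1, 0],
    [0, 0, 9, 1, 1, 0],
    [0, 0, 0, 0, 0, 0],
]

# Hypothetical decision pattern for the corner (assumed to be learned beforehand).
PATTERN = [
    [9, 9, 9],
    [9, 0, 0],
    [9, 0, 0],
]

position, score = find_feature_point(IMAGE, PATTERN)
print(position, score)  # prints (2, 2) 4
```

The cost grows with the number of window positions multiplied by the number of feature points to be found, which is one reason the number of feature points matters for the computation amount.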
Non-Patent Literature 2 discloses a technique of extracting feature points required to find corresponding points between images according to a SIFT (Scale-Invariant Feature Transform) algorithm. The SIFT algorithm enables blob detection using multiresolution analysis, and association between images utilizing a histogram of shading gradients. According to the SIFT algorithm, a feature amount of each feature point is calculated. By storing in advance the feature amount of a decision pattern obtained by learning, and comparing it with the feature amounts of the feature points extracted from the recognition target image, it is possible to decide whether or not the points are feature points and to extract the positions of the feature points.
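The idea of deciding by comparing feature amounts can be illustrated with a heavily simplified gradient histogram. The sketch below is not the SIFT algorithm: it omits the multiresolution blob detection and the descriptor layout, and computes only a single magnitude-weighted histogram of gradient directions per patch. The number of bins, the decision threshold, and the toy patches are assumptions made for illustration.

```python
import math


def orientation_histogram(patch, bins=8):
    """Normalized, magnitude-weighted histogram of gradient directions (interior pixels)."""
    hist = [0.0] * bins
    for r in range(1, len(patch) - 1):
        for c in range(1, len(patch[0]) - 1):
            dx = patch[r][c + 1] - patch[r][c - 1]
            dy = patch[r + 1][c] - patch[r - 1][c]
            magnitude = math.hypot(dx, dy)
            if magnitude == 0:
                continue
            angle = math.atan2(dy, dx) % (2 * math.pi)
            hist[int(angle / (2 * math.pi) * bins) % bins] += magnitude
    norm = math.sqrt(sum(h * h for h in hist))
    return [h / norm for h in hist] if norm else hist


def distance(a, b):
    """Euclidean distance between two feature amounts."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


THRESHOLD = 0.5  # hypothetical decision threshold


def is_feature_point(patch, stored_feature):
    """Decide by comparing the patch's feature amount with the stored one."""
    return distance(orientation_histogram(patch), stored_feature) < THRESHOLD


# Decision pattern: intensity increasing from left to right.
pattern_patch = [[c for c in range(5)] for _ in range(5)]
stored = orientation_histogram(pattern_patch)

# Same structure under brighter lighting, and a different structure.
brighter_patch = [[2 * c for c in range(5)] for _ in range(5)]
vertical_patch = [[r] * 5 for r in range(5)]

print(is_feature_point(brighter_patch, stored))  # prints True
print(is_feature_point(vertical_patch, stored))  # prints False
```

Because the histogram is normalized, a uniform change in contrast leaves the feature amount unchanged, which is why the brighter patch is still accepted in this toy case.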
Further, many pattern identification methods can be utilized as a technique for making this decision. Non-Patent Literature 3 discloses, for example, Generalized Learning Vector Quantization (GLVQ). Although Non-Patent Literature 3 discloses detecting a pattern of a face, it is possible to detect feature points by changing the pattern from a face to feature points. Further, the SVM (Support Vector Machine) is also known as a machine learning method.
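A decision of this kind can be sketched with a nearest-prototype rule, together with the relative distance measure that GLVQ uses as its training criterion. The sketch covers only the decision step and that measure, not the prototype update during learning. The two-dimensional prototypes and the class labels "feature" and "background" are hypothetical; real feature amounts would have many more dimensions.

```python
def squared_distance(x, w):
    """Squared Euclidean distance between a sample and a prototype."""
    return sum((a - b) ** 2 for a, b in zip(x, w))


# Hypothetical prototypes (reference vectors), assumed to be learned already.
PROTOTYPES = [
    ((1.0, 1.0), "feature"),
    ((0.8, 1.2), "feature"),
    ((-1.0, -1.0), "background"),
    ((-1.2, -0.6), "background"),
]


def classify(x):
    """Assign the label of the nearest prototype."""
    return min(PROTOTYPES, key=lambda p: squared_distance(x, p[0]))[1]


def relative_distance(x, true_label):
    """GLVQ measure (d_same - d_other) / (d_same + d_other).

    The value lies in [-1, 1] and is negative when the nearest prototype
    of the true class is closer than any prototype of another class.
    """
    d_same = min(squared_distance(x, w) for w, label in PROTOTYPES if label == true_label)
    d_other = min(squared_distance(x, w) for w, label in PROTOTYPES if label != true_label)
    return (d_same - d_other) / (d_same + d_other)


sample = (0.9, 0.9)
print(classify(sample))  # prints feature
print(relative_distance(sample, "feature") < 0)  # prints True
```

During learning, GLVQ moves prototypes so as to reduce this measure over the training samples; that update step is omitted here.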
How feature points appear in a recognition target image changes according to the posture of the object and the lighting conditions. To correctly decide whether or not portions clipped from an image correspond to feature points, it is necessary to learn a decision pattern. Hence, multiple learning images of the object are captured under various conditions, and the correct positions of the feature points in these learning images are manually input so that the decision pattern can be learned.
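As a rough illustration of learning from manually labeled images, the sketch below clips a patch at the labeled position in each learning image and averages the patches into a single pattern. Averaging is an assumption chosen for brevity, not the learning method of the cited literature, and the toy images, labeled positions, and function names are hypothetical.

```python
def clip(image, top, left, size):
    """Clip a size x size portion whose top-left corner is (top, left)."""
    return [row[left:left + size] for row in image[top:top + size]]


def learn_pattern(images, positions, size):
    """Average the patches clipped at the manually labeled positions."""
    total = [[0.0] * size for _ in range(size)]
    for image, (top, left) in zip(images, positions):
        patch = clip(image, top, left, size)
        for r in range(size):
            for c in range(size):
                total[r][c] += patch[r][c]
    count = len(images)
    return [[value / count for value in row] for row in total]


# Two toy learning images of the same structure under different lighting,
# with the structure at a different position in each image.
BRIGHT = [
    [0, 0, 0, 0],
    [0, 8, 8, 0],
    [0, 8, 2, 0],
    [0, 0, 0, 0],
]
DARK = [
    [0, 0, 0, 0],
    [0, 0, 0, 0],
    [0, 0, 4, 4],
    [0, 0, 4, 0],
]

# Manually input correct positions (top, left) of the feature point patch.
LABELS = [(1, 1), (2, 2)]

pattern = learn_pattern([BRIGHT, DARK], LABELS, size=2)
print(pattern)  # prints [[6.0, 6.0], [6.0, 1.0]]
```

Every learning image needs its own labeled position, so the manual input effort grows with the number of learning images multiplied by the number of feature points.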