1. Field of the Invention
The present invention relates to an information processing apparatus and method, a recording medium and a program, and particularly to an information processing apparatus and method, a recording medium and a program capable of quickly detecting an object of interest, such as a face image, with a small amount of computation.
2. Description of the Related Art
There have been proposed numerous face detection apparatuses of related art that do not use any motion in a complex image scene but use only a grayscale pattern of an image signal.
FIG. 1 shows an exemplary configuration of a face detection apparatus 1 of related art (see JP-A-2005-284487). The face detection apparatus 1 outputs the position and size of a face indicative of the area of the face in a given image (input image).
An image output section 2 of the face detection apparatus 1 supplies a grayscale image (brightness image) as the input image inputted to the face detection apparatus 1 to a scaling section 3.
The scaling section 3 enlarges or reduces the input image supplied from the image output section 2 to a specified scale and outputs the resultant scaled image to a scanner 4.
Specifically, the input image 10A shown in FIG. 2 is first outputted to the scanner 4 as it is. The input image 10A undergoes processes performed in the scanner 4 and a discriminator 5, which will be described later, and then an input image 10B is generated by reducing the size of the input image 10A. The input image 10B undergoes the processes performed in the scanner 4 and the discriminator 5, and then an input image 10C obtained by further reducing the size of the input image 10B is outputted. Similarly, further reduced images, such as 10D and 10E, are sequentially generated, and this process is terminated when the image size of the reduced image becomes smaller than the size of a window that the scanner 4 moves in the scan operation. Upon the termination of this process, the image output section 2 outputs the next input image to the scaling section 3.
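The sequence of reduced images 10A, 10B, 10C and so on can be illustrated with a minimal sketch. The window size of 24 pixels and the reduction factor of 0.8 are assumptions made only for illustration; the description above does not specify either value, and the function name is hypothetical.

```python
def pyramid_sizes(width, height, window=24, factor=0.8):
    """Yield the (width, height) of each scaled image, starting with the original.

    The loop stops as soon as the reduced image becomes smaller than the
    scan window in either dimension, mirroring the termination condition
    described for the scaling section. `window` and `factor` are assumed values.
    """
    w, h = width, height
    while w >= window and h >= window:
        yield w, h
        # Reduce the image for the next pass (truncating to whole pixels).
        w, h = int(w * factor), int(h * factor)
```

Each yielded size corresponds to one pass through the scanner and the discriminator before the next reduction is made.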
The scanner 4 uses the window of a predetermined size to sequentially scan the scaled input image supplied from the scaling section 3 in the direction, for example, from the upper left to the lower right, and outputs the image in the window as a window image.
Specifically, as shown in FIG. 3, the window 11 having the same size as the window size that the subsequent discriminator 5 accepts is sequentially applied to the whole part (screen) of the given image, such as the image 10A, and the image in the window 11 at each position (hereinafter referred to as a cutout image or a window image) is outputted to the discriminator 5.
The window 11 is moved in the scan operation on a pixel basis as shown in FIG. 4. That is, after the cutout image in the window 11 at a predetermined position is outputted from the scanner 4, in the next scan, the window 11 moves rightward by one pixel and the cutout image in the window 11 at that position is supplied to the discriminator 5.
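The scan operation can be sketched as follows, assuming the image is held as a list of rows of brightness values. The description states only that the window moves rightward by one pixel; the one-pixel vertical step and the function name are assumptions of this sketch.

```python
def scan_windows(image, window=2):
    """Yield (x, y, cutout) for every window position.

    Positions are visited from the upper left to the lower right,
    moving one pixel at a time. `image` is a list of equally long rows.
    """
    height = len(image)
    width = len(image[0]) if height else 0
    for y in range(height - window + 1):
        for x in range(width - window + 1):
            # Cut out the window-sized block whose upper-left corner is (x, y).
            cutout = [row[x:x + window] for row in image[y:y + window]]
            yield x, y, cutout
```

For an image of W by H pixels and a square window of side S, this produces (W − S + 1) × (H − S + 1) cutout images per scale.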
Although the window size is fixed, the scaling section 3 sequentially reduces the input image, so that the image size of the input image is converted into various scales as described above, allowing detection of an object of an arbitrary size.
That is, a face of any size in the input image is sequentially reduced and the image size eventually becomes substantially the same as the window size. As a result, it can be detected whether or not the image in the window 11 is a human face image.
The discriminator 5 refers to the learning result from a group learner 6 that performs group learning on a plurality of weak discriminators that form the discriminator 5 so as to discriminate whether each window image successively moved by the scanner 4 in the scan operation is a face image (object of interest) or an image other than a face image (object of no interest).
As shown in FIG. 5, the discriminator 5 includes a plurality of weak discriminators 21_i (i=1, 2, 3, ..., K) obtained by ensemble learning and an adder 22 that multiplies the outputs (discrimination results) from the weak discriminators by the respective corresponding weights α_i (i=1, 2, 3, ..., K) and sums the products to determine a weighted majority decision F(x).
Each of the weak discriminators 21_1 to 21_K discriminates whether or not the image in the window 11 is a human face image based on two pixels at arbitrary positions among the pixels in the window 11. The reference character K corresponds to the number of combinations of two pixels extractable from the image in the window 11.
Specifically, the difference in brightness between the two pixels (hereinafter referred to as an inter-pixel difference feature) is used as the amount of feature for the purpose of discrimination. The amount of feature of the window image is compared with the amount of feature (threshold value) learned by using learning samples formed of a plurality of grayscale images that have been labeled in advance as the object of interest or an object of no interest, and an estimated value f(x) for estimating whether or not the window image is the object of interest is successively outputted in a deterministic or probabilistic manner.
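To give a sense of the magnitude of K, the number of unordered pixel pairs in a window can be computed directly. The 24 by 24 pixel window used here is an assumed size for illustration only, and the function name is hypothetical.

```python
import math


def count_pixel_pairs(window_width, window_height):
    """Return the number of unordered pairs of distinct pixels in the window."""
    n_pixels = window_width * window_height
    return math.comb(n_pixels, 2)  # n * (n - 1) / 2

# count_pixel_pairs(24, 24) evaluates to 165600 (576 pixels, 576 * 575 / 2 pairs).
```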
For the weak discriminator that outputs a binary value, as in AdaBoost, for example, a threshold value is used to binarize the inter-pixel difference feature so as to discriminate between the object of interest and an object of no interest. Alternatively, as in Real-AdaBoost, for example, the inter-pixel difference feature may be used to output a continuous value indicative of the likelihood of being the object of interest in a probabilistic manner.
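A binary weak discriminator of the kind described above can be sketched as follows. The output values +1 and −1, the direction of the comparison, and all function and parameter names are assumptions of this sketch; the description states only that a threshold value binarizes the inter-pixel difference feature.

```python
def inter_pixel_difference(window, p1, p2):
    """Return the brightness difference between two pixels given as (x, y)."""
    (x1, y1), (x2, y2) = p1, p2
    return window[y1][x1] - window[y2][x2]


def weak_discriminate(window, p1, p2, threshold):
    """Return +1 (object of interest) or -1 (object of no interest).

    Assumed polarity: a difference above the threshold votes for the object.
    """
    return 1 if inter_pixel_difference(window, p1, p2) > threshold else -1
```

Because each evaluation needs only two pixel reads, one subtraction and one comparison, a single weak discriminator is very cheap to compute.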
The information that each weak discriminator 21_i needs for the purpose of discrimination, such as the positions of the two pixels and the amount of feature (threshold value), is learned by the group learner 6 in the learning process according to a predetermined algorithm.
The plurality of the weak discriminators are successively generated by the group learner 6 that uses the above learning samples to perform group learning according to an algorithm, which will be described later. The plurality of the weak discriminators calculate the above estimated values, for example, in the order the weak discriminators are generated.
The adder 22 multiplies the estimated value of each weak discriminator 21_i by its weight, which represents the reliability of that weak discriminator, sums the weighted values, and outputs the sum (the value of the weighted majority decision).
The weight (reliability) for the weighted majority decision is learned by the group learner 6 in the learning process in which the weak discriminators are generated.
In the discriminator 5, as described above, each weak discriminator 21_i successively outputs the estimated value f(x) indicative of whether or not the input window image is a face, and the adder 22 calculates and outputs the weighted majority decision F(x). According to the value of the weighted majority decision F(x), judgment means (not shown) makes the final judgment of whether or not the window image is the object of interest.
During the weighted majority decision process, the discriminator 5 does not wait for the calculation results from all weak discriminators but terminates the calculation partway through when the value calculated up to that point indicates that the window image is not the object of interest.
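The weighted majority decision can be sketched as a weighted sum of the estimated values. The final decision threshold of zero is an assumption borrowed from the usual AdaBoost convention; the description does not state how the judgment means maps F(x) to a decision, and the function names are hypothetical.

```python
def weighted_majority(estimates, weights):
    """Return F(x), the sum of each estimated value times its weight."""
    return sum(alpha * f for alpha, f in zip(weights, estimates))


def is_object_of_interest(estimates, weights, decision_threshold=0.0):
    """Judge the window from F(x); the threshold of 0.0 is an assumed value."""
    return weighted_majority(estimates, weights) > decision_threshold
```

With estimated values of +1 or −1, a weak discriminator with a larger weight contributes more strongly to the sign of F(x).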
Specifically, whenever the plurality of weak discriminators generated in advance in the learning process output estimated values, the estimated values are multiplied by the weights for respective weak discriminators obtained in the learning process and the weighted values are summed to update the value of the weighted majority decision. Whenever the value of the weighted majority decision (evaluation value) is updated, an abort threshold value is used to control whether or not to abort the calculation of estimated values. The abort threshold value (reference value) is learned by the group learner 6 in the learning process.
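The abort control can be sketched as a running sum that is checked after every update. Using one abort threshold value per weak discriminator is an assumption of this sketch, as are the function and parameter names; the description states only that a learned abort threshold value is consulted whenever the evaluation value is updated.

```python
def evaluate_with_abort(window, weak_discriminators, weights, abort_thresholds):
    """Return (evaluation_value, number_evaluated, aborted).

    `weak_discriminators` is a list of callables mapping a window image to
    an estimated value. The calculation stops as soon as the running
    weighted sum falls below the abort threshold for that stage.
    """
    total = 0.0
    stages = zip(weak_discriminators, weights, abort_thresholds)
    for count, (f, alpha, abort_at) in enumerate(stages, start=1):
        total += alpha * f(window)      # update the weighted majority decision
        if total < abort_at:            # judged not to be the object of interest
            return total, count, True
    return total, len(weak_discriminators), False
```

Since most window images in a typical scene are not the object of interest, the majority of windows would be rejected after only a few weak discriminators under this scheme.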
The weak discriminators of the discriminator 5 use a very simple amount of feature, that is, the difference in brightness between two pixels (inter-pixel difference feature), to discriminate between the object of interest and an object of no interest, so that the speed of detecting the object of interest can be increased. In addition, the abort process allows the discriminator 5 to proceed to the discrimination process for the next window image in the course of the calculation without waiting for the calculation results from all weak discriminators, so that the amount of computation in the detection process can be significantly reduced. Together, these effects allow even faster processing.
Furthermore, in the discriminator 5, a plurality of the weak discriminators form a node and a plurality of the nodes are disposed in a tree structure.
FIG. 6 shows an example of the tree structure formed when all the following images are learned as learning samples: an image labeled as a frontward face (an image with a yaw angle ranging from −15 degrees to +15 degrees) (hereinafter referred to as the label W11 image), an image labeled as a leftward face (an image with a yaw angle ranging from +15 degrees to +65 degrees) (hereinafter referred to as the label W21 image), an image labeled as a rightward face (an image with a yaw angle ranging from −65 degrees to −15 degrees) (hereinafter referred to as the label W31 image), an image labeled as a face rotated in the rolling direction from the front position by +20 degrees (hereinafter referred to as the label W12 image), an image labeled as the leftward face rotated in the rolling direction by +20 degrees (hereinafter referred to as the label W22 image), an image labeled as the rightward face rotated in the rolling direction by +20 degrees (hereinafter referred to as the label W32 image), an image labeled as a face rotated in the rolling direction from the front position by −20 degrees (hereinafter referred to as the label W13 image), an image labeled as the leftward face rotated in the rolling direction by −20 degrees (hereinafter referred to as the label W23 image) and an image labeled as the rightward face rotated in the rolling direction by −20 degrees (hereinafter referred to as the label W33 image).
As shown in FIG. 7A, the axis 201 is parallel to the line connecting the human eyes and passes substantially through the center of the human head, and the axis 202 is perpendicular to the axis 201 and passes vertically substantially through the center of the human head. The yaw angle is the angle of rotation about the axis 202. The yaw angle is defined to be negative in the right direction and positive in the left direction.
The rolling angle is the angle of rotation about the axis 203, which is perpendicular to the axes 201 and 202, and is defined to be zero degrees when the axis 201 is horizontal.
A further angle that represents the attitude is the pitch angle. The pitch angle is the upward or downward angle of rotation about the axis 201, and is defined to be positive in the upward direction and negative in the downward direction.
When the tree structure shown in FIG. 6 is not used, for example, a group of weak discriminators 231 shown in FIG. 8 may be required in order to identify one label. The group of weak discriminators 231 has K weak discriminators, that is, weak discriminators 21-11 to 21-1K. These K weak discriminators learn a learning sample of one label.
Therefore, to learn learning samples of, for example, the nine labels, that is, the labels W11 to W13, the labels W21 to W23 and the labels W31 to W33, it is necessary to prepare a group of weak discriminators 231-1 for learning the learning sample of the label W11 as well as groups of weak discriminators 231-2 to 231-9 for learning the learning samples of the respective labels, that is, the label W12, the label W13, the labels W21 to W23 and the labels W31 to W33, as shown in FIG. 9. Each of these groups of weak discriminators 231-2 to 231-9 also has K weak discriminators.
In the tree structure shown in FIG. 6, the number of weak discriminators in a learning path from the most upstream node to the most downstream node is also K at the maximum. However, as described above, when the value of the weighted majority decision, which is obtained by weighting the value resulting from the process performed in each weak discriminator in the discrimination (identification) process and summing the weighted values, is smaller than the abort threshold value, the discrimination (identification) process is aborted at that point. Therefore, the number of weak discriminators that actually perform the process can be reduced.
FIG. 10 diagrammatically shows the above explanation. That is, in this embodiment, although a node 221 is basically formed of weak discriminators 21_1 to 21_100, each weak discriminator 21_i has an abort capability based on the abort threshold value, as shown in FIG. 11. In the figures, reference character Y indicates that the subsequent stage takes over the output, while reference character N indicates that the process is aborted there.
Thus, the discriminator 5 functions as judgment means for not only calculating the weighted majority decision as an evaluation value for judging whether or not the window image is the object of interest but also judging whether or not the window image is the object of interest based on the evaluation value.
The group learner 6 uses group learning to learn in advance the weak discriminators, the weights by which the outputs (estimated values) of the weak discriminators are multiplied, and the like.
As group learning, any specific approach can be applied as long as it can determine the result of a plurality of weak discriminators in a majority decision process. For example, group learning using boosting, such as AdaBoost in which data are weighted to make a weighted majority decision, can be applied.
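As an illustration of boosting with weighted data, the following is a minimal sketch of textbook discrete AdaBoost over inter-pixel difference features. It is not the algorithm of the cited apparatus: the exhaustive search over pixel pairs and thresholds is practical only for very small inputs, abort threshold values are not learned, and all function names, the flat pixel-list sample format and the error clamping constant are assumptions.

```python
import math
from itertools import permutations


def _vote(sample, i, j, threshold):
    """Binary weak discriminator: +1 if pixel i minus pixel j exceeds the threshold."""
    return 1 if sample[i] - sample[j] > threshold else -1


def train_adaboost(samples, labels, rounds=3):
    """Return a list of (i, j, threshold, alpha) tuples, one per round.

    `samples` are flat lists of brightness values; `labels` are +1 or -1.
    """
    n = len(samples)
    data_w = [1.0 / n] * n                      # data weights, initially uniform
    learners = []
    for _ in range(rounds):
        best = None
        # Ordered pairs cover both polarities of the difference.
        for i, j in permutations(range(len(samples[0])), 2):
            for th in sorted({s[i] - s[j] for s in samples}):
                err = sum(w for w, s, y in zip(data_w, samples, labels)
                          if _vote(s, i, j, th) != y)
                if best is None or err < best[0]:
                    best = (err, i, j, th)
        err, i, j, th = best
        err = min(max(err, 1e-10), 1 - 1e-10)   # clamp to keep the log finite
        alpha = 0.5 * math.log((1 - err) / err) # reliability of this learner
        learners.append((i, j, th, alpha))
        # Increase the weight of misclassified samples, then renormalize.
        data_w = [w * math.exp(-alpha * y * _vote(s, i, j, th))
                  for w, s, y in zip(data_w, samples, labels)]
        total = sum(data_w)
        data_w = [w / total for w in data_w]
    return learners


def predict(learners, sample):
    """Weighted majority decision over the learned weak discriminators."""
    score = sum(alpha * _vote(sample, i, j, th) for i, j, th, alpha in learners)
    return 1 if score >= 0 else -1
```

In each round the weak discriminator with the lowest weighted error is selected, and its weight alpha grows as that error shrinks, which is how a reliability value of the kind used by the adder can arise from learning.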