1. Field of the Invention
The present invention relates to an information processing apparatus enabling a pattern discriminator to learn and a method thereof.
2. Description of the Related Art
In recent years, in the field of pattern recognition, a method has drawn attention in which weak discriminators are cascade-connected to configure a pattern discriminator and to perform speedy detection processing on an object, such as a human face in an image. For example, in a method proposed by P. Viola and M. Jones, in “Rapid Object Detection using a Boosted Cascade of Simple Features”, Proc. IEEE Conf. on Computer Vision and Pattern Recognition, Vol. 1, pp. 511-518, December 2001, firstly, a predetermined number of weak discriminators which extract a rectangular feature are cascade-connected to configure a strong discriminator referred to as a stage.
This weak discriminator is generated by a boosting learning algorithm (refer to Japanese Patent Application Laid-Open No. 8-329031). Further, the above-described method proposes a pattern discriminator having a structure in which a plurality of stages are cascade-connected. Since processing proceeds while termination determination processing (termination of processing at a detection target position in the image) is performed in each stage, which is a strong discriminator, further operations on an input determined to be a non-detection target are terminated at an early point. Thus, processing can be performed at high speed as a whole. A method for discriminating a pattern will be described in detail below.
As illustrated in FIG. 1, the pattern discriminator described in the above-described literature moves a rectangular region 801 having a certain specified size (hereafter referred to as a “processing window”) within a face detection target image 800, which is a processing target, and determines whether the processing window 801 includes a human face at each destination.
FIG. 2 illustrates a flow of face detection processing which is performed in the processing window 801 at each destination, as discussed in the above-described literature. The face detection processing in a certain processing window is performed in a plurality of stages. In each stage, weak discriminators having different combinations are allocated and cascade-connected to generate a strong discriminator.
Each weak discriminator detects a so-called Haar-like feature and includes a combination of rectangular filters. As illustrated in FIG. 2, each stage has a respective different number of weak discriminators. Each stage unit is configured by the cascade connection and performs the determination processing according to the order of connection. For example, in FIG. 2, a second stage follows a first stage for determination, and then a third stage follows the second stage therefor.
Each stage determines, according to the order assigned thereto, whether the processing window includes a human face by using the weak discriminators of the patterns assigned to that stage. When a certain stage determines that the processing window does not include a human face at a position, the following stages do not perform the determination processing on the processing window at that position (cascade processing is terminated). When the last stage determines that the processing window includes a human face, it is determined that the processing window includes a human face at this destination.
FIG. 3 is a flowchart illustrating an example of face detection processing. A flow of the face detection processing will be described more specifically with reference to FIG. 3.
In step S1001 of the face detection processing, the processing window 801 to be processed is placed on the face detection target image 800. Basically, as illustrated in FIG. 1, the processing window scans the face detection target image 800 from an edge at a certain interval in the vertical and horizontal directions, and a position to be processed is selected. For example, the position of the processing window is selected by raster scanning the face detection target image 800.
Subsequently, it is determined whether the processing window at the selected position includes a human face. The determination processing is performed using a plurality of stages as illustrated in FIG. 2. In step S1002, the stage for performing the determination processing is selected from the first stage in order.
In step S1003, the selected stage performs the determination processing. When it is determined in the selected stage that a summed score does not exceed a threshold value predetermined for each stage (NO in step S1004), then in step S1008, it is determined that the processing window does not include a human face, and the processing proceeds to step S1007 and following steps thereof. The processing in step S1007 and the following steps will be described below.
When the summed score exceeds the threshold value predetermined for each stage (YES in step S1004), it is determined in step S1005 whether the determination processing in step S1003 was performed in the last stage. When it was not performed in the last stage (NO in step S1005), the processing returns to step S1002, and the following stage is selected to perform the determination processing. When it was performed in the last stage (YES in step S1005), then in step S1006, it is finally determined that the current processing window includes a human face.
Further, in step S1007, it is determined whether the processing window is the last processing window in the face detection target image. When the processing window is not determined to be the last processing window (NO in step S1007), the processing returns to step S1001, and the following processing window is selected to perform the processing of step S1002 and the following steps thereof. When the processing window is determined to be the last processing window (YES in step S1007), the face detection processing of the input face detection target image ends.
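The flow of FIG. 3 can be sketched as follows. This is only a minimal illustration in Python, not the implementation of the above-described literature: the function name `detect_faces`, its parameters, and the interface of a stage (a callable that receives the window position and returns whether the stage threshold was exceeded) are assumptions introduced for the example.

```python
def detect_faces(image_width, image_height, window_size, step, stages):
    """Scan the image in raster order and run the stage cascade at each position.

    `stages` is an ordered list of callables taking (x, y) and returning True
    when the summed score of the stage exceeds its threshold (hypothetical
    interface, assumed for this sketch).
    """
    detections = []
    for y in range(0, image_height - window_size + 1, step):
        for x in range(0, image_width - window_size + 1, step):  # S1001
            includes_face = True
            for stage in stages:                                  # S1002
                if not stage(x, y):                               # S1003, S1004
                    includes_face = False                         # S1008
                    break                                         # later stages are skipped
            if includes_face:                                     # S1005 -> S1006
                detections.append((x, y))
    return detections                                             # S1007: last window done


# Toy stages that record every call so that early termination is visible.
calls = []

def make_stage(name, accepted_positions):
    def stage(x, y):
        calls.append(name)
        return (x, y) in accepted_positions
    return stage

stages = [make_stage("stage1", {(0, 0), (2, 0)}), make_stage("stage2", {(0, 0)})]
faces = detect_faces(image_width=6, image_height=2, window_size=2, step=2, stages=stages)
```

In this toy run, the window at (4, 0) is rejected by the first stage, so the second stage is never evaluated for it; this is the early termination that makes the cascade fast.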
The processing of the determination for each stage will be described below.
The weak discriminators of one or more patterns are assigned to each stage. A boosting learning algorithm such as AdaBoost assigns the weak discriminators in learning processing. Each stage determines whether the processing window includes a human face based on the weak discriminators of the patterns assigned to that stage.
In each stage, feature quantities are calculated in each of a plurality of rectangular regions in the processing window based on the weak discriminator having each pattern assigned to each stage. The feature quantity acquired herein is a total value or an average value of pixel values in each rectangular region, that is, a calculated value using a total value of the pixel values in the rectangular region. The total value in the rectangular region can be calculated at high speed by using summed area table information (referred to as “SAT” or “Integral Image”) of the input image.
FIGS. 4A and 4B illustrate an example of SAT. FIG. 4A illustrates an original input image. An upper left point is defined as an origin (0, 0). When the pixel value at a coordinate position (x, y) in the input image (FIG. 4A) is defined as I(x, y), the element C(x, y) of SAT at the coordinate position (x, y) is defined by equation (1) below.
$$C(x, y) = \sum_{x' \le x,\; y' \le y} I(x', y') \qquad (1)$$
More specifically, as illustrated in FIG. 4B, the total value of the pixels in the rectangle having the pixels of the origin (0, 0) and the position (x, y) as the opposing corners in the input image (FIG. 4A) is the value C(x, y) at the position (x, y) in SAT (FIG. 4B). The sum of the pixel values I(x, y) in an arbitrary rectangular region in the input image (FIG. 4A) can be acquired by referring to only four points of SAT (FIG. 4B).
For example, as illustrated in FIG. 5, the total sum C(x0, y0; x1, y1) of the pixel values in the rectangular region having (x0, y0) and (x1, y1) as the opposing corners is acquired by using equation (2) below.
$$C(x_0, y_0; x_1, y_1) = C(x_0 - 1, y_0 - 1) - C(x_0 - 1, y_1) - C(x_1, y_0 - 1) + C(x_1, y_1) \qquad (2)$$
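Equations (1) and (2) can be illustrated with a short Python sketch. The function names `build_sat`, `sat_at`, and `rect_sum` are assumptions introduced for this example, as is the convention that C is treated as 0 when x or y equals −1 (that is, for a rectangle touching the top or left edge of the image).

```python
def build_sat(image):
    """Return the SAT, where sat[y][x] = sum of image[y'][x'] for x' <= x, y' <= y."""
    height = len(image)
    width = len(image[0]) if height else 0
    sat = [[0] * width for _ in range(height)]
    for y in range(height):
        row_sum = 0  # running sum of the current row up to column x
        for x in range(width):
            row_sum += image[y][x]
            sat[y][x] = row_sum + (sat[y - 1][x] if y > 0 else 0)
    return sat


def sat_at(sat, x, y):
    """C(x, y); assumed to be 0 outside the image (x or y equal to -1)."""
    if x < 0 or y < 0:
        return 0
    return sat[y][x]


def rect_sum(sat, x0, y0, x1, y1):
    """Equation (2): sum over the rectangle with inclusive corners (x0, y0), (x1, y1)."""
    return (sat_at(sat, x0 - 1, y0 - 1) - sat_at(sat, x0 - 1, y1)
            - sat_at(sat, x1, y0 - 1) + sat_at(sat, x1, y1))


image = [[1, 2, 3],
         [4, 5, 6],
         [7, 8, 9]]
sat = build_sat(image)
lower_right = rect_sum(sat, 1, 1, 2, 2)  # 5 + 6 + 8 + 9 = 28
```

Once the SAT has been built in a single pass over the image, every rectangle sum costs four table lookups regardless of the rectangle size, which is why rectangular features can be evaluated at high speed.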
A relative value of the calculated feature quantities (for example, a ratio or a difference value; herein, the difference value of the feature quantities is assumed) is calculated, and it is determined whether the processing window includes a human face based on the difference value. More specifically, it is determined whether the calculated difference value is larger or smaller than the threshold value set for the weak discriminator of the pattern used for the determination. According to the determination result, it is determined whether the processing window includes a human face.
However, the determination at this point is a determination by the weak discriminator of each pattern, not a determination by the stage. As described above, in each stage, the determination processing is performed separately by each of the weak discriminators of the assigned patterns to obtain the respective determination results.
Next, the summed score in the stage is calculated. A reliability weight (score) is separately assigned to the weak discriminator of each pattern. The reliability weight represents the “certainty of determination” of a single weak discriminator and is a fixed value indicating the reliability of that weak discriminator alone.
When it is determined that the processing window includes a human face, the score corresponding to the weak discriminator of the pattern used for the determination is referred to and added to the summed score of the stage. As described above, the total sum of added individual scores is calculated as the summed score in the stage.
More specifically, this summed score is a value indicating the certainty of the determination (reliability for the entire stage). When the reliability for the entire stage exceeds a predetermined threshold (threshold of the entire stage reliability), it is determined that the processing window possibly includes a human face in this stage, and the processing continues to proceed to the following stage. When the reliability for the entire stage in this stage does not exceed the threshold value, it is determined that the processing window does not include a human face, and the processing terminates the following cascade processing.
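The determination within one stage can be sketched as follows. This is a minimal Python illustration under stated assumptions: the class `WeakDiscriminator`, the function `evaluate_stage`, and the toy features are hypothetical, and the direction of the comparison (a weak discriminator votes for a face when its difference value is larger than its threshold) is assumed for the example, since it depends on the learned pattern.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple

Window = Sequence[Sequence[int]]


@dataclass
class WeakDiscriminator:
    feature: Callable[[Window], float]  # difference value of rectangle sums
    threshold: float                    # threshold set for this weak discriminator
    score: float                        # reliability weight of this weak discriminator


def evaluate_stage(weak_discriminators: List[WeakDiscriminator],
                   stage_threshold: float,
                   window: Window) -> Tuple[bool, float]:
    """Return (stage passed, summed score) for one processing window."""
    summed_score = 0.0
    for weak in weak_discriminators:
        # Assumed direction: a value above the threshold is a vote for a face.
        if weak.feature(window) > weak.threshold:
            summed_score += weak.score
    return summed_score > stage_threshold, summed_score


# Toy 2x2 window with two hypothetical rectangular features.
def top_minus_bottom(w: Window) -> float:
    return sum(w[0]) - sum(w[1])


def left_minus_right(w: Window) -> float:
    return (w[0][0] + w[1][0]) - (w[0][1] + w[1][1])


weaks = [WeakDiscriminator(top_minus_bottom, threshold=10.0, score=0.7),
         WeakDiscriminator(left_minus_right, threshold=5.0, score=0.4)]
passed, total = evaluate_stage(weaks, stage_threshold=0.5, window=[[9, 9], [1, 1]])
```

Here only the first weak discriminator votes for a face, so the summed score is its reliability weight alone; because that weight exceeds the stage threshold, the window would proceed to the following stage.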
The above-described literature realizes speedy pattern identification, typified by face detection. The detection unit illustrated in FIGS. 2 and 3 can also be used as a pattern discriminator for objects other than faces, provided that appropriate learning is performed in advance.
Further, Japanese Patent Application Laid-Open Nos. 2004-185611 and 2005-44330 also discuss techniques relating to a method and an apparatus for discriminating a pattern based on the idea of the above-described literature. The pattern discriminator having the structure in which the weak discriminators are cascade-connected in one line as described above can provide a sufficient and speedy identification ability when a pattern similar to a face (detection target pattern) is separated from other patterns (non-detection target patterns) in the image.
However, for example, when the detection target pattern is a face image, even if the face keeps facing front, if the face tilts some ten degrees right or left (in-plane rotation), the face cannot be “very similar” to an original upright front face. Additionally, if the face is rotated in an axial direction in which the front face changes to a side face (depth rotation or depth rotation in the horizontal direction), the face becomes a two-dimensional image pattern, which is different from the original face.
It is impossible to identify such a widely varying pattern with weak discriminators cascade-connected in one line. The cascade-connected structure of the weak discriminators is used for gradually eliminating non-detection target patterns which are not similar to the detection target pattern to be identified. Thus, the patterns to be identified need to be very similar to each other.
When only the in-plane rotation occurs, if the input image is input while being rotated sequentially, a discriminator that detects a nearly upright front face can identify the face at any angle within 360 degrees. However, this method increases the processing time as the number of rotations increases. When the depth rotation is added, such a discriminator cannot handle the variation.
Z. Zhang, L. Zhu, S. Z. Li, and H. Zhang, “Real-Time Multi-View Face Detection”, Proceedings of the Fifth IEEE International Conference on Automatic Face and Gesture Recognition (FGR'02), discusses a discriminator having a hierarchical pyramid structure based on a Coarse to Fine strategy.
In this discriminator, in a first layer, learning image patterns including all variations of the face direction to be detected are input to learn the stage. In a second layer, the variation of the face direction is divided into predetermined ranges, and a plurality of stages are learned, each by the learning image patterns including only the corresponding divided variation.
In the following layer, the range is further divided into smaller ranges to learn the stages. As described above, as the layers advance, strong discriminators (stages) having gradually decreasing robustness are arranged like a pyramid. This discriminator can divide only the variation of the face direction caused by the depth rotation in the horizontal direction. The discriminator divides the entire ±90 degree range of the depth rotation into three in the second layer and into nine in the third layer, but does not process the in-plane rotation.
When the detection processing is performed, if an input sub window passes the stage in the first layer, the stages in the second layer are performed sequentially. When the sub window passes any one of these stages, it proceeds to the stages in the following layer. As described above, the discriminator can detect the face patterns of all variations by starting with a rough detection and performing gradually more accurate detections.
Japanese Patent Application Laid-Open No. 2005-284487 discusses a method for constituting the discriminator having a tree structure in which the detection units having the high robustness are gradually divided and a sub-window image is input into the detection unit having the lower robustness. This discriminator learns to process a part of the divided range of the variation which a parent node processes. The variation of the face direction in an exemplary embodiment of this method includes the depth rotation in the vertical direction in which the face moves up and down from a front position as well as the depth rotation in the horizontal direction.
After the detection processing of a first node including all depth rotations in the vertical and horizontal directions is performed, the variation of the face directions is divided into three, which are the front face and the two faces rotated right and left by the depth rotation. The faces are further divided in the vertical direction by the depth rotation in the following layer. Only the variation of the front face by the rotation in the vertical direction is divided in the following layer. The branch structure as described above is predetermined and a great number of pieces of sample data corresponding to each variation are input to learn the branches.
Unlike the literature by Zhang et al., the method discussed in Japanese Patent Application Laid-Open No. 2005-284487 does not need to perform operations for the variations in a lower layer that are included in a variation terminated in an upper layer, and thus speedy performance can be realized. The weak discriminator discussed in Japanese Patent Application Laid-Open No. 2005-284487 uses a pixel difference rather than a rectangular difference. However, Japanese Patent Application Laid-Open No. 2005-284487 and the literature by Zhang et al. share the idea that the weak discriminators constitute the strong discriminator by the cascade connection.
C. Huang, H. Ai, Y. Li, and S. Lao, “Vector Boosting for Rotation Invariant Multi-View Face Detection”, Tenth IEEE International Conference on Computer Vision (ICCV2005), Volume 1, 17-21 Oct. 2005, pp. 446-453, discusses another learning method for a discriminator having a tree structure similar to that of Japanese Patent Application Laid-Open No. 2005-284487.
The variations that the discriminator described in the above literature can process are the in-plane rotation and the depth rotation in the horizontal direction. From the node including all variations in the first layer, the depth rotation in the horizontal direction is divided into five in two stages, and then each of the rotational variations is further divided into three in the fourth layer. According to this structure, the learning proceeds similarly to that in the above-described literature.
Unlike the above-described literature, the output of the discriminator of each node to be learned before reaching the final branch is not a scalar value but a vector value whose number of elements corresponds to the number of branches in the layer immediately below the node. More specifically, each node detector before the branch has a function of selecting the branch for the following layer as well as terminating a non-face image. When detection is performed, only the branches corresponding to the elements of the vector value of each node that are nearly one are started up. Thus, unnecessary operations do not have to be performed, thereby ensuring speedy performance.
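The branch selection by a vector-valued node output can be sketched as follows. This is only an illustrative Python fragment, not the method of Huang et al.: the function name `select_branches` and the `tolerance` used to decide that an element is “nearly one” are hypothetical values chosen for the example.

```python
def select_branches(node_output, tolerance=0.25):
    """Return the indices of the child branches whose vector element is close to one.

    `tolerance` is a hypothetical parameter; how close to one an element must
    be is not specified here and is assumed for illustration.
    """
    return [index for index, value in enumerate(node_output)
            if abs(value - 1.0) <= tolerance]


# A node with three child branches: the first and third are started up.
active = select_branches([0.9, 0.1, 1.05])   # -> [0, 2]

# No element is near one: no branch is started, so the input is terminated.
terminated = select_branches([0.0, 0.1, 0.2])  # -> []
```

Because a single vector output both rejects non-face inputs (no element near one) and chooses which child branches to evaluate, the branches that cannot contain the input pattern are never executed.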
The literatures by Zhang et al., Huang et al., and Japanese Patent Application Laid-Open No. 2005-284487 determine the method for dividing the range of the variation by the Coarse to Fine strategy or the tree structure. For example, in the literature by Zhang et al., only the variation by the depth rotation in the horizontal direction can be divided, but not the in-plane rotation.
Japanese Patent Application Laid-Open No. 2005-284487 handles the variation by the depth rotation in both the horizontal and vertical directions and determines the structure such that the variation in the horizontal direction is divided in the upper layers, and then the variation in the vertical direction is divided in the lower layers. The literature by Huang et al. divides the variation by the in-plane rotation after the variation by the depth rotation in the horizontal direction is divided.
Since these branch structures are determined experimentally (or intuitively) by a human who performs the machine learning processing, the branch structure is not necessarily the best one for identifying the pattern including the variations to be identified. For example, in the literature by Huang et al. described above, if the variation by the depth rotation were divided after the variation by the in-plane rotation is divided, the identification performance or the processing speed might be improved, since the ratio of branches not including the input pattern that are terminated at an early point might increase.
The structure having the best detection performance could be adopted after various branch structures are checked to select the most appropriate one. However, since machine learning processing is generally very time consuming, it is not realistic to exhaustively try all possible branch structures.