1. Field of the Invention
The present invention relates to a technique for identifying and extracting a specific data pattern included in image data, audio data, and the like.
2. Description of the Related Art
In recent years, in the field of pattern recognition, attention has been paid to a method that configures an identification unit by cascade-connecting weak discriminators and thereby detects, at high speed, a specific object such as a human face in an image.
For example, in a method disclosed by Viola and Jones in P. Viola and M. Jones, “Rapid Object Detection using a Boosted Cascade of Simple Features”, Proc. IEEE Conf. on Computer Vision and Pattern Recognition, Vol. 1, pp. 511-518, December 2001 (to be referred to as “non-patent reference 1” hereinafter), a predetermined number of weak discriminators, each of which extracts a rectangular feature, are cascade-connected to configure a strong discriminator called a stage. Each weak discriminator is generated by a boosting learning algorithm (disclosed in Japanese Patent Laid-Open No. 8-329031 and the like). Furthermore, a pattern identification unit with a configuration obtained by cascade-connecting a plurality of such stages has been proposed. Processing advances while an abort determination (end of processing for a certain detection target position in an image) is made for each strong discriminator. Since subsequent calculations are aborted for an input that is determined at an early stage not to be a detection target, high-speed processing can be executed as a whole. This pattern identification method will be described in detail below.
The pattern identification unit of non-patent reference 1 moves a rectangular region (processing window 801) having a specific size within an image 800 to be processed, and checks if the processing window 801 at each moving destination includes a human face, as shown in FIG. 8.
FIG. 9 is a view showing the sequence of face detection processing executed in non-patent reference 1 in the processing window 801 at each moving destination position. The face detection processing in a certain processing window is executed using a plurality of stages. Weak discriminators of different combinations are assigned to the respective stages, and are processed by cascade connection to serve as strong discriminators. Each weak discriminator detects a so-called Haar-like feature, and comprises a combination of rectangular filters. As shown in FIG. 9, the number of weak discriminators assigned differs from stage to stage. The stages themselves are also cascade-connected, and execute determination processing in the order in which they are connected. For example, in FIG. 9, the second stage executes the determination processing after the first stage, and the third stage then executes the determination processing.
Each stage checks, using the weak discriminators of its assigned patterns in turn, whether a processing window includes a human face. If it is determined in a certain stage that the processing window does not include any human face, the subsequent stages do not execute the determination processing for the processing window at that position. That is, the cascade processing is aborted. If it is determined in the final stage that a human face is included, the determination that the processing window at that position includes a human face is settled.
The sequence of the face detection processing will be described in detail below with reference to the flowchart of FIG. 10.
In the face detection processing, the processing window 801 to be processed is placed on a face detection target image 800 (step S1001). Basically, processing windows are selected exhaustively by scanning from an end of the face detection target image 800 at predetermined intervals in the vertical and horizontal directions in turn, as shown in FIG. 8. For example, the processing window is selected by raster-scanning the face detection target image 800.
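The raster-scan selection of processing windows in step S1001 can be illustrated by the following minimal sketch. The function name scan_windows, the square window shape, and the parameter names are assumptions made for illustration only and do not appear in non-patent reference 1.

```python
def scan_windows(image_width, image_height, window_size, step):
    """Yield the upper-left corner (x, y) of each processing window.

    Windows are visited in raster order: left to right along a row,
    then down to the next row, at a fixed interval `step`.
    All names here are hypothetical and chosen for illustration.
    """
    for y in range(0, image_height - window_size + 1, step):
        for x in range(0, image_width - window_size + 1, step):
            yield (x, y)


# Toy example: an 8x6 image scanned with a 4x4 window at an interval of 2.
positions = list(scan_windows(image_width=8, image_height=6,
                              window_size=4, step=2))
print(positions)
```

In this toy case three window positions fit horizontally and two vertically, so six windows are visited.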
The determination processing as to whether or not the selected processing window includes a human face is executed for that processing window. This determination processing is executed using a plurality of stages, as described above using FIG. 9. For this reason, a stage that executes the determination processing is selected in turn from the first stage (step S1002).
The selected stage executes the determination processing (step S1003). In the determination processing of this stage, if an accumulated score (to be described later) does not exceed a threshold determined in advance for each stage (NO in step S1004), it is determined that the processing window does not include any human face (step S1008), and the processes in step S1007 and subsequent steps are executed. The processes in step S1007 and subsequent steps will be described later.
On the other hand, if the accumulated score exceeds the threshold determined in advance for each stage (YES in step S1004), it is determined whether that determination processing (that in step S1003) has been executed by the final stage. If the determination processing has not been executed by the final stage (NO in step S1005), the process returns to step S1002 to select the next stage, and the newly selected stage executes the determination processing. On the other hand, if the determination processing has been executed by the final stage (YES in step S1005), it is finally determined that the current processing window includes a human face (step S1006).
It is then determined whether the processing window that has undergone the determination processing is the last processing window in the face detection target image. If the processing window is not the last processing window (NO in step S1007), the process returns to step S1001 to select the next processing window and to execute the processes in step S1002 and subsequent steps. On the other hand, if the processing window is the last processing window (YES in step S1007), the face detection processing for this input image as a face detection target ends.
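The control flow of FIG. 10 can be summarized by the following minimal sketch. It assumes that each stage is modeled as a callable returning True when the accumulated score exceeds the stage threshold; the names detect and make_stage, and the integer stand-ins for processing windows, are hypothetical and serve only to make the abort behavior visible.

```python
def detect(windows, stages):
    """Return the windows that pass every stage of the cascade."""
    faces = []
    for window in windows:              # S1001: select a processing window
        accepted = True
        for stage in stages:            # S1002: select stages in order
            if not stage(window):       # S1003/S1004: stage determination
                accepted = False        # S1008: no face; abort the cascade
                break
        if accepted:                    # S1005 YES -> S1006: face settled
            faces.append(window)
    return faces                        # S1007: last window was processed


calls = []  # records which stage was evaluated for which window


def make_stage(name, predicate):
    """Wrap a toy predicate so that every evaluation is logged."""
    def stage(window):
        calls.append((name, window))
        return predicate(window)
    return stage


# Toy cascade: windows are integers, and the "stages" are simple tests.
stages = [make_stage("stage1", lambda w: w % 2 == 0),
          make_stage("stage2", lambda w: w > 4)]
faces = detect(range(8), stages)
print(faces, len(calls))
```

In this toy run the second stage is evaluated only for the four windows that pass the first stage, which is the source of the speed-up described above.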
The processing contents of determination in each stage will be described below.
Weak discriminators of one or more patterns are assigned to each stage. This assignment is executed by an ensemble learning algorithm such as AdaBoost or the like in learning processing. Each stage determines based on the weak discriminators of patterns assigned to itself if the processing window includes a face.
In each stage, feature amounts in a plurality of rectangular regions in the processing window are calculated based on the weak discriminators of the patterns assigned to that stage. The feature amount used in this case is a value calculated using the sum total value of pixel values in each rectangular region (sum total value in a rectangular region) such as the total value, average value, and the like of the pixel values in the rectangular region. The sum total value in the rectangular region can be calculated at high speed using accumulated image information (called a Summed Area Table (SAT) or Integral Image) for an input image.
FIGS. 11A and 11B are views for explaining an example of the SAT. FIG. 11A shows an original input image 1101, and has the upper left corner as an origin (0,0). Letting I(x, y) be a pixel value at a coordinate position (x, y) of the input image 1101, a component C(x, y) at the same position (x, y) of the SAT is defined as:
$$C(x, y) = \sum_{x' \le x,\; y' \le y} I(x', y') \qquad (1)$$
As shown in FIG. 11B, the sum total value of pixels in a rectangle defined by the origin position (0,0) and a position (x, y) as diagonal positions on the input image 1101 is a value C(x, y) at the position (x, y). The sum of pixel values I(x, y) in an arbitrary rectangular region on the input image 1101 can be calculated with reference to four points shown in, for example, FIG. 12 using:
$$C(x_0, y_0; x_1, y_1) = C(x_0 - 1, y_0 - 1) - C(x_0 - 1, y_1) - C(x_1, y_0 - 1) + C(x_1, y_1) \qquad (2)$$
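Equations (1) and (2) can be checked with the following minimal sketch. It assumes that the image is given as a list of rows indexed as image[y][x], and that C is treated as 0 whenever an index of −1 is referenced (a boundary convention that the text does not state). The function names build_sat and rect_sum are hypothetical.

```python
def build_sat(image):
    """Build the SAT so that sat[y][x] equals C(x, y) of equation (1)."""
    height, width = len(image), len(image[0])
    sat = [[0] * width for _ in range(height)]
    for y in range(height):
        row_sum = 0  # running sum of the current row up to column x
        for x in range(width):
            row_sum += image[y][x]
            sat[y][x] = row_sum + (sat[y - 1][x] if y > 0 else 0)
    return sat


def rect_sum(sat, x0, y0, x1, y1):
    """Sum of pixels with x0 <= x <= x1 and y0 <= y <= y1, per equation (2)."""
    def c(x, y):
        # Assumption: C is 0 for positions left of or above the image.
        return sat[y][x] if x >= 0 and y >= 0 else 0
    return c(x0 - 1, y0 - 1) - c(x0 - 1, y1) - c(x1, y0 - 1) + c(x1, y1)


image = [[1, 2, 3],
         [4, 5, 6],
         [7, 8, 9]]
sat = build_sat(image)
print(rect_sum(sat, 1, 1, 2, 2))  # 5 + 6 + 8 + 9
```

Once the SAT has been built, the sum in any rectangle requires only four table references, regardless of the size of the rectangle.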
A relative value of the calculated feature amounts (for example, a ratio or a difference value; assume that a difference value is used in this case) is then calculated, and it is determined based on this difference value whether the processing window includes a human face. More specifically, it is determined whether the calculated difference value is larger or smaller than a threshold set in the weak discriminator of the pattern used in the determination. According to this determination result, whether or not the processing window includes a human face is determined.
However, the determination at this time is that of an individual weak discriminator of each pattern, and is not the determination of the stage. In this manner, in each stage, the determination processes are individually executed based on the weak discriminators of all the assigned patterns, and the respective determination results are obtained.
Next, an accumulated score in that stage is calculated. Individual reliability weights (scores) are assigned to weak discriminators of respective patterns. The reliability weight is a fixed value indicating “probability of determination”, that is, an individual reliability. If it is determined that the processing window includes a human face, a score assigned to the weak discriminator of the pattern used at that time is referred to, and is added to the accumulated score of the stage. In this way, the sum total of scores individually added is calculated as the accumulated score of that stage. That is, this accumulated score is a value indicating the probability of determination in that stage as a whole (whole stage reliability). If the whole stage reliability exceeds a predetermined threshold (whole stage reliability threshold), it is determined in this stage that the processing window is likely to include a human face, and the process is continued to advance to the next stage. On the other hand, if the whole stage reliability in this stage does not exceed the threshold, it is determined that the processing window does not include any human face, and the subsequent cascade processing is aborted.
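The determination in one stage can be illustrated by the following minimal sketch. It assumes that each weak discriminator compares the difference of two rectangle sums with its own threshold and votes "face" when the difference is larger (the text leaves the direction of the comparison open). The dictionary keys, rectangle names, and numeric values are hypothetical.

```python
def evaluate_stage(weak_discriminators, features, stage_threshold):
    """Return (accumulated score, whether the stage is passed).

    features maps a rectangle name to the sum total value in that rectangle.
    Each weak discriminator contributes its fixed reliability score only
    when it individually determines that a face is present.
    """
    accumulated = 0.0
    for wd in weak_discriminators:
        difference = features[wd["rect_a"]] - features[wd["rect_b"]]
        if difference > wd["threshold"]:    # assumed direction of the test
            accumulated += wd["score"]
    return accumulated, accumulated > stage_threshold


# Toy rectangle sums and weak discriminators (illustrative values only).
features = {"eyes": 40, "cheeks": 90, "nose": 70}
weak_discriminators = [
    {"rect_a": "cheeks", "rect_b": "eyes", "threshold": 30, "score": 0.75},
    {"rect_a": "nose", "rect_b": "eyes", "threshold": 40, "score": 0.25},
    {"rect_a": "cheeks", "rect_b": "nose", "threshold": 10, "score": 0.5},
]
accumulated, passed = evaluate_stage(weak_discriminators, features,
                                     stage_threshold=1.0)
print(accumulated, passed)
```

Here the first and third weak discriminators vote "face", so only their scores are accumulated and compared with the whole stage reliability threshold.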
In non-patent reference 1, high-speed pattern identification represented by face detection is implemented in such sequence. Note that a detector in FIGS. 9 and 10 can be used as a pattern identification unit for patterns other than faces if it has undergone appropriate learning in advance.
Japanese Patent Laid-Open No. 2004-185611 and Japanese Patent Laid-Open No. 2005-044330 have disclosed inventions associated with a pattern identification method and apparatus based on the idea of non-patent reference 1. A pattern identification unit having a structure in which such weak discriminators are cascade-connected in line provides fast and sufficient identification performance when separating mutually look-alike patterns (detection target patterns) from other patterns (non-detection target patterns), especially in an image.
However, when a detection target pattern is, for example, a face image, a pattern that inclines to the left or right through several tens of degrees (to be referred to as in-plane rotation hereinafter) while still looking in a frontal direction is no longer a “look-alike” pattern with respect to an original erected frontal face. Furthermore, if a pattern is rotated in the axial direction (to be referred to as depth rotation or lateral depth rotation hereinafter) to be close to a side face, it becomes a quite different two-dimensional image pattern. It is inherently difficult to identify patterns with such large variations by the cascade connection in line, and attempting to do so results in an increase in processing time and deterioration of detection precision. Since the cascade connection structure of weak discriminators aims at excluding, little by little, non-detection target patterns that are unlike the detection target pattern to be identified, it is premised on the patterns to be identified being look-alike patterns.
In the case of in-plane rotation alone, faces at any angle over the full 360° can be identified by rotating the input image little by little and inputting it, after each rotation, to an identification unit that detects a frontal face close to an erected image. However, with this method, the processing time increases according to the number of rotations, and the method cannot cope with depth rotation.
Hence, Z. Zhang, L. Zhu, S. Z. Li, and H. Zhang, “Real-Time Multi-View Face Detection”, Proceedings of the Fifth IEEE International Conference on Automatic Face and Gesture Recognition (FGR'02) (to be referred to as “non-patent reference 2” hereinafter), proposes an identification unit having a hierarchical pyramid structure based on the Coarse to Fine approach. With this identification unit, in the first hierarchy, learning image patterns including all face-view variations to be detected are input to learn one stage. In the second hierarchy, face-view variations are divided for respective predetermined ranges, and a plurality of stages are learned using learning image patterns including only variations within the divided ranges. In the next hierarchy, variations are further divided into narrower ranges, and a plurality of stages, the number of which is further increased, are learned. In this manner, the number of strong discriminators (stages) whose robustness is gradually lowered is gradually increased as the hierarchies progress, thus configuring a pyramid-like structure. Note that the identification unit in this reference supports rotation by dividing only face-view variations of lateral depth rotation. A full depth rotation range of ±90° is divided into three ranges in the second hierarchy, and is divided into nine ranges in the third hierarchy, but an in-plane rotation range is not divided.
In the detection processing of this identification unit, if an input sub-window has passed the stage of the first hierarchy, it executes respective stages of the second hierarchy in turn. If the sub-window has passed any one of the stages, it advances to the stages of the next hierarchy. In this manner, since the identification unit begins with coarse detection and then proceeds to detections with gradually higher precision levels, the identification unit that can detect face patterns of all variations with high precision is configured.
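The hierarchy-by-hierarchy flow described above can be illustrated by the following minimal sketch. As a toy stand-in, a sub-window is represented only by a lateral depth rotation angle in degrees, and each stage is modeled as a test on an angle range; an actual stage is a learned strong discriminator and does not know the angle. The names pyramid_detect, make_range_stage, and split are hypothetical.

```python
def pyramid_detect(sub_window, hierarchies):
    """Return True if the sub-window passes at least one stage per hierarchy."""
    for stages in hierarchies:
        # Stages of a hierarchy are executed in turn; passing any one of
        # them is enough to advance to the next hierarchy.
        if not any(stage(sub_window) for stage in stages):
            return False
    return True


def make_range_stage(low, high):
    """Toy stage that accepts angles within [low, high]."""
    return lambda angle: low <= angle <= high


def split(low, high, parts):
    """Divide [low, high] into `parts` equal ranges, one toy stage each."""
    width = (high - low) / parts
    return [make_range_stage(low + i * width, low + (i + 1) * width)
            for i in range(parts)]


# One stage, then three, then nine, covering a range of -90 to +90 degrees.
hierarchies = [split(-90, 90, 1), split(-90, 90, 3), split(-90, 90, 9)]
print(pyramid_detect(45, hierarchies), pyramid_detect(120, hierarchies))
```

Note that in this structure every stage of the next hierarchy is a candidate for execution once any stage of the current hierarchy has been passed.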
Japanese Patent Laid-Open No. 2005-284487 also discloses a method of configuring an identification unit having a tree structure which gradually branches from a detector with large robustness into those with lower robustness, based on the same idea. In this identification unit, one node (stage) of each arm of a tree is learned to cover a partial variation range obtained by dividing a variation range to be covered by a parent node. Face-view variations supported by an embodiment disclosed in Japanese Patent Laid-Open No. 2005-284487 include not only those of lateral depth rotation but also those of longitudinal depth rotation in which a face turns up or down from a frontal face. A person empirically determines the numbers of weak discriminator stages of respective nodes to execute learning.
In detection processing, after the detection processing of the first node including all longitudinal and lateral depth rotation variations is executed, the process branches to three variations, that is, a frontal face and depth-rotation faces in the right and left directions. In the next hierarchy, the process further branches into three different longitudinal depth rotation variation ranges. Only longitudinal rotation central variations of a frontal face further branch to three ranges in the next hierarchy. After such branch structure is determined in advance, many sample data corresponding to respective variations are input to make respective branches learn. Unlike in non-patent reference 2, since calculations of lower hierarchy nodes included in variations aborted at a higher hierarchy node need not be executed, high-speed processing can be implemented. Note that a weak discriminator in Japanese Patent Laid-Open No. 2005-284487 uses a pixel difference in place of a rectangular difference. However, the idea that a strong discriminator is configured by cascade-connecting weak discriminators remains the same.
C. Huang, H. Ai, Y. Li, and S. Lao, “Vector Boosting for Rotation Invariant Multi-View Face Detection”, Tenth IEEE International Conference on Computer Vision (ICCV2005), Volume 1, 17-21 October 2005, pp. 446-453 (to be referred to as “non-patent reference 3” hereinafter) proposes another learning method of an identification unit having a tree structure similar to Japanese Patent Laid-Open No. 2005-284487. Variations supported by the identification unit described in this reference include in-plane rotation and lateral depth rotation variations. This reference defines a structure in which lateral depth rotation variations are branched from a node of the first hierarchy into five ranges in two hierarchies, and after that, rotation variations are branched into three ranges in the fourth hierarchy. As in the above reference, learning is executed according to this predetermined structure. Also, a person empirically determines the numbers of weak discriminator stages of respective nodes to execute learning in the same manner as in the above reference.
However, unlike in the above reference, the output of the discriminators of each node learned before the last branch is reached is not a scalar value but a vector value having the number of elements equal to the number of branches of the hierarchy next to that node. That is, each node detector before branching has not only a function of aborting a non-face image but also a function of making branch selection of the next hierarchy. Upon detection, only the branches corresponding to those elements of the output vector of each node whose values are close to 1 are launched. Unnecessary calculations therefore need not be made, which keeps the processing fast.
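The branch selection based on a vector output can be illustrated by the following minimal sketch. The cutoff of 0.5 used to decide whether an element is “close to 1” and the function name select_branches are assumptions for illustration; non-patent reference 3 is not quoted here for the actual decision rule.

```python
def select_branches(output_vector, cutoff=0.5):
    """Return the indices of child branches to launch.

    An element at or above `cutoff` (an assumed value) is treated as
    close to 1. An empty result means that no branch is launched, that
    is, the input is aborted at this node.
    """
    return [index for index, value in enumerate(output_vector)
            if value >= cutoff]


# A node with three child branches.
print(select_branches([0.1, 0.9, 0.8]))
print(select_branches([0.0, 0.2, 0.1]))
```

Only the second and third branches are launched for the first output vector, and none for the second.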
In the related arts such as non-patent references 2 and 3 and Japanese Patent Laid-Open No. 2005-284487, a division method (i.e., a branch structure) of variation ranges based on the Coarse to Fine approach or tree structure is determined prior to learning. The numbers of weak discriminator stages of respective divided nodes (stages) are predetermined numbers of stages which are empirically (or intuitively) determined by a person who conducts machine learning processing. For example, Japanese Patent Laid-Open No. 2005-284487 determines that the number of weak discriminators of an arm node of each branch is 100. Also, non-patent reference 3 generates vector output weak discriminators one by one (i.e., of T stages) by repetitive processing of T times.
The number of weak discriminator stages which is, for example, empirically determined by a person is not always optimal. In the pattern identification unit having the branch structure (or pyramid structure), the patterns to be identified require less robustness in nodes in later stages. Therefore, both the number of processing stages (i.e., the speed) and the precision required to separate target patterns from other patterns such as a background improve toward later stages. The number of processing stages empirically determined in the related arts is considered sufficient to determine whether or not processing of a node of a subsequent stage is to start, but it is not the minimum required number of processing stages. Since the required robustness becomes lower toward the later stages, the processing speed can be expected to improve when a node after the last branch is reached as early as possible. However, the above related arts provide no means for discriminating the minimum required number of processing stages in each branch node.