In the fields of image recognition and speech recognition, techniques are conventionally known that detect an object to be recognized from an image containing that object and a background, by implementing a recognition processing algorithm specialized for the specific object either in computer software or in hardware using a dedicated parallel image processing processor.
In particular, as techniques for detecting a face as the specific object to be recognized, Japanese Patent Laid-Open No. 9-251534 discloses a technique that searches an input image for a face region using a template called a standard face, and then applies partial templates to feature point candidates such as the eyes, nostrils, and mouth to authenticate a person. Japanese Patent No. 2767814 discloses a technique that obtains eye and mouth candidate groups from a face image, forms face candidates by combining these groups, and collates the face candidates with a pre-stored face structure to find the regions corresponding to the eyes and mouth. Furthermore, Japanese Patent Laid-Open No. 9-44676 discloses a technique that obtains a plurality of eye, nose, and mouth candidates and detects a face on the basis of a positional relationship among feature points prepared in advance.
Also, Japanese Patent No. 2973676 discloses a technique that deforms shape data while checking the matching level between the shape data of each face part and an input image, and determines the search region of each face part based on a previously obtained positional relationship among the parts. Japanese Patent Laid-Open No. 11-283036 discloses a technique that recognizes a face by moving, across an input image, a region model in which a plurality of judgment element acquisition regions are set, and judging the presence or absence of each judgment element within each of these regions.
As techniques for detecting a rotated object, those disclosed in Japanese Patent Laid-Open No. 11-15973 and “Rotation Invariant Neural Network-Based Face Detection” (H. Rowley, T. Kanade, CVPR98, pp. 38-44) are known. The former technique applies a polar coordinate transformation to the object about its central coordinate position, so that a rotation of the object becomes a shift in the transformed coordinates, thereby allowing the rotation to be detected. The latter technique prepares a neural network (abbreviated as “NN” hereinafter) that detects the rotation angle of a face as a pre-stage of face detection, rotates the input image in accordance with the angle output by that NN, and inputs the rotated image to an NN that performs face detection.
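The idea that a polar coordinate transformation turns rotation into shift can be illustrated with a minimal sketch. It assumes the object centre is already known and placed at the origin, samples a pattern on a hypothetical 16-ring by 36-sector polar grid, and recovers a rotation as a circular shift along the angle axis. The names `polar_sample`, `rotated`, `estimate_shift`, and the test pattern `blob` are illustrative only and are not taken from Japanese Patent Laid-Open No. 11-15973.

```python
import numpy as np

N_THETA = 36                      # hypothetical number of angular bins (10 degrees each)
STEP = 2 * np.pi / N_THETA

def polar_sample(f, n_r=16, r_max=1.0):
    """Sample f(x, y) on a polar grid centred at the origin (the assumed,
    already-known object centre). Rows index radius, columns index angle."""
    r = np.linspace(r_max / n_r, r_max, n_r)
    theta = np.arange(N_THETA) * STEP
    rr, tt = np.meshgrid(r, theta, indexing="ij")
    return f(rr * np.cos(tt), rr * np.sin(tt))

def rotated(f, angle):
    """Return the pattern f rotated counter-clockwise by `angle` about the origin."""
    c, s = np.cos(angle), np.sin(angle)
    # g(p) = f(R^-1 p): undo the rotation on the sampling coordinates
    return lambda x, y: f(c * x + s * y, -s * x + c * y)

def estimate_shift(p_ref, p_rot):
    """Circular shift along the angle axis that best aligns p_ref with p_rot."""
    errors = [np.sum((np.roll(p_ref, k, axis=1) - p_rot) ** 2) for k in range(N_THETA)]
    return int(np.argmin(errors))

def blob(x, y):
    # Hypothetical asymmetric test pattern: an off-centre Gaussian blob.
    return np.exp(-((x - 0.5) ** 2 + (y - 0.2) ** 2) / 0.05)

p0 = polar_sample(blob)
p1 = polar_sample(rotated(blob, 7 * STEP))
print(estimate_shift(p0, p1))  # 7, i.e. a 70-degree rotation
```

Because every sample is taken relative to the assumed centre, an error in that centre breaks the pure-shift relationship, which is the dependence on the central coordinate position noted for this technique.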
However, pattern detection using the aforementioned prior art suffers from the following problems.
That is, the technique described in Japanese Patent Laid-Open No. 9-251534 is not robust to variations in face size or face direction, since a standard face is first matched against the entire face to detect the face region. To support various sizes and face directions, a plurality of standard faces suited to the respective cases must be prepared and used for detection. However, such processing requires comparisons with a large number of templates, resulting in high processing cost.
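How the cost grows with the number of templates can be seen in a minimal exhaustive template-matching sketch. It assumes sum-of-squared-differences (SSD) scoring and same-sized templates, both of which are illustrative choices rather than the matching measure of Japanese Patent Laid-Open No. 9-251534; `match_template_bank` and the random "standard faces" are hypothetical.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def match_template_bank(image, templates):
    """Score every template at every image position by SSD.

    Returns ((best_score, template_index, (row, col)), n_comparisons),
    where n_comparisons counts one pixel comparison per template pixel
    per candidate position."""
    best = (np.inf, -1, (-1, -1))
    comparisons = 0
    for idx, tpl in enumerate(templates):
        windows = sliding_window_view(image, tpl.shape)   # (H-h+1, W-w+1, h, w)
        ssd = ((windows - tpl) ** 2).sum(axis=(2, 3))
        comparisons += windows.size                        # grows linearly with len(templates)
        r, c = np.unravel_index(np.argmin(ssd), ssd.shape)
        if ssd[r, c] < best[0]:
            best = (float(ssd[r, c]), idx, (int(r), int(c)))
    return best, comparisons

rng = np.random.default_rng(0)
templates = [rng.random((8, 8)) for _ in range(3)]   # hypothetical "standard faces"
image = rng.random((64, 64))
image[10:18, 20:28] = templates[1]                   # plant template 1 at (10, 20)
best, n = match_template_bank(image, templates)
print(best[1], best[2], n)   # 1 (10, 20) 623808
```

Each additional size or face direction adds one more full pass over the image, so supporting, say, several sizes times several directions multiplies the comparison count accordingly.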
The technique described in Japanese Patent No. 2767814 collates face candidate groups in the input image with a pre-stored face structure, but it assumes that the number of faces in the input image is limited to one or a few. It also assumes an input image in which the face is reasonably large and occupies most of the image, leaving only a small background region. With such an input image, even when face candidates are generated from all the eye and mouth candidate groups, the number of face candidates remains limited. However, in an image photographed by an ordinary camera or video camera, the face may be small and the background large. In such cases, a large number of eye and mouth candidates are erroneously detected in the background. Therefore, when face candidates are generated from all the eye and mouth candidate groups by the method described in Japanese Patent No. 2767814, the number of face candidates becomes huge, which increases the processing cost required for collation with the face structure.
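The growth in the number of face candidates can be made concrete with a small count. This sketch assumes, for illustration only, that a face candidate is one unordered pair of eye candidates combined with one mouth candidate; the function name and the candidate counts are hypothetical, not figures from Japanese Patent No. 2767814.

```python
from math import comb

def face_candidate_count(n_eye, n_mouth):
    """Number of face candidates if every unordered pair of eye candidates is
    combined with every mouth candidate (a hypothetical two-eyes-plus-mouth model)."""
    return comb(n_eye, 2) * n_mouth

print(face_candidate_count(3, 2))    # close-up portrait: 6
print(face_candidate_count(50, 40))  # cluttered background: 49000
```

The count grows quadratically in the number of eye candidates and linearly in the number of mouth candidates, so background clutter quickly dominates the collation cost.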
Also, with the inventions described in Japanese Patent Laid-Open No. 9-44676 and Japanese Patent No. 2973676, when the background includes a large number of eye, nose, and mouth candidates, the processing cost required to collate their positional relationships becomes huge.
Furthermore, the technique described in Japanese Patent No. 2973676 holds shape data of the iris, mouth, nose, and the like. It first obtains two irises and then the mouth, nose, and the like, limiting the search regions for face parts such as the mouth and nose on the basis of the positions of the irises (eyes). That is, rather than detecting the face parts that form a face, such as the irises (eyes), mouth, and nose, in parallel, this algorithm finds the irises (eyes) first and then detects the mouth, nose, and other parts using the iris detection result. This method assumes that the image includes only one face and that the irises (eyes) are obtained accurately. If the irises (eyes) are erroneously detected, the search regions for other features such as the mouth and nose cannot be set correctly.
With the invention described in Japanese Patent Laid-Open No. 11-283036, region models of different sizes and rotated region models must be prepared to cope with faces of different sizes or rotated faces. However, if no face of a given size or rotation angle is actually present, the calculations performed with the corresponding models are wasted. Furthermore, the polar coordinate transformation in the technique described in Japanese Patent Laid-Open No. 11-15973 depends heavily on the precision of the central coordinate position. However, it is difficult to detect that central coordinate position accurately while the location of the object in the image is still being detected.
Moreover, in the invention described in “Rotation Invariant Neural Network-Based Face Detection”, the accuracy of the latter-stage face detection NN depends on that of the former-stage NN that detects the rotation angle. If the former-stage NN outputs a wrong angle, face detection becomes difficult. When an image includes a plurality of objects with different rotation angles, the input image must be rotated by each of those angles, and every transformed image must be input to the face detection NN to perform face detection over the entire image. Hence, compared to detection in an image free from rotation, the processing cost increases considerably.
Also, a technique for identifying the pattern of an input signal by hierarchically extracting features is known. With this method, a higher-order feature is extracted using the lower-order features that constitute it, which allows identification that is robust against variations in the patterns to be identified. However, improving the robustness against pattern variations requires increasing the number of types of features to be extracted, which increases the processing cost. Conversely, if the number of types of features to be extracted is not increased, identification errors are more likely to occur.
To solve the aforementioned problems, Japanese Patent Publication No. 7-11819 discloses the following pattern recognition method. A dictionary pattern is prepared by arranging the feature vectors of the patterns of the respective classes in descending order of the variance of each vector component. Feature vectors are then generated from an input pattern, the upper N dimensions of these feature vectors are matched against the dictionary pattern, and matching on the lower dimensions is conducted based on the result of that first matching, thus reducing the processing cost.
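A minimal sketch of variance-ordered, two-stage matching follows. It assumes squared Euclidean distance as the matching measure and interprets "based on the former matching result" as keeping a shortlist of the closest classes from the first stage; both the measure and the shortlist size `keep` are assumptions, and `build_dictionary` and `two_stage_match` are hypothetical names, not the procedure of Japanese Patent Publication No. 7-11819.

```python
import numpy as np

def build_dictionary(class_vectors):
    """Reorder feature dimensions by descending variance across the classes.
    class_vectors: (n_classes, d) array, one representative vector per class."""
    order = np.argsort(-np.var(class_vectors, axis=0), kind="stable")
    return order, class_vectors[:, order]

def two_stage_match(x, order, dictionary, n_upper, keep):
    """Stage 1: compare only the n_upper highest-variance dimensions and keep
    the `keep` closest classes. Stage 2: add the remaining lower dimensions
    for those shortlisted classes only and return the best class index."""
    xo = x[order]
    coarse = ((dictionary[:, :n_upper] - xo[:n_upper]) ** 2).sum(axis=1)
    shortlist = np.argsort(coarse, kind="stable")[:keep]
    rest = ((dictionary[shortlist, n_upper:] - xo[n_upper:]) ** 2).sum(axis=1)
    return int(shortlist[np.argmin(coarse[shortlist] + rest)])

rng = np.random.default_rng(0)
classes = rng.normal(size=(10, 20))          # hypothetical dictionary: 10 classes, 20-D
order, dictionary = build_dictionary(classes)
query = classes[4] + 0.01 * rng.normal(size=20)
print(two_stage_match(query, order, dictionary, n_upper=5, keep=3))
```

Only `keep` classes are compared on the lower dimensions, so the full-length comparison is avoided for most of the dictionary.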
Japanese Patent Laid-Open No. 10-11543 discloses a pattern recognition dictionary generation device and a pattern recognition apparatus which extract feature vectors from input data, classify them into clusters based on their coincidence levels with the standard vectors of the respective clusters, and then classify them into categories based on the coincidence levels between the category standard vectors and the feature vectors within the cluster into which the input pattern has been classified, thus reducing the processing cost of matching.
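The cluster-then-category scheme can be sketched as two nearest-standard-vector searches. This assumes squared Euclidean distance as the (inverse) coincidence level; `classify_two_level`, the cluster centres, and the category labels are hypothetical and do not reproduce the apparatus of Japanese Patent Laid-Open No. 10-11543.

```python
import numpy as np

def classify_two_level(x, cluster_centres, categories_by_cluster):
    """Two-level nearest-standard-vector classification.

    cluster_centres: (k, d) array of cluster standard vectors.
    categories_by_cluster: one dict per cluster mapping a category label to
    its standard vector; only the chosen cluster's categories are compared."""
    k = int(np.argmin(((cluster_centres - x) ** 2).sum(axis=1)))
    cats = categories_by_cluster[k]
    label = min(cats, key=lambda c: float(((cats[c] - x) ** 2).sum()))
    return k, label

centres = np.array([[0.0, 0.0], [10.0, 10.0]])       # hypothetical clusters
categories = [
    {"a": np.array([-1.0, 0.0]), "b": np.array([1.0, 0.0])},
    {"c": np.array([9.0, 10.0]), "d": np.array([11.0, 10.0])},
]
print(classify_two_level(np.array([0.8, 0.1]), centres, categories))   # (0, 'b')
print(classify_two_level(np.array([9.2, 10.3]), centres, categories))  # (1, 'c')
```

An input is compared with the k cluster vectors plus the categories of one cluster, rather than with every category standard vector.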