In computer vision, object detection is the task of deciding whether or not a particular object appears in an image. Examples of such objects include cars, pedestrians, and human faces. In many applications, object detection is regarded as a very difficult problem. For example, when the object is a human face, its appearance changes greatly depending on the direction of the face, the illumination, and partial occlusion by sunglasses, a mask, or the like. Detection becomes still more difficult in applications such as surveillance systems, where the picture quality may be poor, noise may be superimposed on the image, or the face captured in the image may be small.
A common approach to solving the object detection problem is pattern recognition based on statistical learning, in which the parameters of an identifier are determined from learning samples given in advance. Common approaches to face detection use a neural network, a support vector machine, Bayes estimation, or the like. These approaches normally combine three technologies: a feature selecting technology that extracts, from the input image, a feature quantity used for identification; an identifier building technology that builds an identifier which takes the selected feature quantity as input and decides whether or not the object is present; and a technology that uses the built identifier to decide whether or not a face is present in an image window. Here, the “image window” means a partial area of the input image. A large number of windows can be cut out from the input image by changing the position or size of the partial area.
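The way a large number of image windows arises from one input image can be sketched as follows. This is only an illustration: the function name `enumerate_windows`, the square window shape, the minimum size of 24 pixels, the scale factor of 1.25, and the stride of 4 pixels are all assumptions chosen for the example and are not taken from the cited literature.

```python
from typing import Iterator, Tuple

def enumerate_windows(img_w: int, img_h: int, min_size: int = 24,
                      scale: float = 1.25, stride: int = 4
                      ) -> Iterator[Tuple[int, int, int]]:
    """Yield (x, y, size) for square windows at every position and scale.

    All parameter values are illustrative assumptions.
    """
    size = float(min_size)
    while int(size) <= min(img_w, img_h):
        s = int(size)
        # Slide the window of side s over the image in steps of `stride`.
        for y in range(0, img_h - s + 1, stride):
            for x in range(0, img_w - s + 1, stride):
                yield (x, y, s)
        size *= scale  # move on to the next, larger window size

# A tiny 32x32 image already produces several windows.
windows = list(enumerate_windows(32, 32))
print(len(windows), windows[0], windows[-1])
```

Even for this tiny image several windows are produced; for a full-size image the count grows into the tens of thousands or more, which is why the cost of deciding each window matters.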
One identifier building approach is Adaptive Boosting, or Adaboost, described in Non-Patent Literature 1. This approach is hereinafter called the “Adaboost learning method”. It is applied in many object detecting apparatuses, and an approach of detecting a face in an image by using it is set forth in Non-Patent Literature 2. In the Adaboost learning method, an individual identifier may have a high error rate; it is sufficient that its discrimination error is below 50%, that is, that it performs better than random guessing. Such an identifier is called a weak classifier. A strong classifier whose error rate is low is then built by choosing some weak classifiers from a large number of prepared weak classifiers and combining them.
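The selection and combination of weak classifiers can be illustrated with a minimal sketch of discrete Adaboost. The weak classifiers here are simple one-feature threshold tests ("decision stumps"), which is an assumption made for brevity; the names `Stump`, `stump_predict`, `train_adaboost`, and `strong_predict`, as well as the toy data, are hypothetical and are not drawn from Non-Patent Literature 1 or 2.

```python
import math
from typing import List, Sequence, Tuple

# Hypothetical weak classifier: (feature index, threshold, polarity +1 or -1).
Stump = Tuple[int, float, int]

def stump_predict(stump: Stump, x: Sequence[float]) -> int:
    j, thr, pol = stump
    return 1 if pol * (x[j] - thr) > 0 else -1

def train_adaboost(X: Sequence[Sequence[float]], y: Sequence[int],
                   stumps: Sequence[Stump], rounds: int
                   ) -> List[Tuple[float, Stump]]:
    n = len(X)
    w = [1.0 / n] * n  # start with uniform sample weights
    ensemble: List[Tuple[float, Stump]] = []
    for _ in range(rounds):
        # Choose the prepared weak classifier with the lowest weighted error.
        best, best_err = None, float("inf")
        for s in stumps:
            err = sum(wi for wi, xi, yi in zip(w, X, y)
                      if stump_predict(s, xi) != yi)
            if err < best_err:
                best, best_err = s, err
        if best is None or best_err >= 0.5:
            break  # no remaining classifier is better than chance
        err = max(best_err, 1e-10)  # guard against log of infinity
        alpha = 0.5 * math.log((1.0 - err) / err)
        ensemble.append((alpha, best))
        # Increase the weights of misclassified samples, then renormalize.
        w = [wi * math.exp(-alpha * yi * stump_predict(best, xi))
             for wi, xi, yi in zip(w, X, y)]
        total = sum(w)
        w = [wi / total for wi in w]
    return ensemble

def strong_predict(ensemble: Sequence[Tuple[float, Stump]],
                   x: Sequence[float]) -> int:
    score = sum(alpha * stump_predict(s, x) for alpha, s in ensemble)
    return 1 if score >= 0 else -1

# Toy data: no single stump separates the labels, but three combined do.
X = [[float(i)] for i in range(10)]
y = [1, 1, 1, -1, -1, -1, 1, 1, 1, -1]
stumps = [(0, t, p) for t in (2.5, 5.5, 8.5) for p in (1, -1)]
model = train_adaboost(X, y, stumps, rounds=3)
print([strong_predict(model, x) for x in X] == y)
```

On this toy data the best single stump misclassifies 3 of the 10 samples, whereas the weighted vote of the three selected stumps classifies all of them correctly.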
Approaches for detecting a frontal face in real time by using the Adaboost learning method are set forth in Non-Patent Literature 2 and Patent Literature 1. The face identifier, i.e., the face sensor, set forth in these documents employs a cascade structure in which a plurality of strong classifiers are coupled in series. In the cascade structure, each coupled identifier is called a stage, and the stage positioned closest to the input side is called the first-stage strong classifier or the stage identifier at the first stage. The identifier at each stage is built by executing learning based on the Adaboost learning method and coupling a large number of weak classifiers based on feature quantities extracted from the input images used for learning. The identifier at each stage is trained such that almost 100% of the learning samples of face images are identified correctly, whereas only about 50% of the learning samples of non-face images need to be identified correctly. The stage identifier at the first stage decides whether or not the input image corresponds to a face, and the stage identifiers at the second and subsequent stages each decide whether or not an input image decided as a face by the preceding stage corresponds to a face. An input image decided as a non-face by the stage identifier at the n-th stage is not processed further and is finally decided as a non-face, so the processing can be done efficiently. It is known that the above approach can operate at a processing speed of about 15 frames per second.
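The early rejection performed by the cascade structure can be sketched as follows. The function `cascade_classify` and the three toy stages are hypothetical stand-ins: real stage identifiers would be strong classifiers trained by the Adaboost learning method, not the simple brightness tests used here.

```python
from typing import Callable, List, Sequence, Tuple

# A stage is modeled as (score function, threshold); this is an assumption.
Stage = Tuple[Callable[[Sequence[float]], float], float]

def cascade_classify(window: Sequence[float],
                     stages: Sequence[Stage]) -> Tuple[bool, int]:
    """Return (is_face, number of stages actually evaluated)."""
    for n, (score, threshold) in enumerate(stages, start=1):
        if score(window) < threshold:
            return False, n  # rejected at stage n; later stages are skipped
    return True, len(stages)

# Toy stand-ins for trained stage identifiers (not real face classifiers).
toy_stages: List[Stage] = [
    (lambda w: sum(w) / len(w), 50.0),  # stage 1: mean brightness
    (lambda w: max(w) - min(w), 20.0),  # stage 2: contrast
    (lambda w: w[0] - w[-1], 5.0),      # stage 3: first pixel minus last
]

print(cascade_classify([10, 12, 11, 9], toy_stages))   # rejected at stage 1
print(cascade_classify([60, 60, 60, 60], toy_stages))  # rejected at stage 2
print(cascade_classify([90, 80, 70, 60], toy_stages))  # passes every stage
```

If each stage is assumed to reject about 50% of the non-face windows that reach it, and the stages are treated as independent, roughly 0.5^n of the non-face windows survive n stages, for example about 0.1% after 10 stages. Most windows therefore cost only one or two stage evaluations.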
There is also an approach of improving identification accuracy by building a plurality of face sensors from different learning samples and integrating their identification results. As an example of this approach, a majority voting system is shown in Non-Patent Literature 2. The authors of Non-Patent Literature 2, Viola et al., indicate that an identification error is reduced by preparing three cascaded identifiers (identifiers having the cascade structure) and taking a majority vote of their outputs. As another example, the authors of Non-Patent Literature 3, Rowley et al., trained a number of neural networks to build a face sensor. As the method of coupling the results of the plurality of sensors, they proposed, instead of the majority voting system, using a neural network trained to output the final result from the outputs of the many neural network sensors.
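Majority voting over the outputs of several face sensors can be sketched in a few lines. The function name `majority_vote` and the use of boolean decisions per window are assumptions made for this illustration.

```python
from typing import Sequence

def majority_vote(decisions: Sequence[bool]) -> bool:
    """Return True when more than half of the sensors report a face."""
    return 2 * sum(decisions) > len(decisions)

# Three hypothetical cascaded identifiers judging the same image window.
print(majority_vote([True, True, False]))   # two of three say "face"
print(majority_vote([True, False, False]))  # only one of three says "face"
```

With an odd number of sensors, such as the three cascaded identifiers mentioned above, a tie cannot occur.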
As a method of extracting a feature quantity for sensing a face, a feature called a rectangle feature has been proposed by Viola et al. in Non-Patent Literature 2. The rectangle feature of an image window is extracted by measuring the brightness difference between rectangular partial areas, each of which is defined by a rectangular filter.
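A brightness difference between two adjacent rectangular areas can be computed as in the sketch below. The use of an integral image (summed-area table) to obtain rectangle sums is an assumption of this sketch and is not stated in the description above; the function names and the left-minus-right filter layout are likewise illustrative.

```python
from typing import List, Sequence

def integral_image(img: Sequence[Sequence[int]]) -> List[List[int]]:
    """ii[y][x] holds the sum of img over rows < y and columns < x."""
    h, w = len(img), len(img[0])
    ii = [[0] * (w + 1) for _ in range(h + 1)]
    for y in range(h):
        row_sum = 0
        for x in range(w):
            row_sum += img[y][x]
            ii[y + 1][x + 1] = ii[y][x + 1] + row_sum
    return ii

def rect_sum(ii: Sequence[Sequence[int]], x: int, y: int,
             w: int, h: int) -> int:
    """Sum of brightness in the w-by-h rectangle whose top-left is (x, y)."""
    return ii[y + h][x + w] - ii[y][x + w] - ii[y + h][x] + ii[y][x]

def two_rect_feature(ii: Sequence[Sequence[int]], x: int, y: int,
                     w: int, h: int) -> int:
    """Left-half brightness minus right-half brightness (w must be even)."""
    half = w // 2
    return rect_sum(ii, x, y, half, h) - rect_sum(ii, x + half, y, half, h)

# Toy 2x4 image: bright left half, dark right half.
img = [[10, 10, 1, 1],
       [10, 10, 1, 1]]
ii = integral_image(img)
print(two_rect_feature(ii, 0, 0, 4, 2))
```

Once the table is built, each rectangle sum needs only four table lookups regardless of the rectangle size.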
As another feature quantity extracting method, the “Modified Census Transform” has been proposed in Non-Patent Literature 4. The feature quantity is extracted by converting each 3×3 pixel block in the input image into a binary pattern. The brightness value of each pixel in the block is compared with the average brightness value of the block. The pixel is labeled 1 if its brightness value is higher than the average value, and 0 otherwise. 9-bit information is thus obtained by arranging the labels of all pixels in the block in sequence, and this information is used as the value of the feature quantity.
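The labeling of a single 3×3 block described above can be sketched as follows. The function name `mct_value` and the bit order (row by row, with the top-left pixel as the most significant bit) are assumptions of this sketch; the description only states that the labels are arranged sequentially.

```python
from typing import Sequence

def mct_value(block: Sequence[Sequence[int]]) -> int:
    """Return the 9-bit feature value of one 3x3 brightness block."""
    pixels = [p for row in block for p in row]
    assert len(pixels) == 9, "expected a 3x3 block"
    total = sum(pixels)
    value = 0
    for p in pixels:
        # p > total / 9 is tested as 9 * p > total to stay in integers.
        bit = 1 if 9 * p > total else 0
        value = (value << 1) | bit
    return value

block = [[9, 1, 1],
         [1, 1, 1],
         [1, 1, 1]]
print(format(mct_value(block), "09b"))
```

Because a pixel is labeled 1 only when it is strictly brighter than the block average, a block of uniform brightness yields the value 0, and the all-ones pattern cannot occur.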
Patent Literature 1: US Patent Application Publication 2002/0102024 Specification
Non-Patent Literature 1: Yoav Freund, Robert E. Schapire, “A decision-theoretic generalization of on-line learning and an application to boosting”, Computational Learning Theory: Eurocolt '95, Springer-Verlag, 1995, p. 23-37
Non-Patent Literature 2: Paul Viola, Michael Jones, “Rapid Object Detection Using a Boosted Cascade of Simple Features”, IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), 2001 December, ISSN: 1063-6919, Vol. 1, p. 511-518
Non-Patent Literature 3: H. Rowley, S. Baluja, T. Kanade, “Neural Network-Based Face Detection”, IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), Vol. 20, No. 1, 1998 January, p. 23-28
Non-Patent Literature 4: Bernhard Froba, Andreas Ernst, “Face Detection with the Modified Census Transform”, Proceedings for Sixth IEEE International Conference on Automatic Face and Gesture Recognition (AFGR), 2004 May, p. 91-96