1. Field of the Invention
This invention relates to a detection apparatus and method for detecting, for example, a face image as a detection target image from an image in real time, a learning apparatus and method for learning data to be used by the detection apparatus, a weak hypothesis generation apparatus and method for generating a weak hypothesis in learning, and a robot apparatus equipped with the detection apparatus. This invention also relates to a facial expression recognition apparatus and method for recognizing a facial expression of a face image by detecting a specific expression from the face image, a facial expression learning apparatus and method for learning data to be used by the facial expression recognition apparatus, and a robot apparatus equipped with the facial expression recognition apparatus.
2. Description of the Related Art
Face-to-face communication is a real-time process operating at a time scale on the order of 40 milliseconds. The level of uncertainty in recognition at this time scale is extremely high, making it necessary for humans and machines to rely on sensory-rich perceptual primitives rather than slow symbolic inference processes. Thus, realizing machines that interact face to face with humans requires the development of robust, real-time perceptual primitives.
Charles Darwin was one of the first scientists to recognize that facial expression is one of the most powerful and immediate means for human beings to communicate their emotions, intentions, and opinions to each other. In addition to providing information about affective state, facial expressions also provide information about cognitive state such as interest, boredom, confusion, and stress, and conversational signals with information about speech emphasis and syntax. Recently, a number of groundbreaking systems have appeared in the computer vision literature for facial expression recognition. (See M. Pantic and L. J. M. Rothkrantz, Automatic analysis of facial expressions: The state of the art, IEEE Transactions on Pattern Analysis and Machine Intelligence, 22 (12): 1424-1445, 2000.)
To recognize facial expressions in real time, it is first necessary to detect a face area from an input image in real time. Conventionally, many face detection techniques have been proposed that detect faces in complicated image scenes by using only the variable density (gray-scale) patterns of the image signals, without using any motion. For example, a face detector described in the following non-patent reference 1 includes a cascade of classifiers, each of which contains a filter such as a Haar basis function as a discriminator. When the discriminators are generated by learning, high-speed learning is realized by using rectangle features and images called integral images, which will be described later.
FIGS. 1A to 1D are schematic views showing the rectangle features described in the following non-patent reference 1. In the technique described in non-patent reference 1, as shown in FIGS. 1A to 1D, plural filters (also referred to as rectangle features) are prepared for input images 200A to 200D. Each filter finds the sums of brightness values in adjacent rectangular boxes of the same size and outputs the difference between the sum of brightness values in one or plural rectangular boxes and the sum of brightness values in the other rectangular boxes. For example, FIG. 1A shows, in the input image 200A, a filter 201A that subtracts the sum of brightness values in a shaded rectangular box 201A-2 from the sum of brightness values in a rectangular box 201A-1. Such a filter including two rectangular boxes is called a two-rectangle feature. FIG. 1C shows, in the input image 200C, a filter 201C that is made up of three rectangular boxes 201C-1 to 201C-3 formed by dividing one rectangular box into three boxes, and that subtracts the sum of brightness values in the shaded central rectangular box 201C-2 from the sum of brightness values in the rectangular boxes 201C-1 and 201C-3. Such a filter including three rectangular boxes is called a three-rectangle feature. Moreover, FIG. 1D shows, in the input image 200D, a filter 201D that is made up of four rectangular boxes 201D-1 to 201D-4 formed by vertically and horizontally dividing one rectangular box into four boxes, and that subtracts the sum of brightness values in the shaded rectangular boxes 201D-2 and 201D-4 from the sum of brightness values in the rectangular boxes 201D-1 and 201D-3. Such a filter including four rectangular boxes is called a four-rectangle feature.
An example of judging whether the image shown in FIG. 2 is a face image by using the rectangle features described above will now be described. A two-rectangle feature (filter) 211B includes two rectangular boxes 211B-1 and 211B-2 formed by vertically bisecting one rectangular box, and subtracts the sum of brightness values in the shaded rectangular box 211B-1 from the sum of brightness values in the lower rectangular box 211B-2. By utilizing the fact that, in a human face image (detection target) 210, the brightness value is lower in the eye area than in the cheek area, whether an input image is a face image or not (correct or incorrect) can be estimated with a certain probability from the output value of the rectangle feature 211B.
The three-rectangle feature (filter) 211C is a filter that subtracts the sum of brightness values in the left and right rectangular boxes 211C-1 and 211C-3 from the sum of brightness values in the central rectangular box 211C-2. As in the above-described case, by utilizing the fact that, in the human face image 210, the brightness value is higher in the nose area than in both of the eye areas, whether an input image is a face image or not can be judged to a certain degree from the output value of the rectangle feature 211C.
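The two filters described above can be illustrated by a minimal sketch in Python. It is not taken from non-patent reference 1; the function names, the toy brightness values, and the box coordinates are all assumptions chosen for illustration, and the sums are computed by direct summation rather than by any accelerated method.

```python
def box_sum(image, top, left, height, width):
    """Sum of brightness values in a rectangular box (direct summation)."""
    return sum(
        image[r][c]
        for r in range(top, top + height)
        for c in range(left, left + width)
    )


def two_rectangle_feature(image, top, left, height, width):
    """Lower-half sum minus upper-half sum, in the manner of filter 211B."""
    half = height // 2
    upper = box_sum(image, top, left, half, width)
    lower = box_sum(image, top + half, left, half, width)
    return lower - upper


def three_rectangle_feature(image, top, left, height, width):
    """Central-third sum minus left and right thirds, in the manner of filter 211C."""
    third = width // 3
    left_box = box_sum(image, top, left, height, third)
    center = box_sum(image, top, left + third, height, third)
    right_box = box_sum(image, top, left + 2 * third, height, third)
    return center - (left_box + right_box)


# Hypothetical toy patch: dark upper rows (eye area) over bright lower rows (cheek area).
eye_cheek_patch = [
    [10, 10, 10, 10, 10, 10],
    [10, 10, 10, 10, 10, 10],
    [90, 90, 90, 90, 90, 90],
    [90, 90, 90, 90, 90, 90],
]
# Hypothetical toy patch: bright central columns (nose area) between dark columns (eye areas).
nose_eye_patch = [
    [10, 10, 90, 90, 10, 10],
    [10, 10, 90, 90, 10, 10],
]

eye_cheek_response = two_rectangle_feature(eye_cheek_patch, 0, 0, 4, 6)   # 1080 - 120 = 960
nose_eye_response = three_rectangle_feature(nose_eye_patch, 0, 0, 2, 6)   # 360 - 80 = 280
```

A large positive output in either case is consistent with the brightness relationship expected of a face, which is why the output value can serve as evidence for the judgment.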
At the time of detection, in order to detect face areas of various sizes included in an input image, it is necessary to cut out areas of various sizes (hereinafter referred to as search windows) for judgment. However, an input image made up of, for example, 320×240 pixels contains search windows of approximately 50,000 sizes, and arithmetic operations on all these window sizes are very time-consuming.
Thus, according to non-patent reference 1, images called integral images are used. An integral image is an image in which the pixel value at an arbitrary position is the sum of the brightness values in the rectangular box above and to the left of that position. It can be generated, for example, by proceeding sequentially from the upper left part of the image and, at each position, adding the pixel value of that position to the integral-image values of the positions immediately above and immediately to the left, and then subtracting the integral-image value of the position diagonally above and to the left, which would otherwise be counted twice. If the integral image is found in advance, the sum of brightness values in any rectangular box in the image can be calculated simply by adding or subtracting the integral-image values at the four corners of the rectangular box, and therefore the sum can be calculated at a high speed.
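The integral image and the four-corner calculation can be sketched as follows. This is a minimal illustration in Python under stated assumptions: the function names are hypothetical, and the table is padded with one extra row and column of zeros (a convenience not mentioned above) so that boxes touching the image border need no special handling.

```python
def integral_image(image):
    """Build a zero-padded table where table[r][c] is the sum of image[0:r][0:c]."""
    rows, cols = len(image), len(image[0])
    table = [[0] * (cols + 1) for _ in range(rows + 1)]
    for r in range(rows):
        for c in range(cols):
            table[r + 1][c + 1] = (
                image[r][c]
                + table[r][c + 1]   # running sum from the position above
                + table[r + 1][c]   # running sum from the position to the left
                - table[r][c]       # region counted twice by the two terms above
            )
    return table


def box_sum_from_integral(table, top, left, height, width):
    """Sum over a box using only the four corner values of the table."""
    bottom, right = top + height, left + width
    return (
        table[bottom][right]
        - table[top][right]
        - table[bottom][left]
        + table[top][left]
    )


image = [
    [1, 2, 3],
    [4, 5, 6],
    [7, 8, 9],
]
table = integral_image(image)
whole_image_sum = box_sum_from_integral(table, 0, 0, 3, 3)   # 45
lower_right_sum = box_sum_from_integral(table, 1, 1, 2, 2)   # 5 + 6 + 8 + 9 = 28
```

Once the table is built, each box sum costs four table lookups regardless of the box size, which is the source of the speed-up described above.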
Moreover, in non-patent reference 1, a strong discrimination machine is used as a face detection apparatus, which uses many training data (learning samples), sequentially generates discriminators based on the results of calculation using integral images, and discriminates whether an input image is a face image or not by weighted vote among the outputs of the many discriminators. FIG. 3 shows an essential part of the face detection apparatus described in non-patent reference 1. As shown in FIG. 3, all window images (subwindows) 241 cut out from an input image are inputted to the face detection apparatus. Then, each discriminator in turn outputs “correct” (=1) or “incorrect” (=−1). If the result of weighted addition of these outputs in accordance with the reliability (lowness of error rate) of each discriminator is a positive value, it is assumed that a face exists in the window image and face detection is carried out. Since the face detector includes many discriminators, it is time-consuming to cut out window images of different sizes from the input image and then take a weighted vote among the results of discrimination by all the discriminators with respect to all the window images. Thus, in non-patent reference 1, plural classifiers 240A, 240B, 240C, . . . are prepared, each of which has plural discriminators, and these plural classifiers are cascaded. Each of the classifiers 240A, 240B, 240C, . . . once judges whether a window image is a face image or not from its output. With respect to data 242A, 242B, 242C, . . . judged as non-face images, judgment processing is interrupted at this point and only data 2 judged as a face image by one classifier is supplied to the next-stage classifier. The plural discriminators constituting the next-stage classifier newly perform weighted addition and majority vote. As such processing is repeated, high-speed processing in face detection is realized.
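The weighted vote and the cascaded early rejection described above can be sketched as follows. This is only an illustrative sketch in Python, not the apparatus of FIG. 3: the reliabilities, thresholds, feature names, and the representation of a window image as a dictionary of precomputed filter outputs are all assumptions.

```python
def threshold_discriminator(feature_name, threshold):
    """Hypothetical discriminator: +1 ("correct") if a filter output exceeds a threshold, else -1."""
    def predict(window):
        return 1 if window[feature_name] > threshold else -1
    return predict


def weighted_vote(window, discriminators):
    """Weighted addition of +1/-1 outputs over (reliability, predict) pairs."""
    return sum(reliability * predict(window) for reliability, predict in discriminators)


def cascade_is_face(window, classifiers):
    """Pass the window through cascaded classifiers, stopping at the first rejection."""
    for discriminators in classifiers:
        if weighted_vote(window, discriminators) <= 0:
            return False  # judged non-face; later stages are never evaluated
    return True


# Hypothetical two-stage cascade; each stage is a list of (reliability, discriminator) pairs.
classifiers = [
    [(1.0, threshold_discriminator("eye_cheek", 100))],
    [
        (0.7, threshold_discriminator("eye_cheek", 500)),
        (0.5, threshold_discriminator("nose_eye", 200)),
    ],
]

face_like = {"eye_cheek": 960, "nose_eye": 280}
background = {"eye_cheek": 20, "nose_eye": 280}
borderline = {"eye_cheek": 300, "nose_eye": 50}

results = [cascade_is_face(w, classifiers) for w in (face_like, background, borderline)]
# results == [True, False, False]
```

In this sketch the background window is rejected by the first stage alone, so the second stage's discriminators are never evaluated for it; only windows that survive a stage incur the cost of the next one.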
Non-Patent Reference 1: Paul Viola and Michael Jones, Robust real-time object detection, Technical Report CRL 2001/01, Cambridge Research Laboratory, 2001.
However, with respect to the rectangle features described in the above-described non-patent reference 1, there are 160,000 or more possible filters to select from, depending on the number of pixels constituting the filters (sizes of filters) and the types of filters such as two-, three- and four-rectangle features, even if the target area (window image) is limited to, for example, a 24×24-pixel area. Therefore, in learning, the operation of selecting, from the 160,000 or more filters, the one filter that provides the minimum error rate on, for example, several hundred labeled training data, and thereby generating a discriminator, must be repeated as many times as there are discriminators taking part in the weighted vote, for example, several hundred times. As a result, an extremely large quantity of arithmetic operation is required and the learning processing is very time-consuming.
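The order of magnitude of this filter count can be checked with a short enumeration. The sketch below, in Python, rests on assumptions not stated above: it counts every position and every integer scaling of five base shapes (two-rectangle and three-rectangle features in both horizontal and vertical orientations, plus the four-rectangle feature) inside a 24×24-pixel window. The function name and the choice of base shapes are illustrative.

```python
def count_rectangle_filters(
    window=24,
    base_shapes=((2, 1), (1, 2), (3, 1), (1, 3), (2, 2)),
):
    """Count all placements of each base shape at every scale that fits in the window.

    Each base shape is (width, height) in units of one rectangular box, so a
    filter's width and height must be integer multiples of the base shape.
    """
    total = 0
    for base_w, base_h in base_shapes:
        for width in range(base_w, window + 1, base_w):
            for height in range(base_h, window + 1, base_h):
                # Number of top-left positions at which this size fits.
                total += (window - width + 1) * (window - height + 1)
    return total


filter_count = count_rectangle_filters()  # 162336
```

Under these assumptions the enumeration yields 162,336 filters, which agrees with the figure of 160,000 or more given above.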
Moreover, in the case of discriminating a face from an input image by using a final hypothesis made up of many weak hypotheses acquired by learning, making the discrimination with cascaded classifiers each made up of plural weak hypotheses, as described above, reduces the quantity of arithmetic operation compared with taking a weighted vote among the outputs of all the weak hypotheses, and the discrimination processing speed can be improved. However, since each classifier must still take a weighted vote in the same manner, the processing remains time-consuming.