Conventionally, in the fields of image recognition and speech recognition, an object to be recognized is detected by implementing a recognition processing algorithm specialized for that specific object either in computer software or in hardware that uses a dedicated parallel image processing processor.
In particular, several references have conventionally disclosed techniques for detecting a face, as a specific object to be recognized, from an image that includes the face (for example, see patent references 1 to 5).
According to one of these techniques, an input image is searched for a face region using a template called a standard face, and partial templates are then applied to feature point candidates such as the eyes, nostrils, mouth, and the like to authenticate a person. However, this technique is not robust to variations in face size or to changes in face direction, since the template is initially matched against the entire face to detect the face region. To solve this problem, a plurality of standard faces corresponding to different sizes and face directions must be prepared to perform detection. However, the template for the entire face is large, resulting in high processing cost.
According to another technique, eye and mouth candidate groups are obtained from a face image, and face candidate groups formed by combining these groups are collated with a pre-stored face structure to find regions corresponding to the eyes and mouth. This technique assumes an input image that contains only one or a few faces, in which each face is reasonably large, most of the image corresponds to a face, and the background region is small.
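The cost argument above can be illustrated with a minimal sketch of whole-face template matching repeated over several template sizes. This is not the method of the cited references: the normalized cross-correlation score, the nearest-neighbour resizing, and the names `ncc_map`, `resize_nearest`, `detect_face`, and `scales` are all assumptions introduced for illustration.

```python
import numpy as np

def ncc_map(image, template):
    """Normalized cross-correlation of the template at every valid offset."""
    th, tw = template.shape
    ih, iw = image.shape
    t = template - template.mean()
    t_norm = np.sqrt((t ** 2).sum())
    out = np.full((ih - th + 1, iw - tw + 1), -1.0)
    for y in range(out.shape[0]):
        for x in range(out.shape[1]):
            patch = image[y:y + th, x:x + tw]
            p = patch - patch.mean()
            denom = t_norm * np.sqrt((p ** 2).sum())
            if denom > 0:
                out[y, x] = (p * t).sum() / denom
    return out

def resize_nearest(img, scale):
    """Nearest-neighbour resize (hypothetical helper, kept dependency-free)."""
    h = max(1, int(round(img.shape[0] * scale)))
    w = max(1, int(round(img.shape[1] * scale)))
    rows = np.minimum((np.arange(h) / scale).astype(int), img.shape[0] - 1)
    cols = np.minimum((np.arange(w) / scale).astype(int), img.shape[1] - 1)
    return img[np.ix_(rows, cols)]

def detect_face(image, standard_face, scales=(0.5, 1.0, 2.0)):
    """Match one standard-face template per assumed scale; keep the best hit.

    Returns (score, (row, col), scale). Every extra scale (or face direction)
    adds another full scan of the image, which is the cost noted in the text.
    """
    best = (-1.0, None, None)
    for s in scales:
        tmpl = resize_nearest(standard_face, s)
        if tmpl.shape[0] > image.shape[0] or tmpl.shape[1] > image.shape[1]:
            continue  # template larger than the image: nothing to scan
        scores = ncc_map(image, tmpl)
        y, x = np.unravel_index(np.argmax(scores), scores.shape)
        if scores[y, x] > best[0]:
            best = (float(scores[y, x]), (int(y), int(x)), s)
    return best
```

Each entry in `scales` triggers a complete scan with a full-size face template, so supporting more sizes and directions multiplies the work roughly in proportion to the number of templates.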
According to still another technique, a plurality of eye, nose, and mouth candidates are obtained, and a face is detected on the basis of the positional relationship among feature points, which are prepared in advance.
According to still another technique, when the matching levels between shape data of respective parts of a face and an input image are checked, the shape data are changed, and the search regions of the respective face parts are determined based on the previously obtained positional relationship of the parts. With this technique, shape data of an iris, mouth, nose, and the like are held. Two irises are obtained first, and the search regions of face parts such as the mouth, nose, and the like are then limited on the basis of the positions of the irises. That is, instead of detecting in parallel the face parts that form a face, such as the irises (eyes), mouth, nose, and the like, this algorithm finds the irises (eyes) first and detects face parts such as the mouth and nose using the detection result of the irises. This method assumes a case wherein an image includes only one face and the irises are accurately obtained. If the irises are erroneously detected, the search regions of other features such as the mouth, nose, and the like cannot be set correctly.
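The dependence of the later search regions on the iris result can be sketched as follows. The proportions `drop`, `width`, and `height` and the function name `mouth_search_region` are hypothetical values chosen for illustration, not figures from the cited reference, and the sketch assumes an upright face.

```python
import math

def mouth_search_region(left_iris, right_iris, drop=1.0, width=1.0, height=0.5):
    """Derive a mouth search box (x0, y0, x1, y1) from two iris positions.

    Assumed layout: the box is centred below the midpoint of the irises,
    and all sizes are expressed as multiples of the inter-iris distance.
    """
    d = math.dist(left_iris, right_iris)          # inter-iris distance
    cx = (left_iris[0] + right_iris[0]) / 2       # midpoint between irises
    cy = (left_iris[1] + right_iris[1]) / 2 + drop * d
    return (cx - width * d / 2, cy - height * d / 2,
            cx + width * d / 2, cy + height * d / 2)
```

With irises at (40, 50) and (80, 50) the box spans x from 40 to 80 and y from 80 to 100. If one iris is misdetected, for example at (80, 90), the box moves and changes size, so the mouth search starts from the wrong place.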
According to still another technique, a region model set with a plurality of determination element acquisition regions is moved in an input image to determine the presence/absence of each determination element within each of these determination element acquisition regions, thus recognizing a face. In this technique, in order to cope with faces with different sizes or rotated faces, region models with different sizes and rotated region models must be prepared. If a face with a given size or a given rotation angle is not present in practice, many wasteful calculations are made.
Some methods of recognizing an expression of a face in an image have been conventionally proposed (for example, see non-patent references 1 and 2).
One of these techniques is premised on the assumption that partial regions of a face are accurately extracted from a frame image by visual inspection. In another technique, rough positioning of a face pattern is automated, but positioning of feature points requires fine adjustment by visual inspection. In still another technique (for example, see patent reference 6), expression elements are converted into codes using muscle actions, a neural system connection relationship, and the like, thus determining an emotion. However, with this technique, the regions of parts required to recognize an expression are fixed, so regions required for recognition are likely to be excluded or unwanted regions are likely to be included, which adversely affects the recognition precision of the expression.
In addition, a system has been studied that recognizes an expression by detecting changes corresponding to Action Units of FACS (Facial Action Coding System), which is known as a method of objectively describing facial actions.
In still another technique (for example, see patent reference 7), an expression is estimated in real time to deform a three-dimensional (3D) face model, thus reconstructing the expression. With this technique, a face is detected based on a difference image between an input image which includes a face region and a background image which does not include any face region, and a chromaticity value indicating a flesh color, and the detected face region is then binarized to detect the contour of the face. The positions of eyes and a mouth are obtained from the region within the contour, and a rotation angle of the face is calculated based on the positions of the eyes and mouth to apply rotation correction. After that, two-dimensional (2D) discrete cosine transforms are calculated to estimate an expression. The 3D face model is converted based on a change amount of a spatial frequency component, thereby reconstructing the expression. However, detection of flesh color is susceptible to variations of illumination and the background. For this reason, in this technique, non-detection or erroneous detection of an object is more likely to occur in the first flesh color extraction process.
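The rotation correction step in the pipeline above can be sketched as follows. This is a simplified illustration that assumes the in-plane angle is taken from the line joining the two eyes only; how the cited reference combines the eye and mouth positions is not stated here, and the function names are hypothetical.

```python
import math

def roll_angle(left_eye, right_eye):
    """In-plane rotation (radians) of the line joining the eyes."""
    dx = right_eye[0] - left_eye[0]
    dy = right_eye[1] - left_eye[1]
    return math.atan2(dy, dx)

def rotate_point(p, center, angle):
    """Rotate point p about center by angle (radians)."""
    c, s = math.cos(angle), math.sin(angle)
    x, y = p[0] - center[0], p[1] - center[1]
    return (center[0] + c * x - s * y, center[1] + s * x + c * y)

def correct_rotation(left_eye, right_eye, mouth):
    """Return the estimated angle and the feature points after undoing it."""
    angle = roll_angle(left_eye, right_eye)
    center = ((left_eye[0] + right_eye[0]) / 2,
              (left_eye[1] + right_eye[1]) / 2)
    points = [rotate_point(p, center, -angle)
              for p in (left_eye, right_eye, mouth)]
    return angle, points
```

After correction the two eyes lie on the same horizontal line, so later steps such as the 2D discrete cosine transform see the face in a consistent orientation. Any error in the eye positions, for example one caused by a failed flesh color extraction, propagates directly into the angle.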
As a method of identifying a person based on a face image, the Eigenface method (Turk et al.) is well known (for example, see non-patent references 3 and 4). With this method, principal component analysis is applied to a set of density value vectors of many face images to calculate orthonormal bases called eigenfaces, and the Karhunen-Loeve expansion is applied to the density value vector of an input face image to obtain a dimension-compressed face pattern. The dimension-compressed pattern is used as a feature vector for identification.
As one method of actually identifying a person using the feature vector for identification, the above reference presents a method of calculating the distances between the dimension-compressed face pattern of an input image and the stored dimension-compressed face patterns of respective persons, and identifying the class to which the pattern with the shortest distance belongs as the class to which the input face image belongs, i.e., as the person. However, this method basically takes a corrected image as its input: the position of a face in an image is first detected using an arbitrary method, and the face region then undergoes size normalization and rotation correction to obtain the face image.
An image processing method that can recognize a face in real time has been disclosed as prior art (for example, see patent reference 8). In this method, an arbitrary region is extracted from an input image, and it is checked whether that region corresponds to a face region. If the region is a face region, the face image, after undergoing affine transformation and contrast correction, is matched against faces that have already been registered in a learning database to estimate the probabilities that it shows the same person. Based on these probabilities, the registered person who is most likely to be the same as the person in the input face is output.
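The Eigenface pipeline described above (principal component analysis, projection, shortest-distance classification) can be sketched with NumPy as follows. The function names, the use of singular value decomposition to obtain the principal components, and the Euclidean distance are assumptions made for illustration; the sketch also assumes the input faces are already position-, size-, and rotation-normalized, as the text notes.

```python
import numpy as np

def fit_eigenfaces(faces, n_components):
    """faces: (n_samples, n_pixels) array of density value vectors.

    Returns the mean face and the top principal directions ("eigenfaces"),
    which form an orthonormal basis.
    """
    mean = faces.mean(axis=0)
    centered = faces - mean
    # Rows of vt are orthonormal principal directions, sorted by variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return mean, vt[:n_components]

def project(face, mean, eigenfaces):
    """Karhunen-Loeve expansion: coefficients of the face in the eigenface basis."""
    return eigenfaces @ (face - mean)

def identify(face, mean, eigenfaces, gallery, labels):
    """Return the label of the stored pattern closest to the input pattern."""
    q = project(face, mean, eigenfaces)
    dists = np.linalg.norm(gallery - q, axis=1)
    return labels[int(np.argmin(dists))]
```

Here `gallery` holds one dimension-compressed pattern per registered image (one row each, produced by `project`) and `labels` gives the person for each row. Because the comparison is a plain distance between projected pixel vectors, a face that is shifted, scaled, or rotated relative to the training images projects to a different point, which is why the corrected input image is required.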
As one conventional expression recognition apparatus, a technique for determining an emotion from an expression has been disclosed (for example, see patent reference 6). An emotion here normally means a feeling such as anger, grief, and the like. The above technique provides the following method. That is, predetermined expression elements are extracted from respective features of a face on the basis of relevant rules, and expression element information is extracted from the predetermined expression elements. Note that the expression elements indicate an open/close action of an eye, an action of a brow, an action of the forehead, an up/down action of the lips, an open/close action of the lips, and an up/down action of the lower lip. The expression element for a brow action includes a plurality of pieces of expression element information such as the slope of the left brow, that of the right brow, and the like.
An expression element code that quantifies the expression element is calculated from the plurality of pieces of expression element information that form the obtained expression element on the basis of predetermined expression element quantization rules. Furthermore, an emotion amount is calculated for each emotion category from the predetermined expression element code determined for each emotion category using a predetermined emotion conversion formula. Then, the emotion category with the maximum emotion amount is determined as the emotion.
The shapes and lengths of respective facial features differ greatly from person to person. For example, some persons have eyes that slant downward toward the outside, narrow eyes, and so forth even in an emotionless (sober-face) image; to an observer such persons may appear joyful, although they are simply keeping a straight face. Furthermore, faces in images do not always have a constant size and direction. When the face size varies or the face rotates, the required feature amounts must be normalized in accordance with the variation in face size or face rotation.
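The final two steps above, computing an emotion amount per category and taking the maximum, can be sketched as follows. The weighted-sum conversion formula, the element names, and all numeric values are hypothetical stand-ins; the actual quantization rules and conversion formula of the cited reference are not given here.

```python
def emotion_amounts(codes, weights):
    """Compute one emotion amount per category.

    codes:   {expression element name: expression element code}
    weights: {emotion category: {expression element name: weight}}
    The weighted sum is an assumed stand-in for the emotion conversion formula.
    """
    return {
        emotion: sum(w * codes.get(name, 0.0) for name, w in element_weights.items())
        for emotion, element_weights in weights.items()
    }

def determine_emotion(codes, weights):
    """Return the emotion category with the maximum emotion amount."""
    amounts = emotion_amounts(codes, weights)
    return max(amounts, key=amounts.get)

# Hypothetical example values, for illustration only.
example_codes = {"brow": 2.0, "lips_open": 3.0}
example_weights = {
    "surprise": {"brow": 1.0, "lips_open": 1.0},
    "anger": {"brow": 1.5, "lips_open": -0.5},
}
```

With these example values the amounts are 5.0 for surprise and 1.5 for anger, so `determine_emotion(example_codes, example_weights)` selects surprise. Because the decision depends only on the element codes, an open mouth caused by speech contributes exactly as an open mouth caused by surprise does.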
When time-series images of everyday scenes are used as input images, they include not only expression scenes and non-expression scenes showing a sober face but also non-expression scenes such as conversation scenes. In such a case, non-expression scenes may be erroneously determined as expression scenes: for example, the pronunciation of “o” in a conversation scene resembles an expression of surprise, and the pronunciations of “i” and “e” resemble expressions of joy.
Patent reference 1: Japanese Patent Laid-Open No. 9-251534
Patent reference 2: Japanese Patent No. 2767814
Patent reference 3: Japanese Patent Laid-Open No. 9-44676
Patent reference 4: Japanese Patent No. 2973676
Patent reference 5: Japanese Patent Laid-Open No. 11-283036
Patent reference 6: Japanese Patent No. 2573126
Patent reference 7: Japanese Patent No. 3062181
Patent reference 8: Japanese Patent Laid-Open No. 2003-271958
Non-patent reference 1: G. Donato, T. J. Sejnowski, et al., “Classifying Facial Actions”, IEEE Trans. PAMI, vol. 21, no. 10, October 1999
Non-patent reference 2: Y. Tian, T. Kanade, and J. F. Cohn, “Recognizing Action Units for Facial Expression Analysis”, IEEE Trans. PAMI, vol. 23, no. 2, February 2001
Non-patent reference 3: Shigeru Akamatsu, “Computer Facial Recognition—Survey—”, the Journal of IEICE, vol. 80, no. 8, pp. 2031-2046, August 1997
Non-patent reference 4: M. Turk, A. Pentland, “Eigenfaces for recognition”, J. Cognitive Neurosci., vol. 3, no. 1, pp. 71-86, March 1991