Much recent research in the computer vision field concerns the ability to find “regularly configured objects” (also referred to hereafter by the shorter term “objects”) in digital images. Objects can be defined as classes of real-world entities that obey consistent rules with respect to geometric shape and appearance. Examples of objects include human faces, whole people, cars, animals, and buildings.
The ability to computationally find objects in images varies directly with the extrinsic and intrinsic consistency of the class appearance. “Extrinsic consistency” refers to factors, such as lighting and pose, that can vary dramatically and are independent of the physical properties of the objects. “Intrinsic consistency” refers to factors due to the physical properties of the objects. Clearly, some types of objects present more variation in their appearance than others. For example, the class of human faces obeys fairly tight geometric constraints on overall physical shape and the relative placement of component parts, such as eyes, nose, and mouth. The class of whole people, on the other hand, exhibits far more variation in appearance due to the articulated nature of the limbs and the bending abilities of the human body.
Past applications of object detection have tended to focus on tightly-defined object classes with limited appearance variability, such as the set of human faces. Successful object detectors, including face detectors, have been constructed and commercially deployed. One common weakness of such detectors is that they can deal with only a limited range of object poses or viewpoints.
There can be many sources of variation in the appearance of human faces in images: personal identity, pose, illumination, deformation, and imaging process parameters, to name the most important. Of the sources mentioned, a change in individual identity may contribute less to the change in appearance of an imaged face than do the other factors. This statement is true when using almost any non-cognitive measure of appearance similarity. It may seem surprising, and even counter-intuitive, in light of the great facility with which the human observer identifies individual persons. Such facility might seem to imply that there are substantial invariant aspects of the appearance of individual persons over disparate viewing conditions. But this conclusion is not true. The situation here resembles to some extent the cognitive process of color constancy. There, highly sophisticated physiological and psychological mechanisms of eye, retina, and brain combine to create the perceptual illusion of color constancy, an illusion that can be easily dispelled with a roll of color film and some variety of illuminants. (The film serves the role of a much more primitive imaging system than that of the human observer.) Similarly here, in the domain of facial appearance, there is widespread physiological evidence that exceedingly sophisticated visual and mental processes, including tailored neuronal support, necessarily underlie the seemingly effortless recognition of human individuals.
The situation is otherwise when considering quantitative (i.e. mathematical) measures of appearance similarity of faces. The effects of illumination and pose can produce much greater changes to the appearance of faces than are caused by identity difference. The difference referenced here is that of a quantitative measure, such as mean square error, Mahalanobis distance, or the like.
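The two measures named above can be sketched as follows. This is a minimal illustration only: the function names are hypothetical, the Mahalanobis distance assumes a diagonal covariance for brevity, and the pixel values are made-up toy numbers rather than measurements.

```python
import math

def mean_square_error(a, b):
    """Mean of the squared element-wise differences of two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def mahalanobis_diag(x, mean, variances):
    """Mahalanobis distance of x from mean, assuming a diagonal covariance
    whose diagonal entries are given in `variances` (a simplifying assumption)."""
    return math.sqrt(sum((xi - mi) ** 2 / vi
                         for xi, mi, vi in zip(x, mean, variances)))

# Toy pixel vectors (hypothetical values, for illustration only).
face_a = [100.0, 120.0, 90.0, 110.0]           # person A, nominal lighting
face_a_dark = [0.5 * v for v in face_a]        # person A, half the illumination
face_b = [105.0, 115.0, 95.0, 108.0]           # person B, nominal lighting

mse_illumination = mean_square_error(face_a, face_a_dark)  # 2787.5
mse_identity = mean_square_error(face_a, face_b)           # 19.75
```

In this toy case the illumination change produces a far larger mean square error than the identity change, which is the kind of behavior the paragraph above describes.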
In many applications of face detection to commercial tasks, finding the faces serves as an enabling technology to subsequent processing steps. Depending on the speed requirements of that processing, insufficient speed in the face detection portion of the processing may render the overall task unsuccessful. For this reason, many recent approaches have concentrated on producing algorithms that operate in the most rapid possible manner. These approaches tend to have the disadvantage of a large false positive rate, particularly when out-of-plane rotation is present. A further problem of many of the above approaches is dependence upon correct orthogonal orientation of images prior to object location. An example is face location limited to a nominal upright image orientation and small ranges of rotation away from the nominal orientation. Possible ameliorative measures, such as multiple passes, additional training, and use of multiple classifiers, add complexity and require additional time.
To handle in-plane rotation without repeatedly applying a classifier trained for upright frontal faces, and hence without incurring the attendant performance penalties, some researchers have prefaced their classifiers with a module that determines the in-plane rotational orientation of a face candidate in a window. These approaches have been referred to as “invariance-based methods”, since the face detector examines image data that has been rotated so that the purported face appears in the nominal upright position.
In Jeon, B., Lee, S., and Lee, K., “Face Detection using the 1st-order RCE Classifier”, Proc. IEEE Int'l Conf. Image Processing, 2002, an orientation module acts on test windows before examination by the main classifier. The module estimates the most likely orientation of a face, should a face happen to be present, and returns a meaningless indication otherwise. The orientation module first binarizes a test window using an adaptive threshold, and then searches for the best-fit rotational orientation of the purported eye line (judging the eyes to be the most reliable facial feature) in one-degree increments. The subsequent classifier, based on first-order reduced Coulomb energy, examines the test window after it has been explicitly rotated according to the estimate of the orientation module.
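The general shape of such an orientation search (binarize, then try each candidate angle in one-degree steps) can be sketched as follows. This is not the method of the cited publication: the threshold used here (the window mean) and the fit score (vertical spread of the dark pixels after de-rotation) are assumptions chosen only to make the sketch short and runnable, and all names are hypothetical.

```python
import math

def dark_pixels(window):
    """Binarize: return (x, y) of pixels darker than the window mean.
    The mean is a simple stand-in for an adaptive threshold (assumption)."""
    flat = [p for row in window for p in row]
    threshold = sum(flat) / len(flat)
    return [(x, y) for y, row in enumerate(window)
            for x, p in enumerate(row) if p < threshold]

def estimate_orientation(window):
    """Return the angle in degrees (image coordinates, y pointing down) at which
    the dark pixels lie closest to a single line, searched in one-degree steps.
    Returns None when the window contains no dark pixels."""
    dark = dark_pixels(window)
    if not dark:
        return None
    cx = sum(x for x, _ in dark) / len(dark)
    cy = sum(y for _, y in dark) / len(dark)
    best_angle, best_spread = None, float("inf")
    for angle in range(-90, 90):  # a line's direction is defined modulo 180 degrees
        s = math.sin(math.radians(angle))
        c = math.cos(math.radians(angle))
        # Squared vertical offsets of the dark pixels after rotating by -angle.
        spread = sum((-(x - cx) * s + (y - cy) * c) ** 2 for x, y in dark)
        if spread < best_spread:
            best_angle, best_spread = angle, spread
    return best_angle

def synthetic_window(size, eyes):
    """Bright square window with single dark pixels at the given (x, y) positions."""
    window = [[200] * size for _ in range(size)]
    for x, y in eyes:
        window[y][x] = 20
    return window
```

A downstream classifier would then examine the window rotated by the negative of the returned angle. The search costs one pass over the dark pixels per candidate angle.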
Similarly, in Rowley, H., Baluja, S., and Kanade, T., “Rotation Invariant Neural Network-Based Face Detection”, Proc. IEEE Conf. Computer Vision and Pattern Recognition, 1998, 38-44, a neural-network based rotation-estimation module is provided ahead of the main classifier, which is itself implemented with a receptive-field neural network architecture.
The approaches of these publications carry the risk that the time cost of the rotation estimation would approach that of the face classification decision, leading to a substantial speed loss in the overall detection system. An additional shortcoming is that these approaches are not extended to situations in which both in-plane and out-of-plane rotation are substantial.
Use of integral images in fast computational methods is disclosed in Viola, P. and Jones, M., “Robust Real-Time Object Recognition”, Proc. Second International Workshop on Statistical and Computational Theories of Vision - Modeling, Learning, Computing, and Sampling, 2001.
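For orientation, an integral image (summed-area table) can be sketched as below. The function names and the padded table layout are choices made for this illustration and are not taken from the cited publication.

```python
def integral_image(img):
    """Build a (h+1) x (w+1) summed-area table for a 2-D list of numbers.
    Entry ii[y][x] holds the sum of all img[r][c] with r < y and c < x."""
    h, w = len(img), len(img[0])
    ii = [[0] * (w + 1) for _ in range(h + 1)]
    for y in range(h):
        row_sum = 0
        for x in range(w):
            row_sum += img[y][x]
            ii[y + 1][x + 1] = ii[y][x + 1] + row_sum
    return ii

def box_sum(ii, top, left, bottom, right):
    """Sum of the pixels in rows top..bottom-1 and columns left..right-1,
    obtained from four table lookups regardless of the box size."""
    return ii[bottom][right] - ii[top][right] - ii[bottom][left] + ii[top][left]

image = [[1, 2, 3],
         [4, 5, 6],
         [7, 8, 9]]
table = integral_image(image)
lower_right_2x2 = box_sum(table, 1, 1, 3, 3)  # 5 + 6 + 8 + 9 = 28
```

Once the table is built in a single pass, the sum over any rectangle costs a constant number of operations, which is what makes rectangle-based features fast to evaluate.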
Dimensionality reduction has been widely applied in machine learning as a solution to some types of over-training, as well as in many other data processing tasks such as noise reduction, data modeling, and data transformation.
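One common form of dimensionality reduction is projection onto the dominant direction of variation in the data (the first principal component). The sketch below is an assumption about what such a step might look like, not a method stated in this document; it uses power iteration on the sample covariance matrix, and the names and toy data are hypothetical.

```python
import math

def first_principal_component(rows, iterations=100):
    """Return (means, direction), where direction is a unit vector along the
    dominant axis of variation, estimated by power iteration."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    # Sample covariance matrix (d x d).
    cov = [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
           for i in range(d)]
    # Arbitrary start vector. A start exactly orthogonal to the dominant
    # direction would not converge; this sketch does not handle that case.
    v = [1.0 / (j + 1) for j in range(d)]
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break
        v = [x / norm for x in w]
    return means, v

def project(rows, means, direction):
    """Reduce each row to one number: its coordinate along `direction`."""
    return [sum((r[j] - means[j]) * direction[j] for j in range(len(direction)))
            for r in rows]

# Toy 2-D points lying on the line y = x (hypothetical data).
points = [[1.0, 1.0], [2.0, 2.0], [3.0, 3.0], [4.0, 4.0]]
means, direction = first_principal_component(points)
scores = project(points, means, direction)
```

Here each two-dimensional point is replaced by a single score with no loss of information, because the toy data vary along one direction only.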
Solutions to the problem of estimating probability distributions in data are well known and can be divided broadly into three categories: parametric, non-parametric, and a middle-ground semi-parametric category.
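The first two categories can be contrasted with a minimal one-dimensional sketch. The parametric estimate below assumes a single Gaussian, and the non-parametric estimate is a Gaussian kernel density estimate with a caller-chosen bandwidth; both are illustrative choices with hypothetical names. A semi-parametric estimate, such as a mixture of a few Gaussians, would sit between the two and is not shown.

```python
import math

def gaussian_pdf(x, mean, std):
    """Density of a normal distribution with the given mean and standard deviation."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))

def fit_gaussian(samples):
    """Parametric estimate: summarize the data by a mean and a standard
    deviation (maximum-likelihood form, dividing by n)."""
    n = len(samples)
    mean = sum(samples) / n
    variance = sum((s - mean) ** 2 for s in samples) / n
    return mean, math.sqrt(variance)

def kernel_density(x, samples, bandwidth):
    """Non-parametric estimate: average of Gaussian bumps centered on every sample."""
    return sum(gaussian_pdf(x, s, bandwidth) for s in samples) / len(samples)

samples = [-2.0, -1.0, 0.0, 1.0, 2.0]   # toy data, for illustration only
mean, std = fit_gaussian(samples)
parametric_at_zero = gaussian_pdf(0.0, mean, std)
nonparametric_at_zero = kernel_density(0.0, samples, bandwidth=0.5)
```

The parametric model stores two numbers regardless of the amount of data, whereas the kernel estimate keeps every sample and must visit all of them for each evaluation.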
Many different kinds of face detectors are known in the computer vision literature. Common methods involve neural-network based detectors (Rowley, H., Baluja, S., and Kanade, T., “Rotation Invariant Neural Network-Based Face Detection”, Proc. IEEE Conf. Computer Vision and Pattern Recognition, 1998, 38-44), domain-division cascaded classifiers (Viola, P. and Jones, M., “Robust Real-Time Object Recognition”, Proc. Second International Workshop on Statistical and Computational Theories of Vision - Modeling, Learning, Computing, and Sampling, 2001), and heterogeneous cascaded classifiers (Feraud, Raphael, et al., “A Fast and Accurate Face Detector Based on Neural Networks”, IEEE Trans. Pattern Analysis and Machine Intelligence, 23(1), 42-53). Another example of an object detector that can be used as a face detector is provided in Schneiderman, H., and Kanade, T., “Probabilistic Modeling of Local Appearance and Spatial Relationships for Object Recognition”, Proc. CVPR 1998, 45-51.
It would thus be desirable to provide improved methods, computer systems, and computer program products, in which the most likely in-plane orientation of objects can be rapidly estimated and in which substantial out-of-plane rotation is tolerated.