1. Field of the Invention
The present invention broadly relates to image processing and image recognition, and more particularly, to a system and method for detecting the presence of 3D (three dimensional) objects in a 2D (two dimensional) image containing a 2D representation of the 3D objects.
2. Description of the Related Art
Object recognition is the problem of using computers to automatically locate objects in images, where an object can be any type of three dimensional physical entity such as a human face, automobile, airplane, etc. Object detection involves locating any object that belongs to a category such as the class of human faces, automobiles, etc. For example, a face detector would attempt to find all human faces in a photograph, but would not make finer distinctions such as identifying each face.
The challenge in object detection is coping with all the variations that can exist within a class of objects and the variations in visual appearance. FIG. 1A illustrates a picture slide 10 showing intra-class variations for human faces and cars. For example, cars vary in shape, size, coloring, and in small details such as the headlights, grill, and tires. Similarly, the class of human faces may contain human faces for males and females, young and old, bespectacled with plain eyeglasses or with sunglasses, etc. Also, the visual expression of a face may be different from human to human. One face may appear jovial whereas the other one may appear sad and gloomy. Visual appearance also depends on the surrounding environment and lighting conditions as illustrated by the picture slide 12 in FIG. 1B. Light sources will vary in their intensity, color, and location with respect to the object. Nearby objects may cast shadows on the object or reflect additional light on the object. Furthermore, the appearance of the object also depends on its pose; that is, its position and orientation with respect to the camera. FIG. 1C shows a picture slide 14 illustrating geometric variation among human faces. A person's race, age, gender, ethnicity, etc., may play a dominant role in defining the person's facial features. A side view of a human face will look much different than a frontal view.
Therefore, a computer-based object detector must accommodate all this variation and still distinguish the object from any other pattern that may occur in the visual world. For example, a human face detector must be able to find faces regardless of facial expression, variation from person to person, or variation in lighting and shadowing. Most methods for object detection use statistical modeling to represent this variability. Statistics is a natural way to describe a quantity that is not fixed or deterministic such as a human face. The statistical approach is also versatile. The same statistical model can potentially be used to build object detectors for different objects without re-programming.
Prior success in object detection has been limited to frontal face detection. Little success has been reported in detection of side (profile) views of faces or of other objects such as cars. Prior methods for frontal face detection include methods described in the following publications: (1) U.S. Pat. No. 5,642,431, titled "Network-based System And Method For Detection of Faces And The Like", issued on Jun. 24, 1997 to Poggio et al.; (2) U.S. Pat. No. 5,710,833, titled "Detection Recognition And Coding of Complex Objects Using Probabilistic Eigenspace Analysis", issued on Jan. 20, 1998 to Moghaddam et al.; (3) U.S. Pat. No. 6,128,397, titled "Method For Finding All Frontal Faces In Arbitrarily Complex Visual Scenes", issued on Oct. 3, 2000 to Baluja et al.; (4) Henry A. Rowley, Shumeet Baluja, and Takeo Kanade, "Neural Network-Based Face Detection", IEEE Transactions on Pattern Analysis and Machine Intelligence, 20:1, January 1998, pp. 23-28; (5) Edgar Osuna, Robert Freund, and Federico Girosi, "Training Support Vector Machines: An Application To Face Detection", Conference on Computer Vision and Pattern Recognition, 1997, pp. 130-136; (6) M. C. Burl and P. Perona, "Recognition of Planar Object Classes", Conference on Computer Vision and Pattern Recognition, 1996, pp. 223-230; (7) H. Schneiderman and T. Kanade, "Probabilistic Modeling of Local Appearance and Spatial Relationships for Object Recognition", Conference on Computer Vision and Pattern Recognition, 1998, pp. 45-51; (8) L. Wiskott, J-M Fellous, N. Kruger, C. v. d. Malsburg, "Face Recognition by Elastic Bunch Matching", IEEE Transactions on Pattern Analysis and Machine Intelligence, 19:7, 1997, pp. 775-779; and (9) D. Roth, M-H Yang, and N. Ahuja, "A SnoW-Based Face Detector", NIPS-12 (Neural Information Processing Systems), 1999.
The methods discussed in publications (1) through (9) mentioned above differ primarily in the statistical model they use. The method of publication (1) represents object appearance by several prototypes consisting of a mean and a covariance about the mean. The method in publication (5) consists of a quadratic classifier. Such a classifier is mathematically equivalent to representation of each class by its mean and covariance. These methods as well as that of publication (2) emphasize statistical relationships over the full extent of the object. As a consequence, they compromise the ability to represent small areas in a rich and detailed way. The methods discussed in publications (3) and (4) address this limitation by decomposing the model in terms of smaller regions. The methods in publications (3) and (4) represent appearance in terms of approximately 100 inner products with portions of the image. Finally, the method discussed in publication (9) decomposes appearance further into a sum of independent models for each pixel.
However, the above methods are limited in that they represent the geometry of the object as a fixed rigid structure. These methods are also limited in their ability to accommodate differences in the relative distances between various features of a human face such as the eyes, nose, and mouth. Not only can these distances vary from person to person, but their projections into the image can vary with the viewing angle of the face. For this reason, these methods tend to fail for faces that are not fully frontal in posture. This limitation is addressed by the publications (6) and (8), which allow for small amounts of variation among small groups of hand-picked features such as the eyes, nose, and mouth. However, by using a small set of hand-picked features these representations have limited power. The method discussed in publication (7) allows for geometric flexibility with a more powerful representation by using richer features (each takes on a large set of values) sampled at regular positions across the full extent of the object. Each feature measurement is treated as statistically independent of all others. The disadvantage of this approach is that any relationship not explicitly represented by one of the features is not represented. Therefore, performance depends critically on the quality of the feature choices.
Finally, all of the above methods are structured such that the entire statistical model must be evaluated against the input image to determine if the object is present. This can be time consuming and inefficient. In particular, since the object can appear at any position and any size within the image, a detection decision must be made for every combination of possible object position and size within an image. It is therefore desirable to detect a 3D object in a 2D image over a wide range of variation in object location, orientation, and appearance. It is also desirable to perform the object detection in a computationally advantageous manner so as to conserve time and computing resources.
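The cost of making a detection decision for every combination of position and size can be made concrete with a short sketch. The example below is illustrative only: it assumes a fixed-size window scanned over successively shrunken copies of the image, and the function name, the scale step, and the stride are hypothetical choices that are not taken from the text above.

```python
def count_windows(img_w, img_h, win_w, win_h, scale_step=1.25, stride=1):
    """Count window placements when a fixed-size window scans a shrinking image.

    scale_step and stride are assumed values chosen for illustration.
    """
    total = 0
    scale = 1.0
    while True:
        # Shrink the image by the current scale; the window size stays fixed,
        # so a smaller image corresponds to searching for a larger object.
        w = int(img_w / scale)
        h = int(img_h / scale)
        if w < win_w or h < win_h:
            break
        nx = (w - win_w) // stride + 1  # horizontal placements at this scale
        ny = (h - win_h) // stride + 1  # vertical placements at this scale
        total += nx * ny
        scale *= scale_step
    return total


if __name__ == "__main__":
    # Number of full-model evaluations a naive detector would need.
    print(count_windows(640, 480, 64, 64))
```

Because every placement would otherwise require evaluating the entire statistical model, the count returned here is a rough proxy for the work a naive exhaustive search must perform.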
In one embodiment, the present invention contemplates a method to detect presence of a 3D (three dimensional) object in a 2D (two dimensional) image containing a 2D representation of the 3D object. The method comprises receiving a digitized version of the 2D image; selecting one or more view-based detectors; for each view-based detector, computing a wavelet transform of the digitized version of the 2D image, wherein the wavelet transform generates a plurality of transform coefficients, and wherein each transform coefficient represents visual information from the 2D image that is localized in space, frequency, and orientation; applying the one or more view-based detectors in parallel to respective plurality of transform coefficients, wherein each view-based detector is configured to detect a specific orientation of the 3D object in the 2D image based on visual information received from corresponding transform coefficients; combining results of application of the one or more view-based detectors; and determining orientation and location of the 3D object from the combination of results of application of the one or more view-based detectors.
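As a hedged illustration of how transform coefficients can be localized in space, frequency, and orientation, the following sketch computes one level of a 2D Haar wavelet transform. The Haar filter is an assumption made for simplicity; the text above does not specify which wavelet is used, and the function names and subband labels are likewise illustrative.

```python
def haar_1d(values):
    """One level of the 1D Haar transform: pairwise averages, then differences."""
    half = len(values) // 2
    lows = [(values[2 * i] + values[2 * i + 1]) / 2.0 for i in range(half)]
    highs = [(values[2 * i] - values[2 * i + 1]) / 2.0 for i in range(half)]
    return lows + highs


def haar_2d(image):
    """One level of the 2D Haar transform on a list-of-rows image.

    Assumes the image has an even number of rows and columns.
    """
    # Filter along each row first, then along each column of the result.
    rows = [haar_1d(row) for row in image]
    cols = [haar_1d(list(col)) for col in zip(*rows)]
    out = [list(row) for row in zip(*cols)]
    h, w = len(out) // 2, len(out[0]) // 2
    return {
        "LL": [r[:w] for r in out[:h]],  # low-pass both ways: coarse approximation
        "HL": [r[w:] for r in out[:h]],  # high-pass along rows: vertical edges
        "LH": [r[:w] for r in out[h:]],  # high-pass along columns: horizontal edges
        "HH": [r[w:] for r in out[h:]],  # high-pass both ways: diagonal detail
    }


if __name__ == "__main__":
    # A 2x2 patch with a vertical edge between its two columns.
    bands = haar_2d([[0, 8], [0, 8]])
    for name, band in bands.items():
        print(name, band)
```

Each coefficient is tied to a small image neighborhood (space), to a subband level (frequency), and to a subband type (orientation). In the 2x2 example, only the approximation band and the vertical-edge band are non-zero.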
In an alternative embodiment, the present invention contemplates a method of providing assistance in detecting the presence of a 3D object in a 2D image. The method comprises receiving a digitized version of the 2D image from a client site and over a communication network (e.g., the Internet); determining the location of the 3D object in the 2D image; and sending a notification of the location of the 3D object to the client site over the communication network.
In a still further embodiment, the present invention contemplates a computer-readable storage medium having stored thereon instructions, which, when executed by a processor, cause the processor to perform a number of tasks including the following: digitize a 2D image containing a 2D representation of a 3D object; compute a wavelet transform of the digitized version of the 2D image, wherein the wavelet transform generates a plurality of transform coefficients, and wherein each transform coefficient represents corresponding visual information from the 2D image; place an image window of fixed size at a first plurality of locations within the 2D image; evaluate a plurality of visual attributes at each of the first plurality of locations of the image window using corresponding transform coefficients; and estimate the presence of the 3D object in the 2D image based on evaluation of the plurality of visual attributes at each of the first plurality of locations.
An object finder program according to the present invention improves upon existing methods of 3D object detection both in accuracy and computational properties. These improvements are based around the use of the wavelet transform for object detection. A pre-selected number of view-based detectors are trained on sample 2D images prior to performing the detection on an unknown 2D image. These detectors then operate on the given 2D input image and compute a quantized wavelet transform for the entire input image. The object detection then proceeds with sampling of the quantized wavelet coefficients at different image window locations on the input image and efficient look-up of pre-computed log-likelihood tables to determine object presence. The object finder's coarse-to-fine object detection strategy coupled with exhaustive object search across different positions and scales results in an efficient and accurate object detection scheme. The object finder detects a 3D object over a wide range in angular variation (e.g., 180 degrees) through the combination of a small number of detectors each specialized to a small range within this range of angular variation.
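The table look-up described above may be sketched as follows. This is a minimal illustration under stated assumptions, not the object finder's actual implementation: the three-level quantizer, the table layout (one table per window offset, ordered coarse to fine), the interpretation of table entries as log-likelihood ratios, and the early-rejection rule are all hypothetical choices, as are the function and parameter names.

```python
def quantize(coeff, threshold=2.0):
    """Map a wavelet coefficient to one of three bins (assumed quantizer)."""
    if coeff < -threshold:
        return 0
    if coeff > threshold:
        return 2
    return 1


def window_score(quantized, top, left, tables, reject_below=None):
    """Sum table entries for the window whose corner is at (top, left).

    tables is a list of (dy, dx, table) entries, assumed ordered coarse to
    fine; table[q] is assumed to hold a pre-computed log-likelihood ratio
    for bin q. Returns None if the running sum falls below reject_below.
    """
    total = 0.0
    for dy, dx, table in tables:
        total += table[quantized[top + dy][left + dx]]
        if reject_below is not None and total < reject_below:
            return None  # early rejection: the remaining tables are skipped
    return total


def detect(quantized, win_h, win_w, tables, threshold=0.0, reject_below=None):
    """Return (top, left, score) for each window position above threshold."""
    hits = []
    for top in range(len(quantized) - win_h + 1):
        for left in range(len(quantized[0]) - win_w + 1):
            score = window_score(quantized, top, left, tables, reject_below)
            if score is not None and score > threshold:
                hits.append((top, left, score))
    return hits


if __name__ == "__main__":
    coeffs = [[0, 0, 0], [0, 5, -5], [0, 0, 0]]
    # The image is quantized once; every window then reuses the result.
    q = [[quantize(c) for c in row] for row in coeffs]
    # Toy tables that favor a strong positive coefficient followed by a
    # strong negative one.
    tables = [(0, 0, [-2.0, -1.0, 3.0]), (0, 1, [3.0, -1.0, -2.0])]
    print(detect(q, 2, 2, tables, reject_below=-0.5))
```

Quantizing the whole image once and replacing per-window arithmetic with table look-ups is what keeps the exhaustive search affordable in this sketch. The early rejection is one simple way to picture a coarse-to-fine strategy, in which most window positions are discarded after only a few look-ups.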
The object finder may be trained to detect many different types of objects (e.g., airplanes, cats, trees, etc.) besides the human faces and cars as discussed hereinbelow. Some of the applications where the object finder may be used include: commercial image databases (e.g., stock photography) for automatically labeling and indexing images; an Internet-based image searching and indexing service; finding objects of military interest (e.g., mines, tanks, etc.) in satellite, radar, or visible imagery; as a tool for automatic description of the image content of an image database; to achieve accurate color balancing on human faces and remove red-eye from human faces in digital photo development; for automatic adjustment of focus, contrast, and centering on human faces during digital photography; and enabling automatic zooming on human faces as part of a security and surveillance system.