Field of the Disclosure
The present disclosure generally relates to image processing and image recognition, and more particularly, to a system and method for recognizing and detecting 3D (three-dimensional) objects in 2D (two-dimensional) images using Bayesian network based classifiers.
Brief Description of Related Art
Object detection is the technique of using computers to automatically locate objects in images, where an object can be any type of three-dimensional physical entity such as a human face, an automobile, an airplane, etc. Object detection involves locating any object that belongs to a category such as the class of human faces, automobiles, etc. For example, a face detector would attempt to find all human faces in a photograph.
A challenge in object detection is coping with all the variations in appearance that can exist within a class of objects. FIG. 1A illustrates a picture slide 10 showing some variations in appearance for human faces. For example, the class of human faces may contain human faces for males and females, young and old, bespectacled with plain eyeglasses or with sunglasses, etc. Similarly, for example, another class of objects, cars (not shown), may contain cars that vary in shape, size, coloring, and in small details such as the headlights, grill, and tires. In the case of humans, a person's race, age, gender, ethnicity, etc., may play a dominant role in defining the person's facial features. Also, the visual expression of a face may be different from human to human. One face may appear jovial whereas another may appear sad and gloomy. Visual appearance also depends on the surrounding environment and lighting conditions as illustrated by the picture slide 12 in FIG. 1B. Light sources will vary in their intensity, color, and location with respect to the object. Nearby objects may cast shadows on the object or reflect additional light on the object. Furthermore, the appearance of the object also depends on its pose, that is, its position and orientation with respect to the camera. In particular, a side view of a human face will look very different from a frontal view. FIG. 1C shows a picture slide 14 illustrating geometric variation among human faces. Various human facial geometry variations are outlined by rectangular boxes superimposed on the human faces in the slide 14 in FIG. 1C.
Therefore, a computer-based object detector must accommodate all these variations and still distinguish the object from any other pattern that may occur in the visual world. For example, a human face detector must be able to find faces regardless of facial expression, variations in the geometrical relationship between the camera and the person, or variation in lighting and shadowing. Most methods for object detection use statistical modeling to represent this variability. Statistics is a natural way to describe a quantity that is not fixed or deterministic, such as a human face. The statistical approach is also versatile. The same statistical modeling techniques can potentially be used to build object detectors for different objects without re-programming.
Techniques for object detection in two-dimensional images differ primarily in the statistical model they use. One known method represents object appearance by several prototypes consisting of a mean and a covariance about the mean. Another known technique consists of a quadratic classifier. Such a classifier is mathematically equivalent to the representation of each class by its mean and covariance. These and other known techniques emphasize statistical relationships over the full extent of the object. As a consequence, they compromise the ability to represent small areas in a rich and detailed way. Other known techniques address this limitation by decomposing the model in terms of smaller regions. These methods can represent appearance in terms of a series of inner products with portions of the image. Finally, another known technique decomposes appearance further into a sum of independent models for each pixel.
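By way of illustration only, the equivalence noted above between a quadratic classifier and a representation of each class by its mean and covariance can be sketched as follows. The sketch assumes two-dimensional feature vectors and Gaussian class models; the function names, the example means and covariances, and the zero decision threshold are hypothetical and are not taken from any of the known techniques discussed here.

```python
import math


def gaussian_log_density(x, mean, cov):
    """Log density of a 2-D Gaussian with the given mean and 2x2 covariance."""
    (a, b), (c, d) = cov
    det = a * d - b * c
    dx = x[0] - mean[0]
    dy = x[1] - mean[1]
    # Mahalanobis distance, using the closed-form inverse of a 2x2 matrix.
    maha = (d * dx * dx - (b + c) * dx * dy + a * dy * dy) / det
    return -0.5 * (maha + math.log(det) + 2.0 * math.log(2.0 * math.pi))


def quadratic_score(x, object_model, background_model):
    """Log-likelihood ratio of the two class models; quadratic in x."""
    return (gaussian_log_density(x, *object_model)
            - gaussian_log_density(x, *background_model))


# Hypothetical class models: (mean, covariance) for each class.
object_model = ((2.0, 2.0), ((1.0, 0.0), (0.0, 1.0)))
background_model = ((0.0, 0.0), ((1.0, 0.0), (0.0, 1.0)))

# A positive score favors the object class, a negative score the background.
is_object = quadratic_score((2.0, 2.0), object_model, background_model) > 0.0
```

Because each class is summarized by a single mean and covariance over the whole feature vector, such a model captures statistical relationships over the full extent of the object rather than fine detail in small regions.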
The known techniques discussed above are limited, however, in that they represent the geometry of the object as a fixed rigid structure. This limits their ability to accommodate differences in the relative distances between various features of a human face such as the eyes, nose, and mouth. Not only can these distances vary from person to person, but also their projections into the image can vary with the viewing angle of the face. For this reason, these methods tend to fail for faces that are not in a fully frontal posture. This limitation is addressed by some known techniques, which allow for small amounts of variation among small groups of handpicked features such as the eyes, nose, and mouth. However, because they use a small set of handpicked features, these techniques have limited power. Another known technique allows for geometric flexibility with a more powerful representation by using richer features (each takes on a large set of values) sampled at regular positions across the full extent of the object. Each feature measurement is treated as statistically independent of all others. The disadvantage of this approach is that any relationship not explicitly represented by one of the features is not represented in the statistical model. Therefore, performance depends critically on the quality of the feature choices.
Additionally, all of the above techniques are structured such that the entire statistical model must be evaluated against the input image to determine if the object is present. This can be time-consuming and inefficient. In particular, since the object can appear at any position and any size within the image, a detection decision must be made for every combination of possible object position and size within an image. It is therefore desirable to detect a 3D object in a 2D image over a wide range of variation in object location, orientation, and appearance.
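As a rough illustration of why an exhaustive search is costly, the following sketch enumerates every candidate window over position and size. It assumes square windows, a fixed pixel stride, and a geometric progression of window sizes; the function name and all parameter values are hypothetical and do not describe any particular known detector.

```python
def candidate_windows(width, height, base_size, scale_step=2.0, stride=1):
    """Yield (x, y, size) for every square window that fits in the image."""
    size = base_size
    while size <= min(width, height):
        for y in range(0, height - size + 1, stride):
            for x in range(0, width - size + 1, stride):
                yield (x, y, size)
        # Grow the window; always advance by at least one pixel so the loop ends.
        size = max(size + 1, int(round(size * scale_step)))


# An 8x8 image scanned with 4x4 and 8x8 windows at a stride of one pixel.
num_decisions = sum(1 for _ in candidate_windows(8, 8, base_size=4))
print(num_decisions)  # 26 windows: 25 at size 4 and 1 at size 8
```

Under these assumptions the full statistical model would be evaluated once per yielded window, so the number of detection decisions grows with the image area multiplied by the number of sizes considered.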
It is also known that object detection may be implemented by forming a statistically based classifier to discriminate the object from other visual scenery. Such a scheme, however, requires choosing the form of the statistical representation and estimating the statistics from labeled training data. As a result, the overall accuracy of the detection program can be dependent on the skill and intuition of the human programmer. It is therefore desirable to design as much of the classifier as possible using automatic methods that infer a design based on actual labeled data in a manner that is not dependent on human intuition.
Furthermore, even with very high speed computers, known object detection techniques can require an exorbitant amount of time to operate. It is therefore also desirable to perform the object detection in a computationally advantageous manner so as to conserve time and computing resources.
It is also desirable not only to perform accurate object detection expeditiously and efficiently, but also to be able to perform object recognition to ascertain whether two input images belong to the same class of object or to different classes of objects, where the notion of class is often more specific, such as images of one person.