Object recognition is the technique of using computers to automatically locate objects in images, where an object can be any type of three-dimensional physical entity such as a human face, automobile, airplane, etc. Object detection involves locating any object that belongs to a category such as the class of human faces, automobiles, etc. For example, a face detector would attempt to find all human faces in a photograph, but would not make finer distinctions such as identifying each face.
The challenge in object detection is coping with all the variations in appearance that can exist within a class of objects. FIG. 1A illustrates a picture slide 10 showing some variations for human faces and cars. For example, cars vary in shape, size, coloring, and in small details such as the headlights, grill, and tires. Similarly, the class of human faces may contain human faces for males and females, young and old, bespectacled with plain eyeglasses or with sunglasses, etc. A person's race, age, gender, ethnicity, etc., may play a dominant role in defining the person's facial features. Also, the visual expression of a face may differ from human to human. One face may appear jovial whereas another may appear sad and gloomy. Visual appearance also depends on the surrounding environment and lighting conditions, as illustrated by the picture slide 12 in FIG. 1B. Light sources vary in their intensity, color, and location with respect to the object. Nearby objects may cast shadows on the object or reflect additional light onto the object. Furthermore, the appearance of the object also depends on its pose; that is, its position and orientation with respect to the camera. In particular, a side view of a human face looks quite different from a frontal view. FIG. 1C shows a picture slide 14 illustrating geometric variation among human faces.
Therefore, a computer-based object detector must accommodate all this variation and still distinguish the object from any other pattern that may occur in the visual world. For example, a human face detector must be able to find faces regardless of facial expression, variation from person to person, or variation in lighting and shadowing. Most methods for object detection use statistical modeling to represent this variability. Statistics is a natural way to describe a quantity that is not fixed or deterministic such as a human face. The statistical approach is also versatile. The same statistical modeling techniques can potentially be used to build object detectors for different objects without re-programming.
Techniques for object detection in two-dimensional images differ primarily in the statistical model they use. One known method represents object appearance by several prototypes consisting of a mean and a covariance about the mean. Another known technique consists of a quadratic classifier. Such a classifier is mathematically equivalent to the representation of each class by its mean and covariance. These and other known techniques emphasize statistical relationships over the full extent of the object. As a consequence, they compromise the ability to represent small areas in a rich and detailed way. Other known techniques address this limitation by decomposing the model in terms of smaller regions. These methods can represent appearance in terms of a series of inner products with portions of the image. Finally, another known technique decomposes appearance further into a sum of independent models for each pixel.
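The stated equivalence between a quadratic classifier and a per-class mean and covariance can be illustrated with a minimal sketch. It assumes two-dimensional feature vectors, Gaussian class models, and equal class priors; the function names and the sample values are hypothetical and are not taken from any of the known techniques referred to above.

```python
import math

def fit_gaussian(samples):
    """Estimate the mean and the 2x2 covariance (sxx, sxy, syy) of 2-D samples."""
    n = len(samples)
    mx = sum(s[0] for s in samples) / n
    my = sum(s[1] for s in samples) / n
    sxx = sum((s[0] - mx) ** 2 for s in samples) / n
    syy = sum((s[1] - my) ** 2 for s in samples) / n
    sxy = sum((s[0] - mx) * (s[1] - my) for s in samples) / n
    return (mx, my), (sxx, sxy, syy)

def log_density(x, mean, cov):
    """Log of the 2-D Gaussian density; a quadratic function of x."""
    sxx, sxy, syy = cov
    det = sxx * syy - sxy * sxy
    dx, dy = x[0] - mean[0], x[1] - mean[1]
    # Squared Mahalanobis distance via the closed-form inverse of a 2x2 matrix.
    maha = (syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det
    return -0.5 * maha - 0.5 * math.log(det) - math.log(2 * math.pi)

def classify(x, models):
    """Return the class whose Gaussian assigns x the highest log density.

    Equal class priors are assumed, so priors are omitted from the score.
    """
    return max(models, key=lambda name: log_density(x, *models[name]))

# Hypothetical training samples, for illustration only.
MODELS = {
    "object": fit_gaussian([(0, 0), (1, 0), (0, 1), (1, 1)]),
    "non_object": fit_gaussian([(4, 4), (6, 4), (4, 6), (6, 6)]),
}

print(classify((0.5, 0.6), MODELS))
print(classify((5.0, 4.5), MODELS))
```

Because each class is summarized by a single mean and covariance over the whole feature vector, the decision boundary is a quadratic surface, and fine local detail is averaged into the global statistics.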
The known techniques discussed above are limited, however, in that they represent the geometry of the object as a fixed rigid structure. This limits their ability to accommodate differences in the relative distances between various features of a human face such as the eyes, nose, and mouth. Not only can these distances vary from person to person, but their projections into the image can also vary with the viewing angle of the face. For this reason, these methods tend to fail for faces that are not in a fully frontal posture. This limitation is addressed by some known techniques, which allow for small amounts of variation among small groups of handpicked features such as the eyes, nose, and mouth. However, by using a small set of handpicked features, these techniques have limited power. Another known technique allows for geometric flexibility with a more powerful representation by using richer features (each of which takes on a large set of values) sampled at regular positions across the full extent of the object. Each feature measurement is treated as statistically independent of all others. The disadvantage of this approach is that any relationship not explicitly captured by one of the features is not represented at all. Therefore, performance depends critically on the quality of the feature choices.
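Treating each feature measurement as statistically independent means that the evidence for the object is simply accumulated feature by feature. The following is a minimal sketch under that assumption, using a sum of per-feature log-likelihood ratios; the probability tables, feature values, and function name are hypothetical and serve only as an illustration.

```python
import math

def log_likelihood_ratio(features, p_object, p_background):
    """Sum per-position log ratios, treating every feature as independent.

    features[i] is the discrete value observed at sampling position i;
    p_object[i] and p_background[i] map each value to its probability.
    """
    total = 0.0
    for position, value in enumerate(features):
        ratio = p_object[position][value] / p_background[position][value]
        total += math.log(ratio)
    return total

# Hypothetical probability tables for two sampling positions.
P_OBJECT = [{"edge": 0.8, "flat": 0.2}, {"edge": 0.3, "flat": 0.7}]
P_BACKGROUND = [{"edge": 0.5, "flat": 0.5}, {"edge": 0.5, "flat": 0.5}]

# A positive score favors "object"; a negative score favors "background".
score_a = log_likelihood_ratio(["edge", "flat"], P_OBJECT, P_BACKGROUND)
score_b = log_likelihood_ratio(["flat", "edge"], P_OBJECT, P_BACKGROUND)
print(score_a > 0, score_b < 0)
```

Note that the score depends only on each position's own value. Any dependency between two positions, such as the two eyes appearing together, contributes nothing unless a single feature happens to span both.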
Additionally, all of the above techniques are structured such that the entire statistical model must be evaluated against the input image to determine if the object is present. This can be time-consuming and inefficient. In particular, since the object can appear at any position and any size within the image, a detection decision must be made for every combination of possible object position and size within an image. It is therefore desirable to detect a 3D object in a 2D image over a wide range of variation in object location, orientation, and appearance.
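The cost of making a decision for every combination of position and size can be made concrete by enumerating the candidate windows. The sketch below assumes square windows, a base window size of 24 pixels, a stride that grows in proportion to the window size, and a geometric scale step; all of these parameter values are illustrative assumptions rather than values taken from any known technique.

```python
def scan_windows(img_w, img_h, win=24, stride=4, scale_step=1.25):
    """Yield (x, y, size) for every window position and size to be tested."""
    size = float(win)
    while size <= min(img_w, img_h):
        s = int(size)
        # Scale the stride with the window so coverage stays proportional.
        step = max(1, int(stride * size / win))
        for y in range(0, img_h - s + 1, step):
            for x in range(0, img_w - s + 1, step):
                yield x, y, s
        size *= scale_step

# Each yielded window would require one full evaluation of the model.
evaluations = sum(1 for _ in scan_windows(640, 480))
print(evaluations)
```

If the full statistical model must be evaluated at each window, the total running time is the per-window cost multiplied by this count, which is the inefficiency described above.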
It is also known that object detection may be implemented by applying a local operator or a set of local operators to a digital image, or a transform of a digital image. Such a scheme, however, may require that a human programmer choose the local operator or set of local operators that are applied to the image. As a result, the overall accuracy of the detection program can be dependent on the skill and intuition of the human programmer. It is therefore desirable to determine the local operators or set of local operators in a manner that is not dependent on humans.
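As an illustration of a hand-chosen local operator, the sketch below applies a 3x3 horizontal-gradient (Sobel) kernel to a small grayscale image stored as nested lists. The choice of the Sobel kernel, the function name, and the sample image are assumptions made for this example; the operator is applied as a correlation (the kernel is not flipped) over positions where it fits entirely inside the image.

```python
def apply_local_operator(image, kernel):
    """Slide the kernel over the image and return the response at each position."""
    kh, kw = len(kernel), len(kernel[0])
    out = []
    for r in range(len(image) - kh + 1):
        row = []
        for c in range(len(image[0]) - kw + 1):
            # Weighted sum of the pixels under the kernel at offset (r, c).
            row.append(sum(kernel[i][j] * image[r + i][c + j]
                           for i in range(kh) for j in range(kw)))
        out.append(row)
    return out

# A hand-picked operator that responds to vertical edges.
SOBEL_X = [[-1, 0, 1],
           [-2, 0, 2],
           [-1, 0, 1]]

# Hypothetical 3x5 image with a dark-to-bright vertical edge.
IMAGE = [[0, 0, 0, 10, 10],
         [0, 0, 0, 10, 10],
         [0, 0, 0, 10, 10]]

print(apply_local_operator(IMAGE, SOBEL_X))  # [[0, 40, 40]]
```

The kernel here was selected by a person who expected vertical edges to be informative, which is precisely the dependence on human judgment noted above.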
Finally, even with very high speed computers, known object detection techniques can require an exorbitant amount of time to operate. It is therefore also desirable to perform the object detection in a computationally advantageous manner so as to conserve time and computing resources.