Object detection in consumer images is an important image analysis task. In particular, an algorithm that could detect and recognize objects in images would allow a computer to automatically extract a large amount of semantic information from an image, in effect simulating what a human sees when viewing an image. The semantic information could be employed to improve upon a wide range of image understanding applications, such as automatic image categorization, scene classification, image orientation determination, etc.
Despite years of research attention, there has been little success in creating a single computer algorithm that can reliably detect an arbitrary object in unconstrained images. The best that the current state of the art can attain is to build separate algorithms for specific objects, for classes of objects, or for certain conditions, e.g., faces (M.-H. Yang, D. Kriegman, N. Ahuja. Detecting Faces in Images: A Survey. In IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 24:1, pp. 34-58, 2002), human bodies (N. Sprague and J. Luo. Clothed People Detection in Still Images. In Proceedings of the International Conference on Pattern Recognition, 2002), horses (D. A. Forsyth and M. M. Fleck. Body Plans. In Proceedings of IEEE International Conference on Computer Vision and Pattern Recognition, 1997), license plates (J.-W. Hsieh, S.-H. Yu, Y.-S. Chen. Morphology-based License Plate Detection from Complex Scenes. In Proceedings of the International Conference on Pattern Recognition, 2002), cars in satellite photos (H. Moon, R. Chellappa, A. Rosenfeld. Optimal Edge-Based Shape Detection. In IEEE Transactions on Image Processing, (11) 11, Nov. 2002), and road signs (Y. Lauziere, D. Gingras, F. Ferrie. A Model-Based Road Sign Identification System. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2001).
Building object detection algorithms is typically time-consuming and labor-intensive. Two basic approaches are commonly taken to build a detection algorithm for a new object or object class. The first is to collect a large amount of image data containing the object and train a learning engine on the ground truth data (e.g., H. Schneiderman and T. Kanade. A Statistical Method for 3D object detection applied to faces and cars. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2000, and H. Rowley, S. Baluja, T. Kanade. Rotation Invariant Neural Network-Based Face Detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 1998). However, collecting a large amount of ground truth is time-consuming and for some objects may be difficult or impossible. Also, a significant amount of human effort is required to design the learning engine and select appropriate image features. The second approach is to use human intuition to write rules for finding the object. Unfortunately, this is also labor-intensive and requires experts rather than simply operators, and the resulting detector is very specialized: a completely new set of rules must be written for each type of object.
This invention considers the detection of compound color objects, which we define as objects having a specific set of multiple colors that are arranged in a unique and constant spatial layout, subject to global and local deformations that change the appearance of the object in an image. This is a relatively wide class of objects that includes, for example, flags, cartoon characters, logos, uniforms, signs, etc. This problem is non-trivial because the appearance of compound color objects may vary drastically from scene to scene. Objects like flags and logos often appear on flexible material, and their appearances change as the material distorts. For example, a flag is subject to self-occlusion and non-affine distortion depending on wind conditions. Since orientation of images is not always known and many compound color objects do not have fixed orientations, the detector must be invariant to rotation. It should also be robust to color shifts due to illuminant changes and color differences from object to object.
In the design of any object detection system, one must choose a suitable representation that is used for comparing the object model to an input image. The choice of representation is typically a function of the types of distortions that are expected in the object across different images. For example, if one expects dramatic color variations in an object, a representation based on image edges might be chosen (e.g., Moon, Chellappa and Rosenfeld), while if dramatic spatial variations are expected, a representation using global color histograms might be wise (e.g. M. Swain and D. Ballard. Color Indexing. International Journal of Computer Vision, (7) 1, pp. 11-32, 1991). There is a continuum of possible representations depending on the degree of spatial distortion that can be accommodated. On one end of the continuum is pixel-by-pixel template matching. This approach is used for rigid objects (e.g., face detection). On the other end of the continuum are flexible models that decompose an object into its component parts and capture the possible spatial relationships between them. As one moves from the former end of the continuum to the latter end, the approaches become much more flexible in the types of distortions that they can handle. However, they also tend to require more high-level knowledge about the target object and become more susceptible to false alarms. An approach near the latter end of the spectrum is necessary for objects whose spatial arrangements can change significantly (e.g., human pedestrians). For our compound color object detection problem, an approach somewhere in the middle is required. By definition, the spatial layout of a compound color object is fixed, but distortions may still occur due to camera angle and projection of the object on a non-rigid surface, like flags and logos on fabric.
Object detection is a fundamental problem in computer vision and has received a large amount of attention in the literature. As mentioned above, there is a spectrum of different approaches to object recognition, depending upon the level of abstraction at which object matching is performed. Major relevant object detection work found in the literature is highlighted here. The work is listed in order of increasing levels of abstraction.
Rowley et al. detect faces using template matching on the intensity plane of an image. Pre-processing is applied to input images to correct for lighting variations and to boost contrast. Image regions are classified as face or non-face using a neural network classifier applied directly on the luminance pixel values. The neural network was trained with approximately 10,000 ground-truth images.
Schneiderman and Kanade detect faces in images using joint histograms of wavelet features. Their statistical approach allows some robustness to variation in facial appearances, such as different angles of face orientation.
Oren et al. (M. Oren, C. Papageorgiou, P. Sinha, E. Osuna, T. Poggio. Pedestrian Detection Using Wavelet Templates. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 1997) use wavelet features to detect pedestrians in images. The input image is scanned for pedestrians using windows of different sizes and classified using a Support Vector Machine.
Selinger and Nelson (A. Selinger, R. C. Nelson. Appearance-based Object Recognition Using Multiple Views. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2001) represent 3-D objects by several 2-D images taken from different angles. The 2-D images are further abstracted as groups of contour curves. Recognition is performed by exhaustive template matching of the curves.
Huttenlocher et al. (D. P. Huttenlocher, G. A. Klanderman, and W. J. Rucklidge. Comparing Images Using the Hausdorff Distance. In IEEE Transactions on Pattern Analysis and Machine Intelligence, (15) pp. 850-863, 1993) represent objects using edge pixel maps and compare images using the Hausdorff distance between the locations of edge pixels. The Hausdorff distance allows more tolerance to geometric distortion than simple pixel-by-pixel template matching.
Fan et al. (L. Fan, K.-K. Sung, T.-K. Ng. Pedestrian registration in static images with unconstrained background. In Pattern Recognition, 36 (2003), pp. 1019-1029, 2003) represent outlines of pedestrians using a series of feature points and line segments. A feature-based image warping technique is used to account for variability in pedestrian appearance.
Cootes et al. (T. F. Cootes, G. J. Edwards, and C. J. Taylor. Active appearance models. In Proceedings of the European Conference on Computer Vision, pp. 484-498, 1998) represent objects using active appearance models (AAMs) that model the shape and grayscale appearance of objects. The models allow detection of flexible objects, like faces.
Sprague and Luo detect people in images by grouping together segmented regions using characteristics like position, shape, size, color, and orientation according to a flexible model. A Bayesian network classifier is used.
Forsyth and Fleck use a similar approach to detecting horses in images. Their system segments an image into candidate horse regions using color and texture features and then assembles regions using a “body plan” to support the related geometric reasoning. Although powerful, these graphical model-based matching approaches require either a large amount of ground truth data to learn the allowable variability in an object's appearance, or rules specified by the intuition of a human expert.
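To illustrate the Hausdorff distance mentioned in the work of Huttenlocher et al. above, the following is a minimal sketch of the standard symmetric definition applied to two small sets of edge pixel coordinates. It is not taken from the cited paper; the function names and the sample coordinates are assumptions made for illustration only.

```python
import math

def directed_hausdorff(a, b):
    """Largest distance from any point in a to its nearest point in b."""
    return max(
        min(math.hypot(p[0] - q[0], p[1] - q[1]) for q in b)
        for p in a
    )

def hausdorff(a, b):
    """Symmetric Hausdorff distance between two non-empty point sets."""
    return max(directed_hausdorff(a, b), directed_hausdorff(b, a))

# Hypothetical edge pixel locations (row, column) for a model and an image.
model_edges = [(0, 0), (0, 1), (0, 2)]
image_edges = [(1, 0), (1, 1), (1, 3)]

distance = hausdorff(model_edges, image_edges)
print(distance)
```

Because the measure depends only on the worst-case nearest-neighbor distance, a small displacement of every edge pixel changes the score by only a small amount, which is the tolerance to geometric distortion noted above.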
In U.S. Patent No. 6,477,272 entitled “Object recognition with co-occurrence histograms and false alarm probability analysis for choosing optimal object recognition process parameters,” Krumm and Chang propose an object detection algorithm using color co-occurrence histograms, a feature that captures the colors within an object as well as some spatial layout information. They quantize multiple object models to a small number of colors using a k-means clustering algorithm and then quantize the test images using the same color clusters. They compute the color co-occurrence histogram of the model objects. The test image is scanned by computing the color co-occurrence histogram of large, overlapping regions which are compared to the model using histogram intersection. Object locations are refined by a hill-climbing search around regions exhibiting high similarity to the model during the rough scan. The disclosure focuses on detailed analysis for setting the parameters of the algorithm to minimize false alarms.
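The general idea of a color co-occurrence histogram compared by histogram intersection can be sketched as follows. This is a simplified illustration under stated assumptions, not the algorithm of the cited patent: the image is assumed to be already quantized to small integer color labels, pixel pairs are binned by Chebyshev distance up to a hypothetical `max_dist` parameter, and the intersection is normalized by the total count of the model histogram. All names are illustrative.

```python
from collections import Counter
from itertools import product

def cooccurrence_histogram(img, max_dist=1):
    """Count ordered pixel pairs keyed by (color1, color2, distance).

    img is a list of rows of quantized color labels. Distance is the
    Chebyshev distance between the two pixels, limited to max_dist.
    """
    height, width = len(img), len(img[0])
    offsets = [
        (dy, dx)
        for dy, dx in product(range(-max_dist, max_dist + 1), repeat=2)
        if (dy, dx) != (0, 0)
    ]
    hist = Counter()
    for y in range(height):
        for x in range(width):
            for dy, dx in offsets:
                ny, nx = y + dy, x + dx
                if 0 <= ny < height and 0 <= nx < width:
                    key = (img[y][x], img[ny][nx], max(abs(dy), abs(dx)))
                    hist[key] += 1
    return hist

def intersection(model_hist, region_hist):
    """Histogram intersection, normalized by the model's total count."""
    total = sum(model_hist.values())
    shared = sum(min(n, region_hist[k]) for k, n in model_hist.items())
    return shared / total

# Hypothetical 2x2 quantized images: a model, its mirror image, and an
# unrelated single-color region.
model = [[0, 1], [0, 1]]
mirrored = [[1, 0], [1, 0]]
other = [[2, 2], [2, 2]]

model_hist = cooccurrence_histogram(model)
print(intersection(model_hist, cooccurrence_histogram(mirrored)))
print(intersection(model_hist, cooccurrence_histogram(other)))
```

In this sketch the mirrored image scores the same as the model itself, because the histogram records which colors occur near each other but not their absolute positions, while the single-color region shares no color pairs with the model.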
It must be noted that the method of Krumm and Chang was designed for images captured under very controlled conditions. Specifically, illumination conditions and camera settings need to be kept constant across all model and test images. The size and orientation of the objects also need to be the same across all model and test images. Such assumptions do not hold for unconstrained consumer images, where factors like illumination and object size can vary widely from image to image. In summary, Krumm and Chang's approach would not generalize to unconstrained consumer images. The following shortcomings of their proposed algorithm are specifically identified:
It is not invariant to color shifts. It assumes controlled illumination conditions and is therefore unable to handle the color shifts typical across different consumer images.
It is not invariant to scaling. It assumes that target objects have a constant size with respect to the image dimensions.
It is not invariant to object orientation.
It assumes that the target object occurs exactly once in each test image. No facility is provided for processing images that contain zero or multiple target objects.
It relies on a hill-climbing strategy in which the object location is found by iteratively sliding the hypothesized object location towards the direction of best match. Such strategies are prone to falling into local maxima that are not globally optimal.
It uses a similarity metric that yields a high frequency of false alarms.
The computational demands of the algorithm are high.
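The local-maximum weakness of hill climbing noted above can be shown with a toy example. This is a one-dimensional sketch with made-up match scores, not the search procedure of the cited patent; the function name and the score values are assumptions for illustration.

```python
def hill_climb(scores, start):
    """Greedy 1-D hill climbing: step to the better neighbor until none improves."""
    pos = start
    while True:
        neighbors = [i for i in (pos - 1, pos + 1) if 0 <= i < len(scores)]
        if not neighbors:
            return pos
        best = max(neighbors, key=lambda i: scores[i])
        if scores[best] <= scores[pos]:
            return pos  # no neighbor is better: a (possibly local) maximum
        pos = best

# Hypothetical similarity scores at six candidate object locations.
match_scores = [1, 3, 2, 5, 9, 4]

found = hill_climb(match_scores, start=0)
best_overall = max(range(len(match_scores)), key=lambda i: match_scores[i])
print(found, best_overall)
```

Starting from location 0, the search stops at location 1, where both neighbors score lower, even though the best match is at location 4. The result therefore depends on where the search begins.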
There is therefore a need for a compound color object detection method that is easily deployable for most compound color objects. Instead of requiring a large number of exemplars or human intuition, the algorithm should work well with a single model image or a small number of model images. The algorithm should be easily redeployable for other compound color objects by simply changing the model image. In particular, a need exists for a technique of object detection that overcomes the above-described drawbacks.