Hierarchical approaches to generic object recognition have become increasingly popular over the years. These are in some cases inspired by the hierarchical nature of the primate visual cortex (LeCun, Yann et al., “Learning methods for generic object recognition with invariance to pose and lighting,” in Proceedings of CVPR (Computer Vision and Pattern Recognition) '04, IEEE Press, 2004; and Wersing, H. and E. Korner, “Learning optimized features for hierarchical models of invariant recognition,” Neural Computation 15(7), 2003), but, most importantly, hierarchical approaches have been shown to consistently outperform flat single-template (holistic) object recognition systems on a variety of object recognition tasks (Heisele, B. et al., “Categorization by learning and combining object parts,” in NIPS (Neural Information Processing Systems), Vancouver, 2001). Recognition typically involves the computation of a set of target features, also called components, parts (see Weber, M. et al., “Unsupervised learning of models for recognition,” in ECCV (European Conference on Computer Vision), Dublin, Ireland, 2000) or fragments (see Ullman, S. et al., “Visual features of intermediate complexity and their use in classification,” Nature Neuroscience 5(7): 682-687, 2002), at one step and their combination in the next step. Features usually fall into one of two categories: template-based or histogram-based. Several template-based methods exhibit excellent performance in the detection of a single object category, e.g., faces (Viola, P. and M. Jones, “Robust real-time face detection,” in ICCV (International Conference on Computer Vision) 20(11):1254-1259, 2001), cars (Schneiderman, H. and T. Kanade, “A statistical method for 3D object detection applied to faces and cars,” in CVPR (IEEE Conference on Computer Vision and Pattern Recognition), pp. 746-751, 2000) or pedestrians (Mohan, A. et al., “Example-based object detection in images by components,” in PAMI (IEEE Transactions on Pattern Analysis and Machine Intelligence), 23(4):349-361, 2001). Constellation models based on generative methods perform well in the recognition of several object categories (Fergus, R. et al., “Object class recognition by unsupervised scale-invariant learning,” in CVPR, 2:264-271, 2003), particularly when trained with very few training examples (Fei-Fei, L. et al., “Learning generative visual models from few training examples: An incremental Bayesian approach tested on 101 object categories,” in CVPR, Workshop on Generative-Model Based Vision, 2004).
One limitation of these rigid template-based features is that they might not adequately capture variations in object appearance: they are very selective for a target shape but lack invariance with respect to object transformations. At the other extreme, histogram-based descriptions (Lowe, D. G., “Object recognition from local scale-invariant features,” in ICCV, pp. 1150-1157, 1999; and Belongie, S. et al., “Shape matching and object recognition using shape contexts,” PAMI, 2002) are very robust with respect to object transformations. The SIFT-based features of Lowe, for instance, have been shown to excel in the re-detection of a previously seen object under new image transformations. However, with such a degree of invariance, it is unlikely that the SIFT-based features could perform well on a generic object recognition task.
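The trade-off between selectivity and invariance described above can be illustrated with a toy sketch. It does not implement any of the cited methods: a plain intensity histogram stands in, as an assumption, for histogram-based descriptors such as SIFT or shape contexts, and a pixel-by-pixel cosine match stands in for a rigid template. All function names and the 6x6 patch are hypothetical and chosen only for illustration.

```python
import math
import random


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def template_similarity(patch, template):
    """Rigid template match: compare pixels position by position."""
    flat_patch = [v for row in patch for v in row]
    flat_template = [v for row in template for v in row]
    return cosine(flat_patch, flat_template)


def histogram(patch, levels=4):
    """Count how often each intensity level occurs, ignoring position."""
    counts = [0] * levels
    for row in patch:
        for v in row:
            counts[v] += 1
    return counts


def histogram_similarity(patch, template, levels=4):
    """Histogram match: compare intensity counts only."""
    return cosine(histogram(patch, levels), histogram(template, levels))


def shift_right(patch, k=1):
    """Circularly shift every row k pixels to the right."""
    return [row[-k:] + row[:-k] for row in patch]


def scramble(patch, seed=0):
    """Randomly permute pixel positions, destroying the shape."""
    flat = [v for row in patch for v in row]
    random.Random(seed).shuffle(flat)
    width = len(patch[0])
    return [flat[i:i + width] for i in range(0, len(flat), width)]


# Toy 6x6 patch: a vertical bar (intensity 3) on a dark background (0).
template = [[0, 0, 3, 3, 0, 0] for _ in range(6)]
shifted = shift_right(template, 2)   # same bar, moved two pixels
scrambled = scramble(template)       # same pixels, shape destroyed

print("template  vs shifted  :", template_similarity(shifted, template))
print("histogram vs shifted  :", histogram_similarity(shifted, template))
print("template  vs scrambled:", template_similarity(scrambled, template))
print("histogram vs scrambled:", histogram_similarity(scrambled, template))
```

In this sketch the template score for the shifted bar drops to zero because the bar no longer overlaps its original position, while the histogram score stays at essentially one. The histogram score also stays at essentially one for the scrambled patch, which no longer contains a bar at all; this is the sense in which a highly invariant description may be too unselective for generic recognition.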