Object recognition is an increasingly important area of computer vision, with a wide range of practical applications in, for example, commerce, image searching, image archiving, image retrieval, image organization, manufacturing, and security.
Many objects, such as apparel, are defined by shape. For example, boots and sandals are distinguished from each other by shape. However, accurate object recognition is often difficult because imaging conditions change due to external and internal factors. External factors include illumination conditions (for example, back-lit versus front-lit or overcast versus direct sunlight) and camera poses (for example, frontal view versus side view). In the field of pattern recognition, the variations that imaged objects exhibit under varying imaging conditions are typically referred to as intra-class variations.
The ability to recognize objects across intra-class variations determines success in practical applications. A feature common to object recognition approaches is a similarity measure, under which objects are considered similar if they belong to the same class. The similarity measure can be used to verify that two object images belong to the same class, or to classify a new image by determining which of the given objects it is most similar to. However, designing a good similarity measure is difficult.
Simple similarity measures, such as those based on the Euclidean distance applied directly in the image space, typically do not work well because the image can be affected more by intra-class variations than by inter-class variations. Therefore, object recognition should extract image features that maximize the inter-class differences relative to the intra-class differences.
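As a minimal sketch of why a pixel-space Euclidean distance can mislead, the following compares three tiny grayscale images. The 4x4 images, their pixel values, and the names `boot_dim`, `boot_bright`, and `sandal_dim` are hypothetical and chosen only for illustration; the illumination change is modeled, by assumption, as a simple brightness scaling.

```python
import math

def euclidean_distance(img_a, img_b):
    """Euclidean distance between two equally sized grayscale images,
    each given as a list of rows of pixel intensities."""
    return math.sqrt(sum((a - b) ** 2
                         for row_a, row_b in zip(img_a, img_b)
                         for a, b in zip(row_a, row_b)))

# Hypothetical 4x4 silhouettes (illustrative values only).
# A boot-like shape under dim lighting: object pixels have intensity 50.
boot_dim = [
    [0, 50, 50, 0],
    [0, 50, 50, 0],
    [0, 50, 50, 50],
    [0, 50, 50, 50],
]
# The same shape under bright lighting, modeled as a 4x brightness scaling.
boot_bright = [[4 * p for p in row] for row in boot_dim]
# A sandal-like shape under the same dim lighting as boot_dim.
sandal_dim = [
    [0, 0, 0, 0],
    [0, 0, 0, 0],
    [50, 50, 50, 50],
    [50, 50, 50, 50],
]

# Same class, different illumination (intra-class variation).
intra = euclidean_distance(boot_dim, boot_bright)
# Different classes, same illumination (inter-class variation).
inter = euclidean_distance(boot_dim, sandal_dim)

print(f"intra-class distance: {intra:.1f}")  # about 474.3
print(f"inter-class distance: {inter:.1f}")  # about 122.5
```

In this toy setting the two boot images are farther apart than the boot and the sandal, so a nearest-neighbor decision on raw pixels would favor the wrong class. Features that are less sensitive to illumination would be expected to reverse that ordering.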
More recently, object recognition has been driven to a great extent by advances in texture-based descriptors, such as the scale-invariant feature transform (SIFT) (see D. Lowe, “Distinctive Image Features from Scale-Invariant Keypoints”, 2 International Journal of Computer Vision 60 (2004)) and histograms of oriented gradients (HoG) (see N. Dalal and B. Triggs, “Histograms of Oriented Gradients for Human Detection”, 1 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), 886-893 (2005)). These descriptors capture local or semi-local object information, focus on high-frequency edge information, and can be discriminative.
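To make the idea of a gradient-based descriptor concrete, the following is a simplified sketch of a single-cell orientation histogram. It is not the HoG descriptor as published: block normalization, bin interpolation, and cell tiling are omitted, and the function name `orientation_histogram`, the 9-bin layout, and the central-difference gradients are assumptions made for illustration.

```python
import math

def orientation_histogram(cell, n_bins=9):
    """Histogram of unsigned gradient orientations (0 to 180 degrees) over
    one grayscale cell, weighted by gradient magnitude.
    Border pixels are skipped so central differences stay in bounds."""
    hist = [0.0] * n_bins
    bin_width = 180.0 / n_bins
    rows, cols = len(cell), len(cell[0])
    for y in range(1, rows - 1):
        for x in range(1, cols - 1):
            gx = cell[y][x + 1] - cell[y][x - 1]  # horizontal central difference
            gy = cell[y + 1][x] - cell[y - 1][x]  # vertical central difference
            magnitude = math.hypot(gx, gy)
            if magnitude == 0:
                continue  # flat pixels carry no orientation information
            angle = math.degrees(math.atan2(gy, gx)) % 180.0
            hist[int(angle // bin_width) % n_bins] += magnitude
    return hist

# Hypothetical 4x4 cells (illustrative values only).
vertical_edge = [[0, 0, 255, 255] for _ in range(4)]           # intensity changes along x
horizontal_edge = [[0] * 4, [0] * 4, [255] * 4, [255] * 4]     # intensity changes along y
flat = [[128] * 4 for _ in range(4)]                           # no intensity change

hist_vertical = orientation_histogram(vertical_edge)
hist_horizontal = orientation_histogram(horizontal_edge)
hist_flat = orientation_histogram(flat)
```

Under these assumptions, the vertical edge places all of its weight in the first bin (gradient direction near 0 degrees), the horizontal edge places it in the middle bin (near 90 degrees), and the uniform cell yields an all-zero histogram because it has no gradients to count.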
Although SIFT and HoG have had some success on recognition tasks over a variety of object types (see L. Fei-Fei, R. Fergus, and P. Perona, “Learning Generative Visual Models from Few Training Examples: An Incremental Bayesian Approach Tested on 101 Object Categories”, 106 Comput. Vis. Image Underst. 1 (2007); M. Everingham et al., “The 2005 Pascal Visual Object Classes Challenge”, In Selected Proceedings of the First PASCAL Challenges Workshop, 3944 Lecture Notes in Artificial Intelligence 117 (2005)), they tend to perform poorly on weakly textured objects or objects with variable appearance. This is the case for many man-made objects, such as furniture, bottles, cups, and apparel.
Shopping for apparel is an important business on the Internet; however, visual searching for similarity or style is still a largely unexplored topic. In many clothing items, shape is a defining feature. Many shape-based recognition techniques focus on describing contours or sets of contours. A common approach is to search for a set of connected contours that explains most of the object boundary in an image. Examples include contour networks (see V. Ferrari, T. Tuytelaars, and L. V. Gool, “Object Detection by Contour Segment Networks”, Computer Vision – ECCV, 14-28 (2006)); shape trees (see P. Felzenszwalb and J. Schwartz, “Hierarchical Matching of Deformable Shapes”, Computer Vision and Pattern Recognition (2007)); the particle-based search algorithm by Lu et al. (C. Lu, L. J. Latecki, N. Adluru, X. Yang, and H. Ling, “Shape Guided Contour Grouping with Particle Filters”, Proc. IEEE ICCV (2009)); and simultaneous object detection and segmentation by Toshev et al. (A. Toshev, B. Taskar, and K. Daniilidis, “Shape-Based Object Detection via Boundary Structure Segmentation”, International Journal of Computer Vision (IJCV) (2010)).
Such approaches capture whole contours rather than sparse point configurations. In addition, such approaches aim to segment the object, which requires inference that can be costly because such approaches are not tractable. Further, small boundary fragments have been used as weak classifiers with boosting and subsequent voting. See A. Opelt, A. Pinz, and A. Zisserman, “A Boundary-Fragment Model for Object Detection”, European Conference on Computer Vision (2006); J. Shotton, A. Blake, and R. Cipolla, “Contour-Based Learning for Object Detection”, International Conference on Computer Vision (2005). Boosting refers to a family of learning meta-algorithms for supervised learning that combine weak classifiers into a stronger one. Supervised learning refers to the machine learning task of inferring a function from supervised (labeled) training data.
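The notion of boosting weak classifiers can be illustrated with a minimal sketch of AdaBoost, one well-known boosting algorithm. This is not the method of the works cited above: the weak classifiers here are hypothetical one-dimensional threshold rules ("stumps") rather than boundary fragments, and the data, function names, and number of rounds are assumptions chosen for illustration.

```python
import math

def stump_predict(x, threshold, polarity):
    """Weak classifier: returns polarity if x >= threshold, else -polarity."""
    return polarity if x >= threshold else -polarity

def train_stump(xs, ys, weights):
    """Return (error, threshold, polarity) of the stump with the lowest
    weighted error on 1-D features xs with labels ys in {-1, +1}."""
    best = None
    for threshold in sorted(set(xs)):
        for polarity in (1, -1):
            err = sum(w for w, x, y in zip(weights, xs, ys)
                      if stump_predict(x, threshold, polarity) != y)
            if best is None or err < best[0]:
                best = (err, threshold, polarity)
    return best

def adaboost(xs, ys, rounds):
    """Train an ensemble of (alpha, threshold, polarity) weighted stumps."""
    n = len(xs)
    weights = [1.0 / n] * n
    ensemble = []
    for _ in range(rounds):
        err, threshold, polarity = train_stump(xs, ys, weights)
        err = min(max(err, 1e-10), 1 - 1e-10)    # guard against log(0)
        alpha = 0.5 * math.log((1 - err) / err)  # vote strength of this stump
        ensemble.append((alpha, threshold, polarity))
        # Increase the weight of misclassified samples, decrease the rest.
        weights = [w * math.exp(-alpha * y * stump_predict(x, threshold, polarity))
                   for w, x, y in zip(weights, xs, ys)]
        total = sum(weights)
        weights = [w / total for w in weights]
    return ensemble

def predict(ensemble, x):
    """Weighted vote of all stumps in the ensemble."""
    score = sum(alpha * stump_predict(x, t, p) for alpha, t, p in ensemble)
    return 1 if score >= 0 else -1

# Hypothetical labeled training data: no single threshold separates the classes.
xs = [1, 2, 3, 4, 5, 6]
ys = [1, 1, -1, -1, 1, 1]

single = adaboost(xs, ys, rounds=1)
boosted = adaboost(xs, ys, rounds=3)
single_predictions = [predict(single, x) for x in xs]
boosted_predictions = [predict(boosted, x) for x in xs]
```

On this toy data a single stump necessarily misclassifies some samples, while the weighted vote of three stumps reproduces all six labels. The labeled pairs `(xs, ys)` play the role of the supervised training data from which the combined function is inferred.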
Thus, it is a common user experience for a search for a given object to draw a multitude of unrelated objects into the search results. For example, a search for the shoe object class “boots” may return multiple search results that do not relate to boots as apparel. Accordingly, users waste valuable time sifting through this unrelated content.