The exemplary embodiment relates to object recognition and finds particular application in connection with detection of products in a retail environment.
Object detection is a basic problem in image understanding and an active topic of research in computer vision. Given an image and a predefined set of objects or categories, the goal of object detection is to output all regions that contain instances of the considered object or category of objects. Object detection is a challenging task, due both to the variety of imaging conditions (e.g., viewpoints, environments, and lighting conditions) and to the scale of the search space, where millions of candidate regions often have to be considered for a single image.
Existing object detection algorithms often cast detection as a binary classification problem: given a candidate window and a candidate class, the goal is to determine whether the window contains an object of the considered class or not. This generally includes computing a feature vector describing the window and classifying the feature vector with a binary classifier, e.g., a linear Support Vector Machine (SVM) classifier. Since the candidate windows usually overlap, it is common for more than one candidate window to be placed over the same object instance. A non-maximum suppression step may be performed over all the scored candidates to remove the redundant windows before producing a final score.
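By way of illustration only, the scoring and non-maximum suppression steps described above may be sketched as follows. This is a minimal sketch and not the implementation of any cited system: the function names (`linear_score`, `iou`, `non_max_suppression`), the box layout `(x1, y1, x2, y2)`, the greedy suppression strategy, and the overlap threshold of 0.5 are all assumptions introduced here for illustration.

```python
from typing import List, Sequence, Tuple

# Hypothetical box layout for illustration: (x1, y1, x2, y2) in pixels.
Box = Tuple[float, float, float, float]


def linear_score(features: Sequence[float], weights: Sequence[float], bias: float) -> float:
    """Score a window's feature vector with a linear classifier (w . x + b)."""
    return sum(f * w for f, w in zip(features, weights)) + bias


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union overlap between two boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def non_max_suppression(boxes: List[Box], scores: List[float],
                        iou_threshold: float = 0.5) -> List[int]:
    """Greedy NMS: keep the best-scoring box, drop boxes overlapping it too much.

    Returns the indices of the retained boxes, highest score first.
    The 0.5 threshold is an assumed value, not one taken from the text.
    """
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    kept: List[int] = []
    for i in order:
        # Retain a box only if it does not overlap any already-kept box too much.
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in kept):
            kept.append(i)
    return kept


if __name__ == "__main__":
    candidate_boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (50, 50, 60, 60)]
    candidate_scores = [0.9, 0.8, 0.7]
    print(non_max_suppression(candidate_boxes, candidate_scores))
```

In this sketch, the first two candidate windows overlap heavily and are treated as covering the same object instance, so only the higher-scoring of the two is retained, while the third, non-overlapping window is kept.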
A sliding window may be used to scan a large set of possible candidate windows. In this approach, a window is moved stepwise across the image in fixed increments so that a decision is computed for multiple overlapping windows. In practice, this approach uses windows of different sizes and aspect ratios to detect objects at multiple scales, with different shapes, and from different viewpoints. Consequently, millions of windows are tested per image. The computational cost is, therefore, one of the major impediments to practical detection systems.
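The growth in the number of candidate windows can be illustrated with a short sketch. The function names (`sliding_windows`, `count_windows`), the window sizes, and the stride values below are hypothetical and are chosen only to show how the enumeration scales with image size, step size, and the number of window shapes.

```python
from typing import Iterator, List, Tuple


def sliding_windows(image_w: int, image_h: int,
                    sizes: List[Tuple[int, int]],
                    stride: int) -> Iterator[Tuple[int, int, int, int]]:
    """Yield (x1, y1, x2, y2) for every window position, size by size.

    `sizes` is a list of (width, height) pairs; `stride` is the fixed
    step, in pixels, used in both directions.
    """
    for win_w, win_h in sizes:
        for y in range(0, image_h - win_h + 1, stride):
            for x in range(0, image_w - win_w + 1, stride):
                yield (x, y, x + win_w, y + win_h)


def count_windows(image_w: int, image_h: int,
                  sizes: List[Tuple[int, int]], stride: int) -> int:
    """Count the windows without enumerating them."""
    total = 0
    for win_w, win_h in sizes:
        if win_w > image_w or win_h > image_h:
            continue  # this window shape does not fit in the image
        positions_x = (image_w - win_w) // stride + 1
        positions_y = (image_h - win_h) // stride + 1
        total += positions_x * positions_y
    return total


if __name__ == "__main__":
    # One 64x64 window shape, stride 4, on a 1000x1000 image.
    print(count_windows(1000, 1000, [(64, 64)], 4))
```

Under these assumed values, a single 64×64 window shape moved in steps of 4 pixels over a 1000×1000 image already gives 55,225 positions. Repeating the scan for several dozen sizes and aspect ratios, or using a higher-resolution image, readily brings the total into the millions.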
In the retail environment, the ability to detect and count specific products on store shelves would facilitate many applications, such as counting products, identifying out-of-stock products, and measuring planogram compliance. However, there may be thousands of products that can appear in a shelf image, and shelf images can be of very high resolution. The standard exhaustive approach thus tends not to be practical for retail applications.
Two approaches have been proposed to address this problem. In the first, referred to as region selection, the set of windows that have to be classified is reduced by applying a selection mechanism. For example, one selective search algorithm may produce a few thousand candidate regions in images of a typical size (see, K. E. A. van de Sande, et al., “Segmentation as selective search for object recognition,” ICCV, pp. 1879-1886 (2011), hereinafter, “van de Sande”). This algorithm has been successfully used in detection systems (see, Ramazan G. Cinbis, et al., “Segmentation driven object detection with Fisher vectors,” ICCV, pp. 2968-2975 (2013); and Ross Girshick, et al., “Rich feature hierarchies for accurate object detection and semantic segmentation,” CVPR, pp. 580-587 (2014)). Objectness methods have also been used as a selection mechanism (see, Bogdan Alexe, et al., “Measuring the objectness of image windows,” TPAMI, 34(11): 2189-2202 (2012); and Ming-Ming Cheng, et al., “BING: Binarized normed gradients for objectness estimation at 300 fps,” CVPR, pp. 3286-3293 (2014), hereinafter, “Cheng, et al.”). However, these methods were designed for natural scenes and are not well adapted to the retail domain.
A second approach is referred to as detection by keypoint matching. This is an alternative to sliding-window detection for cases in which the considered objects exhibit little intra-class variability (instance-level detection). The method involves detecting repeatable local descriptors which can be used to perform reliable matching between different views of the same object. Computer vision approaches to product identification in shelf images tend to use such techniques (see, Michele Merler, et al., “Recognizing Groceries In Situ Using In Vitro Training Data,” CVPR, pp. 1-8 (2007); and Edward Hsiao, et al., “Making specific features less discriminative to improve point-based 3D object recognition,” CVPR, pp. 2653-2660 (2010)). Because of the invariance properties of such local descriptors, a few positive matches are typically sufficient to make a decision for standard instance-level detection. However, to obtain repeatable invariant features, local descriptors are extracted only at a sparse set of keypoints, thus discarding a very significant amount of information. Losing such information is disadvantageous for fine-grained problems, such as product detection. Consequently, detection approaches based on keypoint matching tend to confuse similar products.
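The decision rule described above, in which a few positive matches suffice, may be sketched as follows. This is a simplified illustration rather than the method of the cited references: it assumes that local descriptors have already been extracted at keypoints and are given as plain lists of numbers, and it omits keypoint detection and any geometric verification. The function names and the values of `max_distance` and `min_matches` are hypothetical.

```python
import math
from typing import List, Sequence

Descriptor = Sequence[float]


def euclidean(a: Descriptor, b: Descriptor) -> float:
    """Euclidean distance between two descriptors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def count_matches(query: List[Descriptor], reference: List[Descriptor],
                  max_distance: float) -> int:
    """Count query descriptors whose nearest reference descriptor is close enough."""
    if not reference:
        return 0
    matches = 0
    for q in query:
        nearest = min(euclidean(q, r) for r in reference)
        if nearest <= max_distance:
            matches += 1
    return matches


def is_same_object(query: List[Descriptor], reference: List[Descriptor],
                   max_distance: float = 0.2, min_matches: int = 3) -> bool:
    """Declare a detection when at least `min_matches` descriptors match.

    Both thresholds are assumed values chosen for illustration only.
    """
    return count_matches(query, reference, max_distance) >= min_matches


if __name__ == "__main__":
    reference_descriptors = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    query_descriptors = [[0.05, 0.0], [1.0, 0.1], [0.0, 0.95], [5.0, 5.0]]
    print(is_same_object(query_descriptors, reference_descriptors))
```

Because the decision in this sketch rests on a small number of matched descriptors, two products that share several locally similar regions, such as a common brand logo, could both satisfy the rule, which is consistent with the confusion between similar products noted above.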
There remains a need for a system and method that allow identification of high quality candidate regions while discriminating between very similar products.