The exemplary embodiment relates to object detection and finds particular application in connection with a system and method for approximating window-level operations at the patch level for reducing computation time.
For many applications, the ability to detect and locate specific objects in images provides useful information. Given an image and a predefined set of semantic classes, e.g., “car” and “pedestrian,” the goal is often to output all regions that contain instances of the considered class or classes. These image regions are most commonly represented by rectangles referred to as bounding boxes or windows. The bounding boxes can vary in size and aspect ratio, depending on the anticipated size and shape of the object. Object detection is a challenging task, due in part to the variety of instances of objects of the same class, to the variety of imaging conditions (viewpoints, environments, lighting), and to the scale of the search space (typically millions of candidate regions for a single image).
Object detection finds application in transportation services, such as in locating license plates or vehicles in images captured by a toll-plaza or car park camera. In retail businesses, detecting and counting specific objects on store shelves would enable applications such as planogram compliance checking or out-of-stock detection. However, many existing object detection systems are not sufficiently accurate or efficient for practical application.
Existing object detection algorithms cast detection as a binary classification problem: given a candidate window and a candidate class, the goal is to determine whether the window contains an object of the considered class, or not. This generally includes computing a feature vector describing the window and classifying the feature vector with a binary classifier, e.g., a linear SVM. A sliding window may be used to scan a large set of possible candidate windows. In this approach, a window is moved stepwise across the image in fixed increments so that a decision is computed for multiple overlapping windows. In practice, this approach uses windows of different sizes and aspect ratios to detect objects at multiple scales, with different shapes, and from different viewpoints. Consequently, millions of windows are tested per image. Building window-level feature vectors generally involves the aggregation of patch descriptors for a set of patches constituting the window and applying costly non-linearities to the aggregated vectors. The computational cost is, therefore, one of the major impediments to practical implementation.
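The sliding-window scan described above can be sketched as follows. This is a minimal illustration only; the function name `sliding_windows`, the parameter names, and the example sizes and stride are assumptions introduced here and do not come from any cited method.

```python
def sliding_windows(img_w, img_h, sizes, stride):
    """Yield (x, y, w, h) for every window that fits entirely inside the image.

    sizes  -- iterable of (width, height) pairs, one per window shape (assumed)
    stride -- fixed step, in pixels, between neighboring windows (assumed)
    """
    for w, h in sizes:
        # range() is empty when the window is larger than the image,
        # so oversized window shapes are skipped automatically.
        for y in range(0, img_h - h + 1, stride):
            for x in range(0, img_w - w + 1, stride):
                yield (x, y, w, h)
```

Even with only two window shapes and a coarse stride, a single image of modest size yields thousands of overlapping candidates, which suggests how the count can reach the millions once many scales and aspect ratios are scanned.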
Extracting patch-based features from the candidate windows is described, for example, in Cinbis, et al., “Segmentation Driven Object Detection with Fisher Vectors,” IEEE Intern'l Conf. on Computer Vision, pp. 2968-2975 (2013), and is one of the most successful approaches to detection. The method generally includes: extracting a set of patches from the candidate window, typically on a dense multi-scale grid (the number of patches depends on the window size); computing one low-level (patch) descriptor per patch; encoding each patch descriptor using a non-linear mapping function that embeds the low-dimensional patch descriptors into a higher-dimensional space; aggregating these patch encodings using a pooling function, typically by summing the patch encodings; applying a set of normalization steps to the pooled representation, such as a power transformation or ℓ2-normalization; and classifying the resulting window representation, for example with a linear classifier. As the non-linear mapping function, a Fisher Vector encoding function may be used.
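The encode, pool, normalize, and classify steps listed above might be sketched as follows, assuming the patch descriptors for one window are already available as rows of a NumPy array. The encoding used here (a linear projection followed by a rectifier) is a deliberately simple stand-in and is not a Fisher Vector; the function names, the `projection` matrix, the power `alpha`, and the classifier parameters `w` and `b` are all hypothetical.

```python
import numpy as np

def encode(descriptors, projection):
    """Stand-in non-linear embedding (NOT a Fisher Vector).

    descriptors -- (n_patches, d) low-dimensional patch descriptors
    projection  -- (d, D) matrix with D > d (assumed, for illustration)
    """
    return np.maximum(descriptors @ projection, 0.0)

def window_representation(encodings, alpha=0.5, eps=1e-12):
    """Sum-pool the patch encodings, then power- and l2-normalize."""
    pooled = encodings.sum(axis=0)                       # pooling by summation
    powered = np.sign(pooled) * np.abs(pooled) ** alpha  # power transformation
    return powered / (np.linalg.norm(powered) + eps)     # l2-normalization

def score_window(descriptors, projection, w, b=0.0):
    """Score one window with a linear classifier (weights w, bias b)."""
    z = window_representation(encode(descriptors, projection))
    return float(z @ w + b)
```

In this sketch the two normalizations are applied to the pooled vector, so they must be recomputed for every candidate window.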
Since the same patch may appear in many candidate windows, some of the patch-level computation may be re-used for different windows, thereby reducing computational cost. Accordingly, the first steps in the method may include computing a patch descriptor for each patch in the image, encoding the patch descriptor, and storing the encoding. Then, for each window, the encodings of the patches it contains are retrieved, pooled, and normalized, as described above.
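A minimal sketch of this re-use is given below, assuming for simplicity that the patches lie on a single regular grid so that their encodings can be stored in one array indexed by grid row and column. The names `precompute_encodings` and `window_vector`, the grid layout, and the caller-supplied `encode` function are assumptions made for illustration.

```python
import numpy as np

def precompute_encodings(patch_descriptors, encode):
    """Encode every patch in the image once and store the result.

    patch_descriptors -- (rows, cols, d) array, one descriptor per grid cell
    encode            -- hypothetical function mapping (n, d) -> (n, D)
    """
    rows, cols, d = patch_descriptors.shape
    encodings = encode(patch_descriptors.reshape(-1, d))
    return encodings.reshape(rows, cols, -1)

def window_vector(grid, r0, c0, r1, c1, alpha=0.5, eps=1e-12):
    """Retrieve stored encodings for grid rows r0..r1-1 and columns c0..c1-1,
    then pool and normalize them at the window level."""
    pooled = grid[r0:r1, c0:c1].sum(axis=(0, 1))
    powered = np.sign(pooled) * np.abs(pooled) ** alpha
    return powered / (np.linalg.norm(powered) + eps)
```

Here `precompute_encodings` runs once per image, whereas `window_vector` still runs once per candidate window.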
Unfortunately, even when precomputing the patch-level encodings, the cost of the detection procedure remains prohibitive for many practical applications. For example, it may take up to several minutes per image and per candidate class, due to the window-level operations. Solutions to mitigate this problem have been proposed. However, such methods tend to have drawbacks, such as compromising the classification accuracy, requiring unrealistic assumptions to be made (e.g., a maximum of one object per image), or only marginally reducing the computational cost. For example, in one method, the patches are individually scored prior to aggregation at the window level, ignoring the costly normalization step. See, for example, Chen, et al., “Efficient maximum appearance search for large-scale object detection,” Proc. IEEE Conf. on Computer Vision and Pattern Recognition (CVPR13), pp. 3190-3197 (2013). While reducing cost, this comes at the expense of a significant loss in performance.
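The trade-off in scoring patches individually can be illustrated with a short sketch. It shows only the general idea of applying a linear classifier per patch and summing the scores; it is not an implementation of the cited method, and the function names, the weight vector `w`, and the power `alpha` are hypothetical.

```python
import numpy as np

def patch_level_score(encodings, w):
    """Score each patch encoding independently, then add the scores.

    By linearity this equals w applied to the unnormalized sum of the
    encodings, so no window-level normalization is involved.
    """
    return float((encodings @ w).sum())

def window_level_score(encodings, w, alpha=0.5, eps=1e-12):
    """Score the pooled vector after power and l2 normalization."""
    pooled = encodings.sum(axis=0)
    powered = np.sign(pooled) * np.abs(pooled) ** alpha
    z = powered / (np.linalg.norm(powered) + eps)
    return float(z @ w)
```

Because the normalizations are non-linear functions of the pooled vector, the window-level score cannot in general be decomposed into a sum of independent per-patch scores, which is why skipping them changes the classifier's output.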
There remains a need for a system and method for object detection which reduces computational cost without significantly reducing detection performance.