The exemplary embodiment relates to object localization and finds particular application in a system and method for detection of prominent objects in images.
Object localization finds application in a variety of tasks, such as image thumbnailing, product identification on small screens, categorization, and information extraction. Thumbnailing is used to produce a ‘thumbnail’ of an image which focuses on a region of the image. Rather than reducing the size of the whole image, it tends to be more informative if the thumbnail shows the prominent object. Product identification is useful on smartphones, for example, where the size of the screen makes it difficult to see an object which occupies a small portion of the screen. Fine-grained categorization is a challenging many-class vision task where the goal is to assign an image of an “entry-level” class to its specific sub-class (e.g., to classify 200 bird species or 120 dog breeds). Detecting the prominent subject first generally has a positive impact on categorization accuracy.
In object localization, the input image is expected to contain an object (or more than one), and the aim is to output information about the location of each object, such as the rectangle that tightly encompasses the object. In some instances, it may be desirable to identify a single, prominent object. The definition of “prominent object” may be specified by example: a training dataset of images may be provided with corresponding annotations of the true object locations.
One efficient method to detect prominent objects in images uses a similarity-based approach, as described in J. A. Rodriguez-Serrano, et al., “Predicting an object location using a global image representation,” ICCV, pp. 1729-1736, 2013, hereinafter, Rodriguez-Serrano 2013, and in U.S. Pub. Nos. 20140056520 and 20130182909. This method, referred to herein as data-driven detection (DDD), encodes each image using a single, spatially-variant feature vector, and expresses detection as a “query” to an annotated training set: given a new image, the method first computes its similarity to all images in the training set, then selects the images with highest similarity (the “neighbors”). The output rectangle denoting the location of the object in the query image is computed as a simple function of the neighbors' rectangles. The method is efficient for several object detection tasks, since it expresses detection as an image retrieval task and leverages a fast image retrieval implementation. Other methods are based on probability maps, as described, for example, in U.S. Pub. No. 20140270350.
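The lookup step described above can be sketched in a few lines. In this minimal sketch (all names are illustrative, not from the cited references), each training image has a global feature vector and an annotated bounding box; the predicted box for a query is a similarity-weighted combination of the boxes of the most similar training images:

```python
# Hypothetical sketch of the DDD lookup step: retrieve the k training
# images most similar to the query, then compute the output rectangle
# as a simple (here, similarity-weighted average) function of the
# neighbors' rectangles. Names and the weighting choice are illustrative.
import numpy as np

def ddd_predict_box(q, X, B, k=5):
    """Predict a bounding box (x1, y1, x2, y2) for query feature q.

    q : (d,) query feature vector
    X : (n, d) training feature vectors
    B : (n, 4) annotated rectangles for the training images
    """
    # Cosine similarity between the query and every training vector.
    sims = (X @ q) / (np.linalg.norm(X, axis=1) * np.linalg.norm(q) + 1e-12)
    # Indices of the k most similar training images (the "neighbors").
    nn = np.argsort(-sims)[:k]
    # Simple function of the neighbors' rectangles: a weighted average.
    w = np.clip(sims[nn], 0.0, None)
    w = w / (w.sum() + 1e-12)
    return w @ B[nn]
```

Because the per-image representation is a single vector, this step reduces to a standard similarity search and can reuse any fast image-retrieval implementation.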
The success of the DDD method relies on several factors, such as:
1. Localized representation: for similarity-based localization, two feature vectors that represent two objects (of the same class but slightly different appearance) in the same location have to be very similar. Hence, the feature encoding needs to be sensitive to the location of the object. This is in contrast with other computer vision tasks, such as image categorization, where features need to encode the appearance and yet be robust to changes in location. DDD achieves the encoding of location information by using Fisher vectors with dense spatial pyramids or probability maps, which can be seen as very localized mid-level features.
2. Supervised or semantic representation: While DDD obtains good results when used with standard features, such as Fisher vectors, an improvement can be obtained when the representation involves some supervised learning to add task-related information or semantics through mid-level attributes. In one variant of DDD, supervision is achieved by adopting a “learning to rank” approach, where a metric learning algorithm finds a projection of the data which improves the retrieval accuracy. In another variant, a mid-level feature is built from the output of a localized foreground/background classifier, discriminatively learned on the training set. The latter can be interpreted as very localized attributes.
3. Compression: Since DDD expresses detection as a retrieval problem, compressing the feature vectors is beneficial to ensure a fast lookup. DDD offers two variants. In the metric learning variant, a low-rank metric learning algorithm is used which allows working in a projected (lower-dimensional) subspace. Probability maps are already a compact representation (e.g., a maximum of 2500 dimensions, which is several orders of magnitude smaller than standard Fisher vectors).
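The compression property of the low-rank variant can be illustrated as follows. A low-rank metric learning algorithm yields a projection matrix W of shape d × r with r much smaller than d; similarity under the learned metric M = W Wᵀ then equals a dot product between the r-dimensional projected vectors, so retrieval runs entirely in the compact subspace. In this sketch W is random purely for illustration; in practice it would be learned on the training set:

```python
# Sketch of low-rank compression for similarity search. W stands in
# for a learned low-rank projection (here random, for illustration).
import numpy as np

rng = np.random.default_rng(0)
d, r, n = 1024, 32, 100          # original dim, projected dim, training size
W = rng.standard_normal((d, r))  # stand-in for a learned projection
X = rng.standard_normal((n, d))  # training feature vectors
q = rng.standard_normal(d)       # query feature vector

# Project once, offline, and store only the r-dimensional codes.
X_proj = X @ W
q_proj = q @ W

# Similarity under the learned metric equals a dot product in the
# projected space: x^T (W W^T) q == (x W) . (q W).
full = X @ (W @ W.T) @ q
fast = X_proj @ q_proj
assert np.allclose(full, fast)
```

Only the 32-dimensional codes need to be stored and compared at query time, a 32× reduction in this example.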
While the DDD system provides very good performance, comparable to methods such as the DPM method of P. Felzenszwalb, et al., “Object Detection with Discriminatively Trained Part-Based Models,” TPAMI, 32(9):1627-1645, 2010, it would be advantageous to provide a method of object localization which takes spatial information into account effectively in generating a global representation of the image while also providing a compact representation.