Computer vision allows computing systems to understand an image or a sequence of images (e.g., video) by extracting information from the image or images. The ability of a computing system to accurately detect and localize objects in images has numerous applications, such as content-based searching, targeted advertisements, and medical diagnosis and treatment. In object recognition methods and systems, however, it is a challenge to teach the computing system to detect and localize particular rigid or articulated objects in a given image.
Object recognition methods and systems operate based on a given set of training images that have been annotated with the location and type of object shown in each image. However, gathering and annotating training images is expensive and time consuming, and requires human input. For example, images of certain object types may be gathered by submitting textual queries to existing image search engines, with the results filtered and annotated by human labelers. Such approaches are expensive or unreliable for object localization and segmentation because human interaction is required to provide accurate bounding boxes and segmentations of the object. Alternatively, algorithms requiring less training data may be used for object localization and segmentation. Such algorithms identify particular invariant properties of an object in order to generalize across all modes of variation of the object from existing training data. However, the accuracy of object recognition systems increases with the amount of training data. Accordingly, it is a challenge to develop training sample sets large enough to obtain satisfactory results.