1. Field
Embodiments presented herein provide techniques for detecting objects in images and, more specifically, techniques that utilize a multi-level framework for detection and localization of objects in images.
2. Description of the Related Art
The appearance of an object in an image can change profoundly with pose, camera view, and interactions of the object with other objects. For example, the appearance of a person can change depending on the pose of the person, the camera view, and whether the person is walking, riding a bike, etc. To deal with such variations, some traditional object detectors model multiple subcategories (e.g., multiple views, poses, etc. of a “person”) for each object category. However, these object detectors typically require the number of subcategories to be manually pre-defined. As a result, the number of subcategories may not adequately reflect the actual appearance variation of the object. In addition, some object detectors model interactions between objects, but typically ignore subtle joint appearance changes between objects caused by their interaction. For example, in an image of a person riding a bicycle, both the bicycle and the person exhibit view-consistent appearance changes, including the rider's legs occluding specific parts of the bicycle and the bicycle creating a highly textured background close to the rider's legs. This joint appearance change of the person and the bicycle, resulting from their interaction, is ignored by traditional object-detection models.