Computer vision strives to duplicate the abilities of human vision by electronically perceiving and understanding an image. Fine-grained recognition refers to the task of distinguishing subordinate categories, such as bird species, dog breeds, aircraft, or car models. Annotation has proven useful in fine-grained recognition and other fields. In particular, part annotation (e.g., a keypoint or bounding box around a semantic part) has proven especially useful. For example, given an image of a particular object (e.g., a bird), a user may want to identify where the object's various parts are located (e.g., the bird's head, beak, wing, feet, and eyes).
Although annotations for various visual attributes (e.g., color) may be available, annotations for the locations of these parts are lacking. In a manual approach, a user is required to hand-annotate where each of these parts is located in the image. Unfortunately, for a large collection of images, this hand-annotation process is extremely time-consuming and cost-prohibitive.
The most common automated approach is to generate a large set of proposed parts and train classifiers to predict local attributes at each proposed part. The proposed part that best predicts the attributes at a particular semantic location is then determined to be that semantic location (e.g., if one part proposal is the best predictor of wing color, then that part proposal is classified as a bird wing). Unfortunately, this approach ignores the strong correlation between attributes at different semantic parts, so a part proposal is often incorrectly classified. For example, if wing color tends to co-occur with the color of another part, a proposal covering that other part may predict wing color well and be mislabeled as the wing.
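The proposal-scoring approach described above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the proposal identifiers (`p0`, `p1`), the toy feature vectors, the `wing_color` labels, and the use of a nearest-centroid classifier are hypothetical and are not taken from any particular system. The sketch trains one classifier per part proposal, scores each on held-out images, and assigns the semantic label to the best-scoring proposal.

```python
from statistics import mean

def train_centroids(features, labels):
    """Nearest-centroid classifier: one mean feature vector per attribute value."""
    groups = {}
    for x, y in zip(features, labels):
        groups.setdefault(y, []).append(x)
    return {y: [mean(col) for col in zip(*xs)] for y, xs in groups.items()}

def predict(centroids, x):
    """Return the attribute value whose centroid is closest to feature vector x."""
    def sq_dist(c):
        return sum((a - b) ** 2 for a, b in zip(x, c))
    return min(centroids, key=lambda y: sq_dist(centroids[y]))

def accuracy(centroids, features, labels):
    hits = sum(predict(centroids, x) == y for x, y in zip(features, labels))
    return hits / len(labels)

def best_proposal(train, train_labels, test, test_labels):
    """Score every proposal independently and return (best id, all scores).

    train/test map a proposal id to one feature vector per image.
    """
    scores = {}
    for pid in train:
        model = train_centroids(train[pid], train_labels)
        scores[pid] = accuracy(model, test[pid], test_labels)
    return max(scores, key=scores.get), scores

# Hypothetical toy data: the attribute is wing color, one label per image.
train_labels = ["blue", "brown", "blue", "brown"]
test_labels = ["blue", "brown"]
train = {
    "p0": [[0.9, 0.1], [0.1, 0.8], [0.8, 0.2], [0.2, 0.9]],  # tracks the label
    "p1": [[0.5, 0.5], [0.5, 0.4], [0.4, 0.5], [0.6, 0.5]],  # mostly noise
}
test = {
    "p0": [[0.85, 0.15], [0.15, 0.85]],
    "p1": [[0.55, 0.45], [0.45, 0.5]],
}

best, scores = best_proposal(train, train_labels, test, test_labels)
print(best, scores)  # the best-scoring proposal would be labeled the wing
```

Note that each proposal is scored in isolation against a single attribute. Nothing in this scheme accounts for attributes at other semantic parts, which is the limitation identified above.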