The present system relates to fine-grained image classification of real-world physical objects.
Recognition of objects from real-world images is an important task that has many applications, including smart car, security, learning associated with physical structures and products, among others. Fine-grained image classification needs to discern subtle differences among similar classes. The majority of existing approaches have thus been focusing on localizing and describing discriminative object parts in fine-grained domains. Various pose-normalization pooling strategies combined with 2D or 3D geometry have been preferred for recognizing birds. The main drawback of these approaches is that part annotations are significantly more challenging to collect than image labels. Instead, a variety of methods have been developed towards the goal of finding object parts in an unsupervised or semi-supervised fashion.
To provide good features for recognition, another prominent direction is to adopt detection and segmentation methods as an initial step and to filter out noise and clutter in background. However, better feature through segmentation always comes with computational cost as segmentation is often computationally expensive.
While most existing works focus on single-label classification problem, it is more natural to describe real world images with multiple labels like tags or attributes. According to the assumptions on label structures, previous work on structural label learning can be roughly categorized as learning binary, relative or hierarchical attributes.
Much prior work focuses on learning binary attributes that indicate the presence of a certain property in an image or not. For instance, previous works have shown the benefit of learning binary attributes for face verification, texture recognition, clothing searching, and zero-shot learning. However, binary attributes are restrictive when the description of certain object property is continuous or ambiguous.
To address the limitation of binary attributes, comparing attributes has gained attention in the last years. The relative-attribute approach learns a global linear ranking function for each attribute, offering a semantically richer way to describe and compare objects. While a promising direction, a global ranking of attributes tends to fail when facing fine-grained visual comparisons. One existing system provides learning local functions that tailor the comparisons to neighborhood statistics of the data. A Bayesian strategy can infer when images are indistinguishable for a given attribute.