The present disclosure relates to digital data processing and, in particular, to object recognition in images.
Extracting useful features from a scene is an essential step in computer vision and multimedia analysis tasks. In the field of neuroscience, a theory of image recognition was established by D. Hubel and T. Wiesel in their paper titled “Receptive fields and functional architecture of monkey striate cortex” (The Journal of Physiology, 195(1):215, 1968). Many recent models for extracting features from images to recognize objects are founded on their theory that visual information is transmitted from the primary visual cortex (V1) over extrastriate visual areas (V2 and V4) to the inferior temporal cortex (IT), as illustrated in FIG. 1. IT in turn is a major source of input to the prefrontal cortex (PFC), which is involved in linking perception to memory and action. The pathway from V1 to IT, which is called the visual frontend, consists of a number of simple and complex layers. The lower layers extract simple features that are invariant to scale, position, and orientation at the pixel level, while higher layers detect complex features at the object-part level. Pattern reading at the lower layers is unsupervised, whereas recognition at the higher layers involves supervised learning. Computational models proposed by Serre (T. Serre. Learning a dictionary of shape-components in visual cortex: comparison with neurons, humans and machines. PhD thesis, Massachusetts Institute of Technology, 2006), Lee (H. Lee, R. Grosse, R. Ranganath, and A. Ng. Convolutional deep belief networks for scalable unsupervised learning of hierarchical representations. In International Conference on Machine Learning, 2009), and Ranzato (M. Ranzato, F. Huang, Y. Boureau, and Y. LeCun. Unsupervised learning of invariant feature hierarchies with applications to object recognition. In Computer Vision and Pattern Recognition (CVPR), 2007) show such a multi-layer generative approach to be effective in object recognition.
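The alternation of simple and complex layers described above can be sketched in code. The following is a minimal illustration, not taken from any of the cited models: a "simple" layer correlates the image with an oriented filter, and a "complex" layer max-pools the response over local neighborhoods so that the feature survives small shifts in position. The filter and pooling size are illustrative choices.

```python
import numpy as np

def simple_layer(image, kernel):
    """'Simple' layer: correlate the image with an oriented filter
    (here a hand-chosen vertical-edge detector)."""
    kh, kw = kernel.shape
    h, w = image.shape
    out = np.zeros((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

def complex_layer(feature_map, pool=2):
    """'Complex' layer: max-pool over pool x pool neighborhoods,
    giving local position invariance."""
    h, w = feature_map.shape
    h, w = h - h % pool, w - w % pool
    fm = feature_map[:h, :w]
    return fm.reshape(h // pool, pool, w // pool, pool).max(axis=(1, 3))

# Toy input: an 8x8 image whose right half is bright (a vertical edge).
image = np.zeros((8, 8))
image[:, 4:] = 1.0
vertical_edge = np.array([[-1.0, 1.0],
                          [-1.0, 1.0]])

s1 = simple_layer(image, vertical_edge)  # strong response only along the edge
c1 = complex_layer(s1)                   # pooled response tolerates small shifts
```

Stacking several such pairs, with filters learned rather than hand-chosen, yields the multi-layer hierarchies of the cited models.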
On the other hand, heuristic signal-processing approaches have also been proposed to extract features from images. Both of these approaches generate numerical representations of an image when extracting object features from it.
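Color and texture statistics are typical examples of such heuristic features. The sketch below, which is illustrative rather than drawn from any particular cited system, builds one numerical representation by concatenating an intensity histogram (a color/brightness statistic) with a gradient-magnitude histogram (a texture statistic):

```python
import numpy as np

def heuristic_features(image, bins=8):
    """Heuristic signal-processing features for a grayscale image in [0, 1]:
    an intensity histogram concatenated with a gradient-magnitude histogram."""
    # Brightness statistics: normalized intensity histogram.
    hist, _ = np.histogram(image, bins=bins, range=(0.0, 1.0))
    hist = hist / hist.sum()
    # Texture statistics: normalized histogram of gradient magnitudes.
    gy, gx = np.gradient(image)
    grad = np.hypot(gx, gy)
    ghist, _ = np.histogram(grad, bins=bins, range=(0.0, grad.max() + 1e-9))
    ghist = ghist / ghist.sum()
    # One numerical representation of the image.
    return np.concatenate([hist, ghist])

rng = np.random.default_rng(0)
feats = heuristic_features(rng.random((16, 16)))
# feats is a length-16 vector; each half sums to 1 (two normalized histograms).
```

Unlike the model-based hierarchy, nothing here is learned; the representation is fixed by the chosen statistics.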
Statistics from evaluating these two approaches on the same image-labeling task reveal the following results: first, when the number of training instances is small, the model-based approach outperforms the heuristic-based one; second, while both feature sets commit prediction errors, each does better on certain objects: the neuroscience-based model tends to do well on objects of a regular, rigid shape with similar interior patterns, whereas the heuristic-based model performs better in recognizing objects of an irregular shape with similar colors and textures; third, for objects that exhibit a wide variety of shapes and interior patterns, neither model performs well. The first two observations confirm that feature extraction must account for both feature invariance and feature diversity. A feed-forward pathway model designed by Poggio's group (M. Riesenhuber and T. Poggio. Are Cortical Models Really Bound by the Binding Problem? Neuron, 24(1):87-93, 1999) holds promise for obtaining invariant features. However, additional signals must be collected to enhance the diversity aspect. As Serre indicates, feedback signals are transmitted back to V1 to pay attention to details. Biological evidence suggests that a feedback loop in the visual system instructs cells to “see” local details such as color-based shapes and shape-based textures.
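Because the two feature sets err on different objects, one natural way to obtain both invariance and diversity is to combine them into a single representation. The sketch below is a hypothetical illustration (the function name and the normalization choice are assumptions, not part of the disclosure): each feature set is L2-normalized before concatenation so that neither set's scale dominates the fused vector fed to a downstream classifier.

```python
import numpy as np

def fuse_features(model_feats, heuristic_feats):
    """Concatenate a model-based (invariance-oriented) feature vector with a
    heuristic (diversity-oriented) one, L2-normalizing each set first so that
    neither dominates the combined representation."""
    def l2norm(v):
        v = np.asarray(v, dtype=float)
        n = np.linalg.norm(v)
        return v / n if n > 0 else v
    return np.concatenate([l2norm(model_feats), l2norm(heuristic_feats)])

fused = fuse_features([3.0, 4.0], [1.0, 0.0, 0.0])
# fused == [0.6, 0.8, 1.0, 0.0, 0.0]
```

Simple concatenation does not, however, address the third observation: objects with widely varying shapes and interior patterns still require additional signals, such as the feedback-driven local details discussed above.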