When a person enters a room that he or she has never seen before, the person's visual system immediately begins to parse the scene. The eyes move (saccade) to regions of the room that contain objects of interest, and as these objects are found, the brain immediately begins classifying them. If the person sees something new and unrecognized, the person might ask a friend what the item is called. While this task is trivial for most humans to accomplish, it has proven very difficult for computers to perform well. Because human performance far exceeds that of the best machine vision systems to date, building an artificial system inspired by the principles underlying human vision has been an attractive idea since the field of computer vision was conceived. However, most bio-inspired systems incorporate only one aspect of vision, have not been robustly tested on real-world image datasets, and/or are not suited for real-time applications. The majority of research in machine vision has dealt with individual problems, such as recognizing or segmenting objects from a scene. Much less work has been done in ascertaining the best way to combine various vision algorithms.
Recently, numerous groups have constructed object recognition algorithms capable of accurately classifying over 100 distinct object categories in real-world image datasets. Much of this work has been tested using the Caltech-101 dataset, which consists of 101 classes of objects, each containing many images (see literature reference no. 1, below in the Detailed Description). Achieving high accuracy on this dataset is difficult. Because each class contains a variable number of images, the standard procedure for reporting results on this dataset is to calculate the accuracy for each class separately and then take the mean of these per-class accuracies. A failure to do so gives results that are overly optimistic, because some of the easier classes contain more images than some of the harder ones. All of the results on this dataset are determined in this manner.
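The difference between the two reporting conventions can be illustrated with a short sketch. The helper function and the toy data below are hypothetical, purely to show the arithmetic:

```python
import numpy as np

def mean_per_class_accuracy(y_true, y_pred):
    """Compute accuracy separately for each class, then average.

    Every class is weighted equally regardless of how many test
    images it contains (illustrative helper, not from the source).
    """
    classes = np.unique(y_true)
    per_class = [np.mean(y_pred[y_true == c] == c) for c in classes]
    return float(np.mean(per_class))

# Toy example: class 0 has many easy images, class 1 has few hard ones.
y_true = np.array([0] * 8 + [1] * 2)
y_pred = np.array([0] * 8 + [0] * 2)  # class 1 is always misclassified

overall = float(np.mean(y_pred == y_true))          # 8/10 = 0.8
balanced = mean_per_class_accuracy(y_true, y_pred)  # (1.0 + 0.0)/2 = 0.5
```

Pooling all images together reports 80 percent, while the per-class mean reveals the true picture: one class is never recognized at all.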
One of the best non-biologically inspired systems, developed by Berg, achieves 48 percent accuracy on the Caltech-101 dataset using fifteen training images per class and normalizing the results (see literature reference no. 2). Berg's method represents shape by sampling 400 pixel locations from the output of an edge-detecting algorithm, chosen because they have “high edge energy.” The algorithm then uses geometric blur to determine corresponding points on two shapes, together with a custom classifier that uses binary quadratic optimization to obtain a correspondence between an input and data stored in the classifier.
Lazebnik et al. achieved excellent results on the Caltech-101 dataset using spatial pyramid matching kernels (see literature reference no. 3). They attained 56.4 percent accuracy, also using fifteen training images per class. Their algorithm uses scale invariant feature transform (SIFT) descriptors as features that are fed into a spatial pyramid matching kernel (see literature reference no. 4). This kernel allows for precise matching between two collections of features in a high-dimensional space, while preserving some spatial information. Support vector machines (SVMs) are then used for classification (see literature reference no. 5).
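A greatly simplified sketch of the pyramid-matching idea follows. It uses quantized codewords in place of SIFT descriptors, and its level weights follow the standard pyramid match scheme in which coarse levels are down-weighted; the function names and parameters are illustrative, not taken from reference no. 3:

```python
import numpy as np

def cell_histograms(points, codes, level, n_codes):
    """Codeword histograms over a 2^level x 2^level grid of cells.
    Point coordinates are assumed normalized to [0, 1)."""
    g = 2 ** level
    hists = np.zeros((g, g, n_codes))
    for (x, y), c in zip(points, codes):
        i = min(int(x * g), g - 1)
        j = min(int(y * g), g - 1)
        hists[i, j, c] += 1
    return hists

def spatial_pyramid_kernel(p1, c1, p2, c2, levels=2, n_codes=2):
    """Weighted sum of histogram intersections across pyramid levels,
    so feature matches that agree at fine spatial resolution
    contribute more to the kernel value."""
    k = 0.0
    for lev in range(levels + 1):
        h1 = cell_histograms(p1, c1, lev, n_codes)
        h2 = cell_histograms(p2, c2, lev, n_codes)
        inter = np.minimum(h1, h2).sum()
        # Standard pyramid match weights: 1/2^L for level 0,
        # then 1/2^(L-l+1) for finer levels l = 1..L.
        w = 1.0 / 2 ** levels if lev == 0 else 1.0 / 2 ** (levels - lev + 1)
        k += w * inter
    return k

# Two identical feature sets match perfectly at every pyramid level.
pts = [(0.1, 0.1), (0.8, 0.8)]
codes = [0, 1]
k_self = spatial_pyramid_kernel(pts, codes, pts, codes)  # = 2.0
```

Because the level weights sum to one, a set matched against itself returns its own feature count, which is the maximum attainable kernel value.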
Hierarchical Model and X (HMAX) is the foremost bio-inspired visual feature extraction architecture (see literature reference nos. 6 through 9). It has been primarily used in conjunction with a SVM classifier on the Caltech-101 dataset. This model is based on studies of visual receptive fields found in cat and monkey visual cortex. One of the best implementations of HMAX achieves 51.2±1.2 percent accuracy when using fifteen training images per class (see literature reference no. 9). While these results are good, the model is currently too slow for real-time applications (see literature reference no. 8).
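The alternating simple-cell/complex-cell structure at the core of HMAX can be sketched roughly as follows. This is an illustrative toy, using a hand-built edge template in place of HMAX's Gabor filters and stored prototype dictionary:

```python
import numpy as np

def s_layer(img, template):
    """S-layer: correlate the image with an oriented template
    (a stand-in for the Gabor filtering in HMAX; illustrative only)."""
    th, tw = template.shape
    h, w = img.shape
    out = np.zeros((h - th + 1, w - tw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = (img[i:i + th, j:j + tw] * template).sum()
    return out

def c_layer(resp, pool=2):
    """C-layer: local max pooling, giving tolerance to small shifts
    in position -- the source of HMAX's invariance properties."""
    h2, w2 = resp.shape[0] // pool, resp.shape[1] // pool
    cropped = resp[:h2 * pool, :w2 * pool]
    return cropped.reshape(h2, pool, w2, pool).max(axis=(1, 3))

# Alternating S and C stages build increasingly invariant features.
rng = np.random.default_rng(0)
img = rng.random((16, 16))
vertical_edge = np.array([[-1.0, 1.0], [-1.0, 1.0]])
features = c_layer(s_layer(img, vertical_edge))  # 7 x 7 feature map
```

Each pooling stage discards precise position information, which is what makes the final features tolerant to clutter and small deformations, at the cost of the heavy filtering that makes full HMAX implementations slow.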
While each of the object recognition algorithms discussed above can deal only with images containing a single object, visual attention algorithms attempt to find interesting areas in a scene, which may contain many objects. Most of the visual attention algorithms that have been developed are feature-based (see literature reference nos. 10 and 11). These systems compute attention using a feature-based approach in which attended regions are determined by constructing a saliency map. Attention is paid to a series of specific locations in the visual scene, as if a spotlight were shone on particular regions of the image. The spotlight is nonspecific and can illuminate an object, a part of an object, a texture or lighting artifact, or nothing at all. Most feature-based methods cannot segment attended objects from the background. Also, in some of these algorithms, the attended regions have been shown to vary under both translation and rotation of the scene (see literature reference no. 12). This is an undesirable trait for a biologically inspired attention mechanism, since it makes little sense that the visual attention of a living creature would change dramatically when it tilts its head. Many of these problems could be eliminated by adopting an object-based visual attention algorithm.
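A minimal center-surround sketch illustrates the feature-based "spotlight" and why it is nonspecific. This toy operates on intensity only and is not any published attention model; the radii and functions below are illustrative assumptions:

```python
import numpy as np

def local_mean(img, r):
    """Mean over a (2r+1) x (2r+1) window, with edge padding."""
    p = np.pad(img, r, mode="edge")
    out = np.zeros(img.shape, dtype=float)
    for dy in range(-r, r + 1):
        for dx in range(-r, r + 1):
            out += p[r + dy:r + dy + img.shape[0],
                     r + dx:r + dx + img.shape[1]]
    return out / (2 * r + 1) ** 2

def saliency_map(intensity):
    """Toy center-surround saliency: |fine-scale mean - coarse-scale
    mean|.  High values mark local contrast, wherever it comes from --
    an object, part of one, or a lighting artifact."""
    return np.abs(local_mean(intensity, 1) - local_mean(intensity, 4))

# A bright square on a dark background produces a saliency peak,
# but the map itself carries no notion of the square as an object.
img = np.zeros((32, 32))
img[12:20, 12:20] = 1.0
s = saliency_map(img)
```

The map highlights contrast without delimiting the square, which is exactly why feature-based methods struggle to segment the attended object from the background.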
Other systems have been developed that integrate attention and object recognition. The Navalpakkam and Itti system can find objects in a visual scene, and for object recognition it constructs a hierarchical tree that stores features (see literature reference nos. 13 and 14). When a new feature is to be classified, their system searches this tree for the closest matching feature. This approach does not take shape directly into account, which may be vital to successfully classifying an object. Additionally, it is unclear how well their system will scale when it needs to distinguish among a large number of object classes.
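The tree-search style of recognition described above can be illustrated with a toy sketch. The data structure and greedy descent below are purely illustrative and are not the actual Navalpakkam and Itti implementation:

```python
import numpy as np

class FeatureNode:
    """One node of a toy hierarchical feature tree; leaves carry labels."""
    def __init__(self, prototype, label=None, children=None):
        self.prototype = np.asarray(prototype, dtype=float)
        self.label = label
        self.children = children or []

def classify(node, feature):
    """Descend the tree greedily, at each level following the child
    whose stored prototype is nearest to the query feature."""
    feature = np.asarray(feature, dtype=float)
    while node.children:
        node = min(node.children,
                   key=lambda c: np.linalg.norm(c.prototype - feature))
    return node.label

# Toy tree: two coarse feature clusters, each with labeled leaves.
root = FeatureNode([0.5, 0.5], children=[
    FeatureNode([0.0, 1.0], children=[FeatureNode([0.0, 0.9], "vertical"),
                                      FeatureNode([0.1, 1.0], "diagonal")]),
    FeatureNode([1.0, 0.0], children=[FeatureNode([0.9, 0.0], "horizontal")]),
])
label = classify(root, [0.95, 0.05])  # nearest cluster, then nearest leaf
```

Note that greedy descent matches individual features rather than whole shapes, and the tree must grow with the number of stored features, which illustrates both concerns raised above.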
Thus, a continuing need exists for a Visual Attention and Object Recognition System (VARS) that combines the ability to find objects in a scene with the power to accurately classify those objects, and which can be configured to request the correct identity of an object with which it is unfamiliar.