It is generally desirable for machine vision applications to be as accurate as possible while operating within reasonable computational constraints. Conventional systems generally rely on color/texture analyses that require target objects to possess one or more highly distinctive local features that can serve as distinguishing characteristics for a classification algorithm. Many objects, however, consist of materials that are prevalent across a wide variety of object categories. Much less effort has been made to characterize objects based on shape, that is, the particular way the component features are arranged relative to one another in two-dimensional (2D) image space. Accordingly, applying an approach that characterizes objects based on shape/contour may be beneficial. Less effort still has been made to characterize objects based on their motion properties, such as velocity. Applying an approach that characterizes objects based on perceived motion may also be beneficial. Furthermore, applying a single approach that characterizes objects using any one or all three of the aforementioned methods may be beneficial.
It has been shown that hierarchical, or deep, models for solving computer vision problems are generally more advantageous than traditional flat architectures. Nearly all existing hierarchical approaches to computer vision are exclusively bottom-up or feed-forward in character. In such models, information flows in only one direction and each subsequent layer can be trained only after the previous layers have been completely learned. A fundamental disadvantage of this exclusively bottom-up or feed-forward approach is that the features previously learned by a given layer in the hierarchy cannot be modified to take into account what is subsequently learned by succeeding layers. As a result, exclusively bottom-up/feed-forward networks contain a large amount of redundancy, with the same information being represented at each stage in the hierarchy. To reduce redundancy in the system, it may be preferable if, instead, all of the layers in the hierarchy were learned simultaneously in a competitive manner, such that the information extracted by one layer is not redundant with that of any other layer, but rather encodes unique information.
Another problem typically encountered in standard approaches to training hierarchical networks for solving computer vision tasks is that the dimensionality of the underlying feature space will often increase from one layer to the next. This increase in dimensionality occurs because each subsequent layer in a hierarchical network receives convergent inputs from a spatial neighborhood of feature detectors located in the previous layer and because there are, in theory, a combinatorially large number of ways of combining spatially distributed features. Thus, the outputs of any given layer are typically of a higher dimensionality than its inputs. Spatial convergence is vital, however, for enabling hierarchical networks to learn feature detectors of increasing complexity and increasing viewpoint invariance at successively higher processing stages. Mathematically, hierarchical networks for solving computer vision tasks would, in general, need to contain progressively more neurons in each subsequent hierarchical layer in order to capture the increased dimensionality and complexity of their inputs. However, it is generally impractical to increase the size of each layer in a hierarchy ad infinitum, as the number of feature detectors in each subsequent layer would grow exponentially.
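The combinatorial growth described above can be illustrated with a rough counting sketch. It assumes, purely for illustration, that each position in a spatial neighborhood holds exactly one of a fixed set of feature types and that every joint arrangement is treated as a candidate feature for the next layer; this is an upper bound on the feature space, not a description of any particular network. The function names `combination_count` and `layer_growth` are hypothetical.

```python
from typing import List


def combination_count(feature_types: int, neighborhood_size: int) -> int:
    """Upper bound on distinct joint patterns when each of `neighborhood_size`
    positions can hold any one of `feature_types` features (an assumption)."""
    return feature_types ** neighborhood_size


def layer_growth(base_features: int, neighborhood_size: int, layers: int) -> List[int]:
    """Candidate feature count per layer if every joint pattern seen by one
    layer became a distinct feature type for the next layer."""
    counts = [base_features]
    for _ in range(layers - 1):
        counts.append(combination_count(counts[-1], neighborhood_size))
    return counts


# Four feature types pooled over a four-position neighborhood, three layers deep.
growth = layer_growth(base_features=4, neighborhood_size=4, layers=3)
# growth == [4, 256, 4294967296]
```

Even with these small illustrative numbers, the third layer would need more than four billion detectors to cover every arrangement, which is why simply enlarging each successive layer is impractical.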
In a deep, hierarchical network, there is an incentive to introduce additional invariance at each subsequent layer. Traditionally, invariance is encoded into computer vision systems using a max or mean pooling operation, or some analogous procedure such as constructing a histogram of local activity levels. In this approach, layers are subdivided into two sub-stages: a first stage including feature detectors that respond selectively to a particular pattern of inputs, and a second, typically smaller, stage of invariant detectors that pool over a small neighborhood of selective feature detectors in the first stage. Such pooling serves to reduce the dimensionality of the overall output of the layer and to introduce a small amount of additional invariance to local translations of the features or objects to be detected.
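A minimal sketch of the conventional pooling stage described above is given below. It assumes non-overlapping square windows over a 2D map of first-stage detector responses; the function name `pool2d` and the toy response maps are hypothetical and chosen only to show the dimensionality reduction and the limited translation invariance.

```python
from statistics import mean
from typing import Callable, List, Sequence

Grid = List[List[float]]


def pool2d(responses: Sequence[Sequence[float]], size: int,
           reduce: Callable[[List[float]], float] = max) -> Grid:
    """Pool non-overlapping size x size windows of a 2D response map.

    `reduce` is the pooling rule: `max` for max pooling, `mean` for mean pooling.
    """
    rows, cols = len(responses), len(responses[0])
    pooled: Grid = []
    for r in range(0, rows - size + 1, size):
        row_out: List[float] = []
        for c in range(0, cols - size + 1, size):
            window = [responses[r + dr][c + dc]
                      for dr in range(size) for dc in range(size)]
            row_out.append(reduce(window))
        pooled.append(row_out)
    return pooled


def single_response(row: int, col: int, n: int = 4) -> Grid:
    """Toy n x n map with one active first-stage detector at (row, col)."""
    return [[1.0 if (r, c) == (row, col) else 0.0 for c in range(n)]
            for r in range(n)]


original = single_response(0, 0)
shifted = single_response(1, 1)   # small shift, stays inside the same window
far = single_response(0, 2)       # larger shift, lands in a different window

# A 4 x 4 map becomes 2 x 2, and the small shift leaves the output unchanged:
# pool2d(original, 2) == pool2d(shifted, 2) == [[1.0, 0.0], [0.0, 0.0]]
# The larger shift does change the pooled output:
# pool2d(far, 2) == [[0.0, 1.0], [0.0, 0.0]]
# Mean pooling over the same windows: pool2d(original, 2, mean) == [[0.25, 0.0], [0.0, 0.0]]
```

The invariance gained this way extends only to translations smaller than the pooling window, which is the "small amount" of invariance referred to above.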
However, an approach based on max or mean pooling has been repeatedly shown to fail when scaling up from model datasets to real-world problems. Additionally, max or mean pooling fails to account for the non-linear transformations that objects typically undergo, such as changes in viewpoint or shading. Nevertheless, some mechanism for incrementally increasing invariance is desirable. A scalable, general scheme for incrementally increasing the invariance of the representations encoded at each layer in a visual processing hierarchy may be beneficial to the construction of computer vision systems for viewpoint-invariant object detection.
Conventional computer vision solutions often perform color/texture analysis or shape/contour analysis. Traditionally, these solutions are viewed and compared independently. A composite approach that combines an improved shape/contour detection algorithm and an improved color/texture analysis algorithm may be more beneficial than either used alone. Also, a single deep, sparse, hierarchical network that analyzes both color/texture and shape/contour features simultaneously may be desirable. Furthermore, conventional systems for solving computer vision problems generally require immense processing and memory resources. Accordingly, an approach that is amenable to hardware that requires less power to run while maintaining computational speed and accuracy may be beneficial.