1. Field of the Invention
This invention pertains generally to visual recognition, namely a collection of inference tasks performed by exploiting imaging data.
2. Description of Related Art
Currently, the majority of new mobile phones are equipped with a camera, which allows users to capture, process, record and transmit still images and video. Beyond this traditional use of a camera, integration with several sensors in mobile phones enables location-based services, which retrieve specialized information based on location and orientation input from the Global Positioning System (GPS) and/or gyrometers and/or accelerometers. Some applications overlay this information on top of live video frames from the camera in the display, giving users an augmented reality experience. Displays may include built-in screens, monitors, or other displays, including remote displays such as goggle-mounted or other wearable displays.
Although these location-based services have greatly increased the public usability of mobile platforms, they do not exploit the vision capability of such devices, and they have no understanding of, or relation to, the actual scene and objects that the user is targeting with the device. This not only reduces the quality of the visual registration of the information, but also limits the capability for interaction between the user and the scene objects using that information.
One primary difficulty with recognizing objects and scenes from images is the large nuisance variability that the data can exhibit, depending on the vantage point (position and orientation of the object or scene relative to the sensor), visibility conditions (occlusions), illumination, and other variations under which the object is seen, even if the object itself does not exhibit intrinsic variability. In addition, intra-class variability can add to the complexity of the task. It is known that nuisance variability comprises almost the entirety of the variability in the data, as what remains after viewpoint and contrast variability is factored out is supported on a zero-measure subset of the image domain. The most common approach to dealing with nuisance variability is to eliminate it or reduce its effects by pre-processing the data to obtain “insensitive” and yet “distinctive” features, and to “learn away” the residual nuisance variability using generic tools from Machine Learning, often using a training set of manually labeled images. Both practices are poorly grounded in principle for several reasons, including: (1) pre-processing does not generally improve classification performance, as established by the Data Processing Inequality; and (2) training a classifier using collections of isolated snapshots (single images) of physically different scenes or objects brings into question the fact that there is a scene from which images are generated, and limits the classifier to learning generic regularities in images rather than specific and distinguishing features of the scene or objects. This is because the complexity of the scene is infinitely larger than the complexity of the images. Indeed, it can be shown that, when a visual recognition system is built and trained using a collection of passively gathered independent snapshots, not only is the worst-case error that can be guaranteed at chance level (i.e., the expected probability of error, a.k.a. risk, is the same as that given by the prior probability), but so is the average case. This is not the case, however, when the training data consists of multiple purposefully captured images of the same scene during an active exploration phase.
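By way of illustration only, the Data Processing Inequality referred to above can be checked numerically on a small discrete example. The following Python sketch is not part of the described related art; the joint distribution, the function names (mutual_information, push_forward, coarse_feature) and the choice of pre-processing are all hypothetical and chosen purely for illustration. It computes the mutual information between a class label and a raw measurement, and between the same label and a feature obtained by deterministic pre-processing of that measurement; the latter cannot exceed the former.

```python
import math


def mutual_information(joint):
    """Mutual information in bits of a joint distribution given as {(x, y): p}."""
    px, py = {}, {}
    for (x, y), p in joint.items():
        px[x] = px.get(x, 0.0) + p
        py[y] = py.get(y, 0.0) + p
    return sum(
        p * math.log2(p / (px[x] * py[y]))
        for (x, y), p in joint.items()
        if p > 0
    )


def push_forward(joint, f):
    """Joint distribution of (x, f(y)) after deterministic pre-processing f."""
    out = {}
    for (x, y), p in joint.items():
        key = (x, f(y))
        out[key] = out.get(key, 0.0) + p
    return out


# Hypothetical toy data: class label x in {0, 1}, raw measurement y in {0, 1, 2, 3}.
joint = {
    (0, 0): 0.30, (0, 1): 0.10, (0, 2): 0.05, (0, 3): 0.05,
    (1, 0): 0.05, (1, 1): 0.05, (1, 2): 0.10, (1, 3): 0.30,
}


def coarse_feature(y):
    # Illustrative (assumed) pre-processing: keep only the parity of the measurement.
    return y % 2


i_raw = mutual_information(joint)
i_feature = mutual_information(push_forward(joint, coarse_feature))
i_identity = mutual_information(push_forward(joint, lambda y: y))

print(f"I(label; raw measurement)     = {i_raw:.4f} bits")
print(f"I(label; coarse feature)      = {i_feature:.4f} bits")
print(f"I(label; invertible feature)  = {i_identity:.4f} bits")
```

In this toy setting the coarse feature carries strictly less information about the label than the raw measurement, while an invertible transformation carries the same amount. This is consistent with the statement that pre-processing does not, in general, improve achievable classification performance.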