As the volume of digital multimedia collections grows, techniques for efficient and accurate labeling, searching and retrieval of data from those collections have become increasingly important. As a result, tools such as multimedia labeling and classification systems and methods that allow users to accurately and efficiently categorize and sort such data have likewise grown in importance. Unfortunately, previous labeling and classification methods and systems tend to suffer deficiencies in several respects: they can be inaccurate, inefficient and/or incomplete, and are accordingly not sufficiently effective to address the issues associated with voluminous collections of multimedia.
Various methods have been used to improve the labeling of multimedia data. For example, there has been work exploring the use of user feedback to improve the image retrieval experience. In some systems, relevance feedback provided by the user is used to indicate which images in the returned results are relevant or irrelevant to the user's search target. Such feedback can be indicated explicitly (by marking images as relevant or irrelevant) or implicitly (by tracking specific images viewed by the user). Given such feedback information, the initial query can be modified. Alternatively, the underlying features and distance metrics used in representing and matching images can be refined using the relevance feedback information. Ultimately, though, manual labeling of multimedia data, such as images and video, by humans can be time-consuming and inefficient, particularly when applied to large data libraries. Some solutions to the problems described above are disclosed in PCT Patent Application No. PCT/US09/069,237, filed on Dec. 22, 2009, the entirety of which is incorporated herein by reference.
The human brain is an exceptionally powerful visual information processing system. Humans can recognize objects at a glance, under varying poses, illuminations and scales, and are able to rapidly learn and recognize new configurations of objects and exploit relevant context even in highly cluttered scenes. While human visual systems can recognize a wide range of targets under challenging conditions, they generally have limited throughput. Human visual information processing is carried out by neurons, which are extremely slow relative to state-of-the-art digital electronics: the firing frequency of a neuron is measured in hertz, whereas modern digital computers have transistors that switch at gigahertz speeds. Though there is some debate over whether the fundamental processing unit in the nervous system is the neuron or an ensemble of neurons, it is nonetheless widely believed that the human visual system derives its robust and general-purpose processing capabilities not from the speed of its individual processing elements but from its massively parallel architecture.
Computer vision systems present their own unique benefits and potential issues. While computer vision systems can process images at high speed, they often suffer from inadequate recognition accuracy for general target classes. Since the early 1960s there have been substantial efforts directed at creating computer vision systems that possess the same information processing capabilities as the human visual system. These efforts have yielded some successes, though mostly for highly constrained problems. One of the challenges in prior research has been developing a machine that is capable of general-purpose vision and that mimics human vision. Specifically, an important property of the human visual system is its ability to learn and exploit invariances.