Contemporary image recognition technologies are mostly classifier-based. Typically, positive and negative samples are gathered for each object/scene category to train a classifier. At running time, the classifier is applied to an input image to recognize the image.
However, the training does not scale with respect to the number of object/scene categories. Labeling good training samples for many categories is very tedious work for humans, while training many classifiers on large amounts of data is computationally expensive.
Further, contemporary image recognition technologies do not cover both category and instance recognition. For example, in current face recognizers, an image is first processed with a people/face detector to find people, and then a face recognition system is trained to differentiate the people. Likewise, a car detector is trained to find cars, and then a car recognition system is trained to identify manufacture, model and make.
Still further, contemporary image recognition systems do not fit the dynamic nature of image recognition very well. For example, the number of object/instance categories continues to grow over time; however when a new category is added, the entire system needs re-training, which can be very time-consuming.