Extant image recognition algorithms, such as classifiers, cascading classifiers, neural networks, convolutional neural networks, or the like, achieve high accuracy (e.g., comparable to humans) when images contain a single object that is large and centered. However, the accuracy of extant algorithms decreases when multiple objects are present within the image. This is due, in part, to conflicting classifications when different objects (such as a vehicle and a person, a kitten and a ball, or the like) are present within the image. This is also due, in part, to interference between features extracted from the multiple objects within the same image.
One extant solution for object recognition is the use of bounding box algorithms. Such algorithms, which often comprise neural networks or convolutional neural networks, may detect a plurality of objects within an image and assign one or more possible boxes to the detected objects, the boxes defining areas of the image corresponding to the detected objects. However, such algorithms usually generate many (e.g., on the order of 1000 or more) bounding boxes. Extant techniques, such as non-maximum suppression, reduce the number of bounding boxes by discarding boxes that heavily overlap a higher-scoring box. However, such techniques often do not account for false positives. Moreover, such techniques often do not allow for the selection of one or more classes of interest and/or regions of interest within the image.
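By way of illustration only, the box-reduction step described above, commonly known as non-maximum suppression, may be sketched as follows. This is a minimal sketch under stated assumptions rather than the implementation of any particular detector: the function names `iou` and `non_max_suppression`, the corner-coordinate box format, and the overlap threshold of 0.5 are all hypothetical choices made for the example.

```python
from typing import List, Tuple

# Assumed box format for this sketch: (x1, y1, x2, y2) corner coordinates.
Box = Tuple[float, float, float, float]


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def non_max_suppression(
    boxes: List[Box], scores: List[float], iou_threshold: float = 0.5
) -> List[int]:
    """Return indices of the boxes kept, highest score first.

    A box is kept only if it does not overlap an already kept box
    by more than iou_threshold (an assumed value, not a standard one).
    """
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep: List[int] = []
    for i in order:
        if all(iou(boxes[i], boxes[k]) <= iou_threshold for k in keep):
            keep.append(i)
    return keep


# Two heavily overlapping boxes and one separate box.
boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (50, 50, 60, 60)]
scores = [0.9, 0.8, 0.7]
print(non_max_suppression(boxes, scores))  # prints [0, 2]
```

In this sketch, a box is retained solely on the basis of its score and its overlap with other boxes. A high-scoring false positive is therefore retained, and no class of interest or region of interest enters the selection.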
Moreover, extant bounding box algorithms usually classify objects enclosed by bounding boxes into only one of a limited number of classes, such as “person,” “vehicle,” “sign,” or the like. More detailed object recognition (such as identifying a make and model of a detected vehicle, identifying an architectural style of a detected building, or other fine-grained classification problems) often requires additional processing. For example, additional convolutional neural networks that recognize objects based on feature analysis may be used. However, such recognition is particularly error prone when a plurality of possible objects is included in the same image. Moreover, such recognition is computationally costly and, therefore, cannot effectively be performed on the large number of possible bounding boxes detected within the image.