People can quickly identify and distinguish between a seemingly limitless number of objects with little effort. Even when the viewpoint, size, or scale of an object in an image varies, people are typically able to recognize the object quickly. Individuals can even recognize objects in images when the objects are partially obstructed from view. However, differing viewpoints, sizes, and scales, as well as partial obstruction of objects in images, have proven to be difficult and computationally expensive for computer recognition systems to handle.
A technique that has been employed for object detection in images consists of analyzing pieces of an image by running object filters across the image in a sliding-window fashion, computing the dot product of each object filter with the underlying image at every location in the image, and using the largest value, or set of values, above a particular threshold for object detection. In addition, images to be analyzed for object detection can be of poor quality, and are rarely captured at a uniform size, scale, or viewpoint. As a consequence, computer recognition systems often have to learn object filters for different viewpoints, and convolve the object filters on an image pyramid during object detection.
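The sliding-window scoring described above can be illustrated with a minimal sketch. It assumes a single-channel image and a single object filter held as NumPy arrays, and it omits the image pyramid and multiple viewpoints. The function names `sliding_window_scores` and `detect`, the toy image, and the threshold value are hypothetical and chosen only for illustration.

```python
import numpy as np

def sliding_window_scores(image, obj_filter):
    """Dot product of obj_filter with every filter-sized window of image."""
    ih, iw = image.shape
    fh, fw = obj_filter.shape
    scores = np.empty((ih - fh + 1, iw - fw + 1))
    for y in range(scores.shape[0]):
        for x in range(scores.shape[1]):
            window = image[y:y + fh, x:x + fw]
            # Element-wise multiply and sum: the dot product of the
            # flattened window with the flattened filter.
            scores[y, x] = np.sum(window * obj_filter)
    return scores

def detect(image, obj_filter, threshold):
    """Return (row, col, score) of the best window, or None if the best
    score does not exceed the threshold."""
    scores = sliding_window_scores(image, obj_filter)
    y, x = np.unravel_index(np.argmax(scores), scores.shape)
    best = scores[y, x]
    return (int(y), int(x), float(best)) if best > threshold else None

# Toy data (assumed): a 6x6 image containing one bright 2x2 patch.
image = np.zeros((6, 6))
image[2:4, 3:5] = 1.0
obj_filter = np.ones((2, 2))

result = detect(image, obj_filter, threshold=3.0)
```

In this toy case the best-scoring window coincides with the bright patch. A practical system would repeat the same scoring for every learned filter and at every level of the image pyramid.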
Typically, anywhere from several hundred to several thousand object filters can be used for object detection or localization. It can be readily appreciated that computing the dot product of up to several thousand object filters for multiple viewpoints at every location on an underlying image can require significant time and computational resources. The amount of resources required by such computer recognition systems can be limiting. For example, such systems may have difficulty scaling to multiple object categories.
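To give a rough sense of the cost, the following back-of-the-envelope sketch counts the multiplications needed to score every window position at a single scale. The image size (640 by 480), the filter size (8 by 8), and the count of 1,000 filters are assumed values picked for illustration. They are not taken from any particular system, and the helper name `count_multiplies` is hypothetical.

```python
def count_multiplies(image_h, image_w, filter_h, filter_w, num_filters):
    """Multiplications needed to take the dot product of every filter
    with every filter-sized window of a single-scale image."""
    positions = (image_h - filter_h + 1) * (image_w - filter_w + 1)
    per_position = filter_h * filter_w
    return positions * per_position * num_filters

# Assumed, illustrative sizes only.
total = count_multiplies(image_h=480, image_w=640,
                         filter_h=8, filter_w=8,
                         num_filters=1000)
print(f"{total:,} multiplications")
```

Under these assumptions the count is on the order of tens of billions of multiplications for one image at one scale, before any additional pyramid levels or viewpoints are considered.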