Generally, cameras acquire more images and videos than can be viewed and analyzed. Therefore, there is an increasing need for computer-based methods and systems that automate the analysis of images and videos. A fundamental task in automated image and video analysis is the identification and localization of different classes of objects in scenes acquired by cameras.
The most common approach for object classification uses a scanning window, where a classifier is applied to the pixels in the window as the window is scanned over the image. Typically, the window is rectangular and of a fixed size. The classifier indicates whether or not the object is in the window. If necessary, the image can be resized so that the object fits the window. The resizing is done repeatedly until the resized image matches or is smaller than the window. This brute force search is repeated over all locations and sizes. The method can be repeated for each class of objects.
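The scanning-window search described above can be sketched as follows. This is a minimal illustration, not a specific published detector: the function names `scan_windows` and `detect`, the 64x128 window, the stride of 8 pixels, the scale step of 1.25, and the `classify` callback are all assumptions introduced here.

```python
def scan_windows(width, height, win_w, win_h, stride=8, scale=1.25):
    """Yield (level_scale, x, y) for each window position on each pyramid level.

    The image is conceptually shrunk by `scale` per level; scanning stops
    once the resized image is smaller than the window in either dimension.
    """
    level_scale = 1.0
    w, h = width, height
    while w >= win_w and h >= win_h:
        for y in range(0, h - win_h + 1, stride):
            for x in range(0, w - win_w + 1, stride):
                yield level_scale, x, y
        level_scale *= scale
        w = int(width / level_scale)
        h = int(height / level_scale)


def detect(width, height, classify, win_w=64, win_h=128, threshold=0.0):
    """Return (x, y, w, h, score) boxes in original-image coordinates.

    `classify(level_scale, x, y)` is a hypothetical callback that scores the
    fixed-size window at (x, y) on the image resized by 1 / level_scale.
    """
    detections = []
    for s, x, y in scan_windows(width, height, win_w, win_h):
        score = classify(s, x, y)
        if score > threshold:
            # Map the window back to the original image resolution.
            box = (int(x * s), int(y * s), int(win_w * s), int(win_h * s), score)
            detections.append(box)
    return detections
```

The number of classifier evaluations grows with the number of locations times the number of pyramid levels, and the whole search is run once per object class, which is why the search is called brute force.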
Those methods effectively use only the appearance information available from the pixels in the window. However, such approaches fail to utilize structural information based on the relative appearances and layouts of different objects in the scene with respect to each other, or priors based on an overall object or scene structure.
Several methods are based on context information. Those methods use 3D scene structure to infer a likelihood that an object is located in the image. For example, to detect people using a camera arranged on a vehicle, knowledge of the locations of the road, sidewalk, and buildings in the image plane can be used to generate per-location object likelihoods. Similarly, a rough 3D structure of the scene can be inferred from images using camera geometry and image cues, and that structure can be used to generate object likelihoods. The contextual object likelihoods can then be combined with classifier scores to improve detection performance. However, obtaining scene structure is a difficult task, which limits the usefulness of those methods.
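One plausible way to combine a contextual likelihood with a classifier score is sketched below. The text does not specify the combination rule; adding the log-odds of the context likelihood to the score is an assumption made here for illustration, as are the function name `combine_with_context` and the `weight` parameter.

```python
import math


def combine_with_context(classifier_score, context_likelihood, weight=1.0):
    """Shift a classifier score by the log-odds of a per-location likelihood.

    Assumption: the classifier score behaves like a log-odds value, so the
    two terms can simply be added. `context_likelihood` is a probability
    in [0, 1] that the object occurs at this image location.
    """
    eps = 1e-6  # keep the logarithm finite at likelihoods of 0 or 1
    p = min(max(context_likelihood, eps), 1.0 - eps)
    return classifier_score + weight * math.log(p / (1.0 - p))
```

Under this rule, a window on the sidewalk (high likelihood for a person) has its score raised, while a window in the sky (low likelihood) has its score lowered, and a likelihood of 0.5 leaves the score unchanged.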
Alternative methods search for parts of the object in the image and combine the parts to detect and localize the object. Those methods combine appearance-based classification of object parts and geometric relations into a single classifier. Due to the high computational complexity of classifying appearance and geometry jointly, only simple geometric relationships, such as the distances between pairs of object parts, can be used.
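As a rough illustration of a part-based score that uses only pairwise distances, consider the sketch below. The scoring rule (sum of part appearance scores minus a linear penalty on deviations from expected pairwise distances), the function name `configuration_score`, and the `penalty` weight are assumptions introduced here and are not taken from any particular method.

```python
import math
from itertools import combinations


def configuration_score(part_scores, part_locations, expected_dist, penalty=0.1):
    """Score one hypothesized arrangement of object parts.

    part_scores:    {part_name: appearance score of that part's classifier}
    part_locations: {part_name: (x, y) location of the part in the image}
    expected_dist:  {(name_a, name_b): expected distance}, keyed by name
                    pairs in sorted order
    """
    total = sum(part_scores.values())
    for a, b in combinations(sorted(part_locations), 2):
        (xa, ya), (xb, yb) = part_locations[a], part_locations[b]
        d = math.hypot(xa - xb, ya - yb)
        # Penalize deviation from the expected distance for this pair.
        total -= penalty * abs(d - expected_dist[(a, b)])
    return total
```

Even this simple geometric term requires one evaluation per pair of parts for every candidate configuration, which indicates why richer geometric relations quickly become expensive.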
Another method uses simple features based on spatial relations between different objects, such as “above,” “below,” “next-to,” and “on-top.” The relations are combined with appearance features in a multi-class classifier.
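Spatial relation features of this kind can be computed from bounding boxes, as in the following sketch. The exact definitions of the four relations, the `gap` tolerance, and the function name `spatial_relations` are assumptions chosen here for illustration. Boxes are `(x, y, w, h)` tuples with the y axis pointing down.

```python
def spatial_relations(a, b, gap=10):
    """Return binary spatial relations of box `a` with respect to box `b`."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    # Signed overlap along each axis; a negative value is the size of the gap.
    overlap_x = min(ax + aw, bx + bw) - max(ax, bx)
    overlap_y = min(ay + ah, by + bh) - max(ay, by)
    above = ay + ah <= by          # a ends before b starts, vertically
    below = by + bh <= ay
    # Side by side: share rows, and are separated horizontally by at most `gap`.
    next_to = overlap_y > 0 and -gap <= overlap_x <= 0
    # Resting on b: above it, sharing columns, with at most `gap` between them.
    on_top = above and overlap_x > 0 and by - (ay + ah) <= gap
    return {"above": above, "below": below, "next-to": next_to, "on-top": on_top}
```

The resulting binary values could be appended to an appearance feature vector before being passed to a multi-class classifier.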