Detecting objects is a fundamental step in many image-processing and computer vision applications. FIG. 1 illustrates an example of an image 100 containing objects of interest 105 (e.g., a person) and 110 (e.g., a vehicle) detected utilizing conventional object detection techniques. The prior art object detection approach illustrated in FIG. 1 involves receiving the image 100 containing the objects of interest 105 and 110 and generating, as a desired output, an annotating object such as a bounding box 115 about the object location. The task (i.e., object type/instance to locate) can be specified by a training data set, which can include data indicative of images annotated with locations of a relevant object in the form of the bounding box 115 and which thereby provides examples of desired input and output pairs.
Detecting particular portions or parts of the object permits further interpretation of the image because simply locating the object may not be sufficient for certain applications. For example, the location of specific body parts (e.g., head, torso, arms, etc.) can be detected to estimate a particular pose or to infer an action performed by a person. Similarly, different sides or parts of the vehicle (e.g., front, rear, left, right, license plate, wheels, etc.) can be detected to interpret the position of the vehicle.
FIG. 2 illustrates an example of an image 150 of the object/person 105 and the object/vehicle 110 detected using another object part detection technique. The object parts with respect to the image 150 can be determined utilizing a fixed set of bounding boxes 160 corresponding to different pre-defined parts. The detection of object parts in this fashion provides more detailed information and enables deeper reasoning regarding a scene depicted in the image 150 than knowing the object location alone.
Data-driven detection (DDD) is another approach for detecting objects in an image by computing the similarity between the input image and all the images in an annotated training set. The location of the object can be predicted as a weighted average of the annotated object locations in the first few nearest neighbors. The similarity can be a standard similarity between images, or can be obtained with a similarity learning algorithm. The detection can be formulated as a query to a database, and the similarities can be computed from a global image descriptor so that the prediction employs information from the whole image (i.e., parts of the image different from the object might also be employed as cues to predict the object). When metric learning is used, the obtained similarity between the images is not a generic similarity but is tailored to the end task of detection.
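By way of non-limiting illustration, the weighted-average prediction described above can be sketched as follows. The sketch is a minimal example under stated assumptions and is not taken from any particular implementation: the global image descriptors are assumed to be plain lists of numbers, cosine similarity stands in for whichever standard or learned similarity is actually used, and the names cosine_similarity, predict_box, and the parameter k are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two descriptors (assumed stand-in metric)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def predict_box(query_descriptor, training_set, k=3):
    """Predict a box as the similarity-weighted average of the boxes
    annotated on the k most similar training images.

    training_set is a list of (descriptor, box) pairs, where box is
    (x_min, y_min, x_max, y_max). All names here are hypothetical.
    """
    if not training_set:
        raise ValueError("training_set must not be empty")

    # Score every training image against the query (the "database query").
    scored = [
        (cosine_similarity(query_descriptor, descriptor), box)
        for descriptor, box in training_set
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    neighbors = scored[:k]

    # Use non-negative similarities as weights; fall back to a plain
    # average if every neighbor has zero weight.
    weights = [max(score, 0.0) for score, _ in neighbors]
    total = sum(weights)
    if total == 0:
        weights = [1.0] * len(neighbors)
        total = float(len(neighbors))

    return tuple(
        sum(w * box[i] for w, (_, box) in zip(weights, neighbors)) / total
        for i in range(4)
    )
```

In this sketch, each training image contributes exactly one annotated box, which reflects the single-bounding-box annotation of the DDD approach described above.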
The problem associated with such DDD techniques is that the images are annotated with a single bounding box and the techniques do not take into account the global consistency between parts. Additionally, DDD is less efficient because a sliding window approach has to be run once for each part, and a high-level consistency model is usually employed in order to ensure the feasibility of the part combinations. Furthermore, DDD is inaccurate and inefficient because the approach involves evaluation of a large set of sub-regions and does not leverage context, the classification being based only on information within the sub-region. Such methods for detecting parts of the objects are typically very costly.
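By way of non-limiting illustration, the cost of running a sliding window once per part can be estimated with the short sketch below. The image size, window size, stride, and number of parts are assumed values chosen only for the arithmetic, and the function names count_windows and sliding_window_evaluations are hypothetical.

```python
def count_windows(image_w, image_h, win_w, win_h, stride):
    """Number of window positions for one window size at a fixed stride."""
    if stride <= 0:
        raise ValueError("stride must be positive")
    if win_w > image_w or win_h > image_h:
        return 0
    cols = (image_w - win_w) // stride + 1
    rows = (image_h - win_h) // stride + 1
    return cols * rows


def sliding_window_evaluations(image_size, window_sizes, stride, num_parts):
    """Total sub-region evaluations when the scan is repeated for each part."""
    image_w, image_h = image_size
    per_pass = sum(
        count_windows(image_w, image_h, win_w, win_h, stride)
        for win_w, win_h in window_sizes
    )
    return per_pass * num_parts


# Assumed example: a 640x480 image, one 64x128 window, stride 8, 5 parts.
total = sliding_window_evaluations((640, 480), [(64, 128)], 8, 5)
print(total)  # 73 columns * 45 rows = 3285 windows per part, 16425 in total
```

Under these assumed values, a single window size already requires several thousand sub-region classifications per part, and the total grows linearly with the number of parts and of window sizes.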
Based on the foregoing, it is believed that a need exists for improved methods and systems for predicting an object part location based on extended data-driven detection, as will be described in greater detail herein.