The exemplary embodiment relates to object detection and finds particular application in connection with an automated system and method for generating an object detector based on a sequence of images.
Algorithms for the automatic analysis of video data have been developed for detecting objects of interest, such as pedestrians and vehicles, in videos. Applications for such methods include long-term tracking of objects (K. Fragkiadaki, et al., “Two-granularity tracking: mediating trajectory and detection graphs for tracking under occlusions,” ECCV (2012)), event retrieval (R. Feris, et al., “Large-scale vehicle detection, indexing, and search in urban surveillance videos,” IEEE Trans. on MM, (2012)), and human behavior understanding (S. Pellegrini, et al., “You'll never walk alone: Modeling social behavior for multi-target tracking,” ICCV (2009)). In one approach to object detection, an “object vs. background” classifier is applied to a sliding window which is moved across all possible locations in an image (see, N. Dalal, et al., “Histograms of oriented gradients for human detection,” CVPR (2005); P. F. Felzenszwalb, et al., “Object detection with discriminatively trained part-based models,” IEEE TPAMI (2010), hereinafter, “Felzenszwalb 2010”). To achieve good accuracy and a low false alarm rate, such a classifier is trained using manually annotated images defining the category of interest. To account for variability within the category, many examples may be needed. Accordingly, object detectors typically exploit large, high-quality, curated training data from a specific source of images. For example, labeled images in selected visual object classes, such as from the PASCAL VOC challenges or ImageNet, may be employed. This form of supervised learning, however, is expensive and may still not provide object detectors that generalize well to a new source of data, as the training examples may not be representative of the target domain of application (see, A. Torralba and A. A. Efros, “Unbiased look at dataset bias,” CVPR (2011)).
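By way of illustration only, the sliding-window approach mentioned above may be sketched as follows. This is a minimal sketch and not the method of any of the cited references: the function names (sliding_windows, mean_intensity_score, detect), the stride and threshold parameters, and the mean-intensity scoring function are all assumptions introduced here. The scoring function is a toy stand-in for a trained “object vs. background” classifier, and the sketch uses a single window size, whereas practical detectors typically also search over scales.

```python
from typing import Callable, Iterator, List, Tuple

Window = Tuple[int, int, int, int]  # (x, y, width, height); hypothetical layout
Image = List[List[float]]           # rows of grayscale pixel values


def sliding_windows(img_w: int, img_h: int, win_w: int, win_h: int,
                    stride: int) -> Iterator[Window]:
    """Yield every window position that fits entirely inside the image."""
    for y in range(0, img_h - win_h + 1, stride):
        for x in range(0, img_w - win_w + 1, stride):
            yield (x, y, win_w, win_h)


def mean_intensity_score(image: Image, window: Window) -> float:
    """Toy stand-in for a trained classifier: mean pixel value in the window."""
    x, y, w, h = window
    total = sum(image[r][c] for r in range(y, y + h) for c in range(x, x + w))
    return total / (w * h)


def detect(image: Image, win_w: int, win_h: int, stride: int,
           score_fn: Callable[[Image, Window], float],
           threshold: float) -> List[Tuple[Window, float]]:
    """Score each window and keep those at or above the threshold, best first."""
    img_h = len(image)
    img_w = len(image[0]) if image else 0
    hits = []
    for window in sliding_windows(img_w, img_h, win_w, win_h, stride):
        score = score_fn(image, window)
        if score >= threshold:
            hits.append((window, score))
    hits.sort(key=lambda hit: hit[1], reverse=True)
    return hits


# A 4x4 image that is dark except for a bright 2x2 block in one corner.
image = [[0.0] * 4 for _ in range(4)]
for r in (2, 3):
    for c in (2, 3):
        image[r][c] = 1.0

detections = detect(image, win_w=2, win_h=2, stride=1,
                    score_fn=mean_intensity_score, threshold=0.9)
```

In this sketch, nine window positions are scored and only the window covering the bright block passes the threshold. In a practical detector, the scoring function would be replaced by a classifier trained on annotated examples, which is where the annotation cost discussed above arises.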
As an example, video cameras could be positioned at different locations to capture video images for identifying objects in the same category, such as cars or pedestrians. Conditions at each of the locations may be different, for example in terms of lighting, type of buildings, and so forth. To address these differences, a specific detection model for the object of interest could be generated for each video camera. Doing so would entail regular collection and labeling of data for each camera and may be cost-prohibitive for a large number of cameras. As an alternative, a generic detector could be learned and employed for all the cameras. However, this approach may lead to suboptimal performance, for example, exhibiting high precision only at very low recall, with only the few best-ranked detections being correct.
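The behavior described above, in which precision is high only at very low recall, can be made concrete with a short calculation. The following is a minimal sketch under stated assumptions: the function name precision_recall_at_k and the example values are hypothetical and are chosen for illustration only, not taken from any measured detector. Precision at rank k is the fraction of the top k detections that are correct, and recall is the fraction of all ground-truth objects found among those k detections.

```python
from typing import List, Tuple


def precision_recall_at_k(ranked_is_correct: List[bool],
                          num_ground_truth: int) -> List[Tuple[float, float]]:
    """Return (precision, recall) after each detection in a score-ranked list."""
    if num_ground_truth <= 0:
        raise ValueError("num_ground_truth must be positive")
    curve = []
    true_positives = 0
    for k, is_correct in enumerate(ranked_is_correct, start=1):
        if is_correct:
            true_positives += 1
        curve.append((true_positives / k, true_positives / num_ground_truth))
    return curve


# Hypothetical ranking: only the two best-ranked detections are correct,
# and the scene contains ten objects of interest in total.
ranked = [True, True, False, False, False]
curve = precision_recall_at_k(ranked, num_ground_truth=10)

precision_at_2, recall_at_2 = curve[1]  # after the top two detections
precision_at_5, recall_at_5 = curve[4]  # after all five detections
```

With these illustrative values, precision is 1.0 after the top two detections but recall is only 0.2. Accepting the lower-ranked detections reduces precision to 0.4 without improving recall.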
There remains a need for a reliable method for generating detection models for objects of interest that are well adapted to different conditions without requiring large quantities of specific training data.