Object detection plays a fundamental role in intelligent video surveillance systems. The ability to automatically search for objects of interest in large video databases or in real-time video streams often involves, as a pre-requisite, the detection and localization of objects in the video frames.
Traditional surveillance systems usually apply background modeling techniques [(C. Stauffer and W. Grimson, Adaptive background mixture models for real-time tracking, CVPR, 1998, 1); (Y. Tian, M. Lu, and A. Hampapur, Robust and efficient foreground analysis for real-time video surveillance, CVPR, 2005, 1)] for detecting moving objects in the scene, which are efficient and work reasonably well in low-activity scenarios. However, the traditional surveillance systems are limited in their ability to handle typical urban conditions such as crowded scenes and environmental changes like rain, snow, reflections, and shadows. In crowded scenarios, multiple objects are frequently merged into a single motion blob, thereby compromising higher-level tasks such as object classification and extraction of attributes.
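By way of illustration only, the per-pixel idea behind background modeling can be sketched as follows. This is a deliberately simplified, single-Gaussian running model and not the mixture model of Stauffer and Grimson; the class name `RunningGaussianPixel` and the parameters `learning_rate`, `k`, and `min_variance` are hypothetical choices made for this sketch.

```python
import math


class RunningGaussianPixel:
    """Single-Gaussian background model for one pixel (simplified sketch)."""

    def __init__(self, initial, variance=25.0, learning_rate=0.05, k=2.5,
                 min_variance=4.0):
        self.mean = float(initial)
        self.variance = float(variance)
        self.learning_rate = learning_rate  # how fast the model follows the scene
        self.k = k                          # foreground if beyond k std deviations
        self.min_variance = min_variance    # floor so the model never collapses

    def update(self, value):
        """Classify `value`, then adapt the model only on background samples."""
        diff = value - self.mean
        is_foreground = abs(diff) > self.k * math.sqrt(self.variance)
        if not is_foreground:
            lr = self.learning_rate
            self.mean += lr * diff
            self.variance = max(self.min_variance,
                                (1.0 - lr) * self.variance + lr * diff * diff)
        return is_foreground


pixel = RunningGaussianPixel(initial=100.0)
history = [pixel.update(v) for v in (101, 99, 100, 102, 180, 100)]
```

In this sketch only the sample at brightness 180 is flagged as foreground. Because each pixel is classified independently, two adjacent moving objects produce one connected foreground region, which is consistent with the blob-merging limitation noted above.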
Appearance-based object detectors [(N. Dalal and B. Triggs, Histograms of oriented gradients for human detection, CVPR, 2005, 1); (P. Felzenszwalb, R. Girshick, D. McAllester, and D. Ramanan, Object detection with discriminatively trained part based models, IEEE Transactions on PAMI, 2010, 1)] arise as a promising direction to deal with these challenging conditions. Specifically for applications that require real-time processing, cascade detectors based on Haar-like features have been widely used for detection of faces [P. Viola and M. Jones, Robust Real-time Object Detection, International Journal of Computer Vision, 2004, 1, 2, 3, 4], pedestrians [P. Viola, M. Jones, and D. Snow, Detecting pedestrians using patterns of motion and appearance, ICCV, 2003, 1] and vehicles [R. S. Feris, B. Siddiquie, Y. Zhai, J. Petterson, L. Brown, and S. Pankanti, Attribute-based vehicle search in crowded surveillance videos, ICMR, 2011, 1]. Although significant progress has been made in this area, state-of-the-art object detectors are still not able to generalize well to different camera angles and lighting conditions. As real deployments commonly involve a large number of surveillance cameras, training per-camera detectors is not feasible due to the annotation cost. Online adaptation methods [(V. Jain and E. Learned-Miller, Online domain adaptation of a pre-trained cascade of classifiers, CVPR, 2011, 1, 2); (S. Pan, I. Tsang, J. Kwok, and Q. Yang, Domain adaptation via transfer component analysis, IEEE Transactions on Neural Networks, 2011, 1, 2)] have been proposed to adapt a general detector to specific domains, but these methods usually require a small number of manual labels from the target domain. Most methods rely on adaptation of weights only, while keeping the same features and the same computational complexity as the original detector.
Various methods have been proposed for object detection in images and videos. Deformable part-based models [P. Felzenszwalb, R. Girshick, D. McAllester, and D. Ramanan, Object detection with discriminatively trained part based models, IEEE Transactions on PAMI, 2010, 1], classifiers based on histograms of oriented gradient features [N. Dalal and B. Triggs, Histograms of oriented gradients for human detection, CVPR, 2005, 1], and convolutional neural networks [Y. LeCun, K. Kavukcuoglu, and C. Farabet, Convolutional networks and applications in vision, ISCAS, 2010, 1] are examples of successful approaches that have achieved state-of-the-art results on several standard datasets. In general, however, these methods run at less than 15 frames per second on conventional machines and therefore may not be applicable to surveillance applications that require processing many video channels per server.
Cascade detectors [(P. Felzenszwalb, R. Girshick, and D. McAllester, Cascade object detection with deformable part models, CVPR, 2010, 2); (P. Viola and M. Jones, Robust Real-time Object Detection, International Journal of Computer Vision, 2004, 1, 2, 3, 4)] have been commonly adopted for efficient processing. Viola and Jones [P. Viola and M. Jones, Robust Real-time Object Detection, International Journal of Computer Vision, 2004, 1, 2, 3, 4] introduced a robust and efficient detector based on a cascade of Adaboost classifiers, using fast-to-compute Haar-like features. Many variants of this algorithm, including different boosting models and different features, have been proposed in the past few years. Confidence measures for cascade detectors, however, have not been well studied.
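For illustration, the two ingredients named above, Haar-like features computed from an integral image and early rejection across cascade stages, may be sketched as follows. This is a minimal sketch under stated assumptions: each stage here thresholds a single hand-picked feature, whereas each stage of the Viola and Jones detector is a boosted combination of many weak classifiers, and all function names and thresholds are hypothetical.

```python
def integral_image(img):
    """Summed-area table with one extra row and column of zeros."""
    h, w = len(img), len(img[0])
    ii = [[0] * (w + 1) for _ in range(h + 1)]
    for y in range(h):
        row_sum = 0
        for x in range(w):
            row_sum += img[y][x]
            ii[y + 1][x + 1] = ii[y][x + 1] + row_sum
    return ii


def rect_sum(ii, x, y, w, h):
    """Sum of pixels in the w-by-h rectangle at (x, y), using four lookups."""
    return ii[y + h][x + w] - ii[y][x + w] - ii[y + h][x] + ii[y][x]


def haar_left_minus_right(ii, x, y, w, h):
    """Two-rectangle Haar-like feature: left half minus right half."""
    half = w // 2
    return rect_sum(ii, x, y, half, h) - rect_sum(ii, x + half, y, half, h)


def run_cascade(ii, stages):
    """Return (accepted, stages_evaluated); stop at the first failing stage."""
    for index, (feature, threshold) in enumerate(stages, start=1):
        if feature(ii) < threshold:
            return False, index
    return True, len(stages)


# Hypothetical two-stage cascade over a 4x2 window.
stages = [
    (lambda ii: haar_left_minus_right(ii, 0, 0, 4, 2), 10),
    (lambda ii: rect_sum(ii, 0, 0, 4, 2), 30),
]
edge_window = integral_image([[9, 9, 1, 1], [9, 9, 1, 1]])
flat_window = integral_image([[5, 5, 5, 5], [5, 5, 5, 5]])
edge_result = run_cascade(edge_window, stages)
flat_result = run_cascade(flat_window, stages)
```

The flat window is rejected after a single stage, which is the source of the efficiency: most candidate windows in a frame are background and leave the cascade early. The boolean output also shows why a confidence measure is not readily available from a cascade without further work.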
Co-training techniques [(O. Javed, S. Ali, and M. Shah, Online detection and classification of moving objects using progressively improving detectors, CVPR, 2005, 2); (P. Roth, H. Grabner, D. Skocaj, H. Bischof, and A. Leonardis, On-line conservative learning for person detection, PETS Workshop, 2005, 2)] have been applied to boost the performance of object detection in specific domains, by training separate classifiers on different views of the data. The confidently labeled samples from the first classifier are used to augment the training set of the second classifier and vice versa. The underlying assumption of co-training is that the two views of the data are statistically independent, which may be violated especially when the features are extracted from a single modality.
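The exchange of confidently labeled samples between the two classifiers can be sketched as follows. This is an illustrative toy, not the method of any cited work: each view is a single number, each classifier is a hypothetical nearest-centroid rule, and the confidence `margin` is an assumed parameter.

```python
def train_centroids(samples):
    """samples: list of (value, label) with labels 0/1; returns per-class means."""
    means = {}
    for label in (0, 1):
        values = [v for v, lab in samples if lab == label]
        means[label] = sum(values) / len(values)
    return means


def predict(means, value):
    """Return (label, confidence); confidence is the gap between distances."""
    d0 = abs(value - means[0])
    d1 = abs(value - means[1])
    return (0 if d0 <= d1 else 1), abs(d0 - d1)


def co_train(labeled, unlabeled, margin, rounds=3):
    """labeled: [((view_a, view_b), label)]; unlabeled: [(view_a, view_b)]."""
    train_a = [(a, y) for (a, _), y in labeled]
    train_b = [(b, y) for (_, b), y in labeled]
    pool = list(unlabeled)
    for _ in range(rounds):
        model_a = train_centroids(train_a)
        model_b = train_centroids(train_b)
        remaining = []
        for a, b in pool:
            label_a, conf_a = predict(model_a, a)
            label_b, conf_b = predict(model_b, b)
            used = False
            if conf_a >= margin:   # classifier A teaches classifier B
                train_b.append((b, label_a))
                used = True
            if conf_b >= margin:   # classifier B teaches classifier A
                train_a.append((a, label_b))
                used = True
            if not used:
                remaining.append((a, b))
        pool = remaining
    return train_centroids(train_a), train_centroids(train_b), pool


seed = [((0.0, 0.0), 0), ((10.0, 10.0), 1)]
unlabeled = [(1.0, 4.8), (9.0, 5.2), (5.0, 5.0)]
model_a, model_b, leftover = co_train(seed, unlabeled, margin=4.0)
```

In this toy run the first view is confident about two of the unlabeled samples and passes its labels to the second view's training set, while the ambiguous sample stays unlabeled. If the two views were strongly correlated, a mistake made confidently in one view would tend to be reinforced rather than corrected by the other, which is the concern with the independence assumption.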
Several on-line adaptation methods [(V. Jain and E. Learned-Miller, Online domain adaptation of a pre-trained cascade of classifiers, CVPR, 2011, 1, 2); (S. Pan, I. Tsang, J. Kwok, and Q. Yang, Domain adaptation via transfer component analysis, IEEE Transactions on Neural Networks, 2011, 1, 2)] have been proposed to adapt general detectors to specific domains. Usually these techniques either require a few manual labels from the target domain or suffer from inaccuracies in capturing online data to correctly update the classifier. With few exceptions [H. Grabner and H. Bischof, Online boosting and vision, CVPR, 2006, 2], only feature weights are adapted and not the features themselves. As a result, the adapted classifier is generally at least as expensive as the original detector. Online learning has also been applied to improve tracking [(H. Grabner, C. Leistner, and H. Bischof, Semi-supervised on-line boosting for robust tracking, ECCV, 2008, 2); (S. Avidan, Ensemble tracking, IEEE Transactions on PAMI, 2007, 2)], with the assumption that an object appears in one location only.
Feris et al. [R. S. Feris, J. Petterson, B. Siddiquie, L. Brown, and S. Pankanti, Large-scale vehicle detection in challenging urban surveillance environments, WACV, 2011, 2] proposed a technique to automatically collect training data from the target domain and learn a classifier. However, the technique requires user input to specify regions of interest and attributes such as motion direction and acceptable Δs of the object of interest. More recently, Siddiquie et al. [B. Siddiquie, R. Feris, A. Datta, and L. Davis, Unsupervised model selection for view-invariant object detection in surveillance environments, ICPR, 2012, 2] proposed a method that takes into account scene geometry constraints to transfer knowledge from source domains to target domains. This approach can even achieve better performance than a detector trained with samples from the target domain, but requires a large battery of source domain detectors covering different poses and lighting conditions.
There are existing algorithms to distinguish foreground objects from background, based on brightness, color, and features beyond the visible spectrum such as infrared. These algorithms typically rely on thresholds, for example, a brightness threshold, to indicate the presence of a foreground object. The thresholds may be manually adjusted by a human operator to account for variations in lighting, camera response, etc., to ensure that a vehicle's image surpasses the applicable thresholds and is thereby distinguished from the background. However, the manual adjustment procedure is inefficient and subject to human error.
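The sensitivity of such thresholds to lighting can be seen in a minimal sketch. The comparison against a stored background brightness, the function name `foreground_mask`, and all pixel values and thresholds below are assumptions chosen purely for illustration.

```python
def foreground_mask(frame, background, threshold):
    """Mark a pixel as foreground when its brightness differs from the
    background brightness by more than `threshold`."""
    return [
        [abs(pixel - bg) > threshold for pixel, bg in zip(frame_row, bg_row)]
        for frame_row, bg_row in zip(frame, background)
    ]


# One image row: road, vehicle, road (hypothetical brightness values).
day_background = [[100, 100, 100]]
day_frame = [[100, 200, 100]]
dusk_background = [[60, 60, 60]]
dusk_frame = [[60, 90, 60]]

day_mask = foreground_mask(day_frame, day_background, threshold=50)
dusk_mask_same = foreground_mask(dusk_frame, dusk_background, threshold=50)
dusk_mask_tuned = foreground_mask(dusk_frame, dusk_background, threshold=20)
```

With the daytime threshold of 50 the vehicle is found in the daytime row but missed in the dusk row, where the contrast is only 30; it reappears once the threshold is lowered to 20. This retuning is the manual step described above.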