As is known in the art, the capacity to automatically and robustly detect arbitrary objects in images and video is an important component of automated computer vision systems. Typical objects of interest include humans, animals, vehicles, packages, airplanes, boats, and buildings. Recognition of such objects in images and video is an important first step in a number of applications, including but not limited to automated video/image search, automated visual surveillance, robotics, and automated aerial reconnaissance. Automatically detecting arbitrary objects of interest in a given image or video sequence, however, is a difficult problem. The difficulty arises from wide variability in the appearance of the object of interest caused by viewpoint changes, illumination conditions, shadows, reflections, camera noise characteristics, object articulation (if any) and object surface properties. The detection problem is further exacerbated if the object of interest is only partially visible, whether due to occlusion by static scene structures or occlusion by other objects.
One way to get around the problem of detecting partially visible objects is to design and train several different detectors, each of which detects only a certain part of the object. For instance, to detect airplanes, one could design detectors to separately detect airplane wings, fuselage, landing gear and tail. To complete the system, a final step would then be needed to infer the existence of an airplane given the detection of the above-mentioned parts. It is important to note that such a part based object detection strategy would be prone to generating a large number of false positives. This is primarily because background structures are more likely to resemble a single part of the object than the complete object. The final step, a process of going from a set of possibly erroneous object part detections to a scene consistent, context sensitive set of object hypotheses, is far from trivial.
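The final inference step described above can be illustrated with a deliberately simple grouping rule. The sketch below is not a method taken from any of the cited work: it assumes each part detection already carries a predicted object center and a confidence score, and the names `PartDetection`, `group_parts`, `cell` and `min_parts` are hypothetical. Detections are binned on a coarse grid of voted centers, and a hypothesis is kept only where several distinct part types agree, which shows how an isolated, background-induced part detection can be rejected.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class PartDetection:
    part: str     # part type, e.g. "wing", "fuselage", "tail"
    cx: float     # x of the object center this detection votes for
    cy: float     # y of the object center this detection votes for
    score: float  # detector confidence, assumed to lie in [0, 1]


def group_parts(detections, cell=50.0, min_parts=2):
    """Group part detections whose voted centers fall in the same grid cell.

    A cell yields an object hypothesis only if at least `min_parts`
    distinct part types voted for it. Grid binning is an assumption made
    for brevity; nearby votes that straddle a cell border are not merged.
    """
    votes = defaultdict(dict)
    for d in detections:
        key = (int(d.cx // cell), int(d.cy // cell))
        best = votes[key].get(d.part)
        # Keep only the strongest detection of each part type per cell.
        if best is None or d.score > best.score:
            votes[key][d.part] = d

    hypotheses = []
    for parts in votes.values():
        if len(parts) < min_parts:
            continue  # a lone part is treated as a likely false positive
        n = len(parts)
        hypotheses.append({
            "center": (sum(p.cx for p in parts.values()) / n,
                       sum(p.cy for p in parts.values()) / n),
            "parts": sorted(parts),
            "score": sum(p.score for p in parts.values()) / n,
        })
    return hypotheses
```

A rule of this kind discards isolated detections, but it cannot explain parts that are missing because of occlusion, which is one reason the inference step is far from trivial.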
As is also known in the art, the capacity to robustly detect humans in video is a critical component of automated visual surveillance systems. The primary objective of an automated visual surveillance system is to observe and understand human behavior and report unusual or potentially dangerous activities/events in a timely manner. Realization of this objective requires, at its most basic level, the capacity to robustly detect humans from input video. Human detection, however, is a difficult problem. The difficulty arises from wide variability in appearance due to clothing, articulation, viewpoint changes, illumination conditions, shadows and reflections, among other factors. While detectors can be trained to handle some of these variations and detect humans individually as a whole, their performance degrades when humans are only partially visible due to occlusion, either by static structures in the scene or by other humans. Human body part based detectors are better suited to handle such situations because they can be used to detect the un-occluded human body parts. However, the process of going from a set of partial human body part detections to a set of scene consistent, context sensitive human hypotheses is far from trivial.
Since human body part based detectors learn only part of the information available from the whole human body, they are typically less reliable and tend to generate large numbers of false positives. Occlusions and local image noise characteristics also lead to missed detections. It is therefore important to exploit contextual, scene geometry and human body constraints not only to weed out false positives, but also to explain as many valid missing human body parts as possible, so that occluded humans are correctly detected.
Approaches to detect humans from images/video tend to fall primarily into two categories: those that detect the human as a whole and those that detect humans based on human body part detectors. Among approaches that detect humans as a whole, Leibe et al. [see B. Leibe, E. Seemann, and B. Schiele. Pedestrian detection in crowded scenes. In IEEE CVPR'05, San Diego, Calif., pages 878-885, May 2005] employ an iterative method combining local and global cues via a probabilistic segmentation. Gavrila [see D. Gavrila and V. Philomin. Real-time object detection for smart vehicles. In ICCV99, pages 87-93, 1999 and D. Gavrila. Pedestrian detection from a moving vehicle. In ECCV00, pages II: 37-9, 2000] uses edge templates to recognize full body patterns, Papageorgiou et al. [see C. Papageorgiou, T. Evgeniou, and T. Poggio. A trainable pedestrian detection system. Intelligent Vehicles, pages 241-246, October 1998] use SVM detectors, and Felzenszwalb [see P. Felzenszwalb. Learning models for object recognition. In CVPR01, pages I: 1056-1062, 2001] uses shape models. A popular detector used in such systems is a cascade of detectors trained using AdaBoost as proposed by Viola and Jones [see P. Viola and M. Jones. Rapid object detection using a boosted cascade of simple features. IEEE Conference on Computer Vision and Pattern Recognition (CVPR'01), 2001]. This approach uses several Haar wavelets as features and has been applied very successfully to face detection [see P. Viola and M. Jones. Rapid object detection using a boosted cascade of simple features. IEEE Conference on Computer Vision and Pattern Recognition (CVPR'01), 2001 and P. Viola, M. Jones, and D. Snow. Detecting pedestrians using patterns of motion and appearance. In ICCV03, pages 734-741, 2003]. Viola et al. [see P. Viola, M. Jones, and D. Snow. Detecting pedestrians using patterns of motion and appearance. In ICCV03, pages 734-741, 2003] applied this detector to detect pedestrians and observed that Haar wavelets [for Haar wavelet features, see B. Wu and R. Nevatia. Detection of multiple, partially occluded humans in a single image by bayesian combination of edgelet part detectors. ICCV, October 2005, Beijing and T. Zhao and R. Nevatia. Bayesian human segmentation in crowded situations. CVPR, 2:459-466, 2003, and B. Leibe, E. Seemann, and B. Schiele. Pedestrian detection in crowded scenes. In IEEE CVPR'05, San Diego, Calif., pages 878-885, May 2005] are insufficient by themselves as features for human detection; they augmented their system with simple motion cues to obtain better performance. Another feature that is increasing in popularity is the histogram of oriented gradients. It was introduced by Dalal and Triggs [see N. Dalal and B. Triggs. Histograms of oriented gradients for human detection. In CVPR05, pages I: 886-893, 2005], who used an SVM based classifier. This was further extended by Zhu et al. [see Q. Zhu, M. Yeh, K. Cheng, and S. Avidan. Fast human detection using a cascade of histograms of oriented gradients. In CVPR06, pages II: 1491-1498, 2006] to detect whole humans using a cascade of histograms of oriented gradients.
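For concreteness, the Haar wavelet features used by the boosted cascade discussed above are differences of pixel sums over adjacent rectangles, and each rectangle sum can be read from an integral image (summed-area table) with four lookups. The following is a minimal sketch of one such two-rectangle feature, assuming a grayscale image stored as a list of rows; the function names are illustrative and are not taken from the cited work.

```python
def integral_image(img):
    """Return an (h+1) x (w+1) summed-area table with a zero top row and left column."""
    h, w = len(img), len(img[0])
    ii = [[0] * (w + 1) for _ in range(h + 1)]
    for y in range(h):
        row_sum = 0
        for x in range(w):
            row_sum += img[y][x]
            # Sum of all pixels above and to the left of (x, y), inclusive.
            ii[y + 1][x + 1] = ii[y][x + 1] + row_sum
    return ii


def rect_sum(ii, x, y, w, h):
    """Sum of the w x h rectangle whose top-left pixel is (x, y)."""
    return ii[y + h][x + w] - ii[y][x + w] - ii[y + h][x] + ii[y][x]


def haar_two_rect_horizontal(ii, x, y, w, h):
    """Left-half sum minus right-half sum; w is assumed to be even."""
    half = w // 2
    return rect_sum(ii, x, y, half, h) - rect_sum(ii, x + half, y, half, h)
```

A large magnitude indicates a vertical intensity edge inside the window, while a uniform region gives zero. A boosted cascade combines many such features evaluated at different positions and scales.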
Human body part based representations have also been used to detect humans. Wu and Nevatia [see B. Wu and R. Nevatia. Detection of multiple, partially occluded humans in a single image by bayesian combination of edgelet part detectors. ICCV, October 2005, Beijing] use edgelet features and learn nested cascade detectors [see C. Huang, H. Ai, B. Wu, and S. Lao. Boosting nested cascade detector for multi-view face detection. In ICPR04, pages II: 415-418, 2004] for each of several body parts and detect the whole human using an iterative probabilistic formulation. Mikolajczyk et al. [see K. Mikolajczyk, C. Schmid, and A. Zisserman. Human detection based on a probabilistic assembly of robust part detectors. In ECCV, May 2004] divide the human body into seven human body parts and, for each human body part, apply a Viola-Jones approach to orientation features. Mohan et al. [see A. Mohan, C. Papageorgiou, and T. Poggio. Example-based object detection in images by components. PAMI, 23(4): 349-361, April 2001] divide the human into four different human body parts and learn SVM detectors using Haar wavelet features. Several approaches [see B. Wu and R. Nevatia. Detection of multiple, partially occluded humans in a single image by bayesian combination of edgelet part detectors. ICCV, October 2005, Beijing and T. Zhao and R. Nevatia. Bayesian human segmentation in crowded situations. CVPR, 2:459-466, 2003, and B. Leibe, E. Seemann, and B. Schiele. Pedestrian detection in crowded scenes. In IEEE CVPR'05, San Diego, Calif., pages 878-885, May 2005] follow up low level detections with some form of high level reasoning that allows them to enforce global constraints, weed out false positives, and increase accuracy.
Logical reasoning has been used in visual surveillance applications to recognize the occurrence of different human activities [see V. Shet, D. Harwood, and L. Davis. Vidmap: video monitoring of activity with prolog. In IEEE AVSS, pages 224-229, 2005] and, in conjunction with the bilattice framework, to maintain and reason about human identities as well [see V. Shet, D. Harwood, and L. Davis. Multivalued default logic for identity maintenance in visual surveillance. In ECCV, pages IV: 119-132, May 2006].