Detecting humans in scenes is beneficial in a number of computer vision applications. Human detection is difficult because of external and internal factors. External factors include illumination variations, insufficient lighting, saturation due to bright lights such as headlights and floodlights, shadows, reflections, weather conditions, scene clutter, other objects, imaging noise, and the fidelity of the acquired data. Internal factors relate to articulated body parts that can move, rotate, and deform, and so take on different shapes and silhouettes. Humans can stand, lie, walk, run, bend, and make other body gestures. Appearance, e.g., height, weight, and clothing, differs significantly from one human to another. In addition, the human body presents different poses from different viewpoints. All of these factors make human detection more difficult than the detection of rigid objects.
Human detection methods can be categorized into two groups based on the modality of the input data.
Human Detection
Two types of sensors can be used for human detection: visual sensors, such as monocular cameras, and sensors that provide 3D geometric cues, such as single-layer or multi-layer light detection and ranging (LIDAR) sensors and motion detectors. The detectors acquire an input image and determine descriptors for portions (windows) of the image. A classifier then uses the descriptors to determine whether each window contains a human.
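The window-descriptor-classifier pipeline described above can be sketched as follows. This is a minimal illustration, not any particular detector: the function names, the mean-intensity descriptor, and the hand-set linear weights are all assumptions chosen to keep the example short, and a real detector would substitute a trained classifier and a richer descriptor.

```python
from typing import Iterator, List, Sequence, Tuple

Image = Sequence[Sequence[float]]
Window = Tuple[int, int, int, int]  # (row, col, height, width)


def sliding_windows(img_h: int, img_w: int, win_h: int, win_w: int,
                    stride: int) -> Iterator[Window]:
    """Yield every window that fits entirely inside the image."""
    for r in range(0, img_h - win_h + 1, stride):
        for c in range(0, img_w - win_w + 1, stride):
            yield (r, c, win_h, win_w)


def mean_intensity_descriptor(image: Image, window: Window) -> List[float]:
    """Toy one-dimensional descriptor (hypothetical stand-in for Haar, HOG, etc.)."""
    r, c, h, w = window
    total = sum(image[i][j] for i in range(r, r + h) for j in range(c, c + w))
    return [total / (h * w)]


def linear_classifier(descriptor: Sequence[float], weights: Sequence[float],
                      bias: float) -> bool:
    """Return True when the linear score is positive (weights are assumed given)."""
    score = sum(w * x for w, x in zip(weights, descriptor)) + bias
    return score > 0.0


def detect(image: Image, win_h: int, win_w: int, stride: int,
           weights: Sequence[float], bias: float) -> List[Window]:
    """Classify each window and collect the ones labeled as detections."""
    hits = []
    for window in sliding_windows(len(image), len(image[0]), win_h, win_w, stride):
        descriptor = mean_intensity_descriptor(image, window)
        if linear_classifier(descriptor, weights, bias):
            hits.append(window)
    return hits
```

In practice the same loop is usually repeated over several image scales so that humans of different apparent sizes fall inside a fixed-size window.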
One method uses Haar wavelets to construct the descriptors and to train multiple linear support vector machines (SVMs). Another method uses histograms of oriented gradients (HOGs). A rejection-cascaded, AdaBoosted classifier can be used with the HOGs to achieve real-time performance. Covariance (COV) features are also known, and a classifier can be based on an underlying Riemannian manifold. Those holistic methods achieve remarkable results, except when humans are occluded.
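To make the HOG descriptor concrete, the sketch below computes an orientation histogram for a single cell of a grayscale image. It is a simplified illustration under stated assumptions: central-difference gradients, unsigned orientations over 0 to 180 degrees, hard assignment of each pixel to one of nine bins, and L2 normalization of the single cell. Practical HOG implementations add vote interpolation between bins and normalize over overlapping blocks of cells. The function name `cell_hog` is hypothetical.

```python
import math
from typing import List, Sequence


def cell_hog(cell: Sequence[Sequence[float]], n_bins: int = 9) -> List[float]:
    """Magnitude-weighted histogram of unsigned gradient orientations for one cell."""
    h, w = len(cell), len(cell[0])
    hist = [0.0] * n_bins
    bin_width = 180.0 / n_bins
    # Border pixels are skipped because central differences need both neighbors.
    for i in range(1, h - 1):
        for j in range(1, w - 1):
            gx = cell[i][j + 1] - cell[i][j - 1]
            gy = cell[i + 1][j] - cell[i - 1][j]
            magnitude = math.hypot(gx, gy)
            if magnitude == 0.0:
                continue
            # Fold the signed angle into [0, 180) so opposite directions share a bin.
            angle = math.degrees(math.atan2(gy, gx)) % 180.0
            # The extra modulo guards against an angle that rounds to exactly 180.
            hist[int(angle // bin_width) % n_bins] += magnitude
    norm = math.sqrt(sum(v * v for v in hist))
    return [v / norm for v in hist] if norm > 0.0 else hist
```

A full window descriptor would concatenate such histograms over a grid of cells, and that vector is what a classifier such as a linear SVM or a boosted cascade would consume.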
Alternatively, detection can be done by identifying human body parts and their common shapes. In those methods, local features for body parts are determined and combined to form human models. Human silhouette information can also be taken into account to handle occlusions. However, performance depends strongly on the image resolution of the human body parts.
Other detectors use geometric cues, extracting features from 3D or range scan data. For example, oriented filters can be applied to spatial depth histograms. Instead of a classifier, a simple threshold operation can be performed to detect humans. Another method converts depth images to 3D point clouds. In yet another method, a dictionary is constructed from geodesic local interest points. That method has a high detection rate as long as humans are not occluded and not in contact with other objects.
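As an illustration of the depth-image-to-point-cloud step mentioned above, the following sketch back-projects each depth pixel through a pinhole camera model. The pinhole model and the intrinsic parameters `fx`, `fy`, `cx`, and `cy` are assumptions made for this example; the text does not specify which projection the referenced method uses.

```python
from typing import List, Sequence, Tuple

Point3D = Tuple[float, float, float]


def depth_to_point_cloud(depth: Sequence[Sequence[float]],
                         fx: float, fy: float,
                         cx: float, cy: float) -> List[Point3D]:
    """Back-project a depth image into camera-frame 3D points.

    depth[v][u] is the distance along the optical axis at pixel (u, v).
    Pixels with non-positive depth are treated as invalid and skipped.
    """
    points: List[Point3D] = []
    for v, row in enumerate(depth):
        for u, z in enumerate(row):
            if z <= 0.0:
                continue
            x = (u - cx) * z / fx  # horizontal offset scales with depth
            y = (v - cy) * z / fy  # vertical offset scales with depth
            points.append((x, y, z))
    return points
```

Features such as local interest points would then be computed on the resulting point set rather than on the 2D depth image.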
Another method uses a large feature vector of histograms of local depth information to represent humans. That method handles occlusions, but it is computationally complex and not suitable for real-time applications.
Another method uses a LIDAR scan to form a leg descriptor. That method extracts a number of predefined features from segmented line parts and trains classifiers. The method can detect humans only when there are no occlusions, the legs are visible, and the LIDAR is directed at the legs. That is, the method strictly requires the LIDAR scan to hit at leg level.
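The idea of segmenting a 2D LIDAR scan and computing features per segment can be sketched as below. This is an illustrative assumption, not the referenced method: the jump-distance rule, the 0.15 m threshold, and the two features shown (point count and end-to-end width) are hypothetical choices, and the actual predefined feature set is not given in the text.

```python
import math
from typing import Dict, List, Sequence, Tuple

Point2D = Tuple[float, float]


def polar_to_xy(ranges: Sequence[float], angle_min: float,
                angle_step: float) -> List[Point2D]:
    """Convert range readings taken at evenly spaced angles to Cartesian points."""
    return [(r * math.cos(angle_min + k * angle_step),
             r * math.sin(angle_min + k * angle_step))
            for k, r in enumerate(ranges)]


def segment_scan(points: Sequence[Point2D], jump: float = 0.15) -> List[List[Point2D]]:
    """Start a new segment whenever consecutive points are more than `jump` apart."""
    segments: List[List[Point2D]] = []
    current: List[Point2D] = []
    for p in points:
        if current:
            gap = math.hypot(p[0] - current[-1][0], p[1] - current[-1][1])
            if gap > jump:
                segments.append(current)
                current = []
        current.append(p)
    if current:
        segments.append(current)
    return segments


def segment_features(segment: Sequence[Point2D]) -> Dict[str, float]:
    """Two example features; a trained classifier would consume many more."""
    first, last = segment[0], segment[-1]
    width = math.hypot(last[0] - first[0], last[1] - first[1])
    return {"n_points": float(len(segment)), "width": width}
```

Because the segments come from a single horizontal scan plane, a leg-shaped segment appears only if that plane intersects the legs, which is the limitation noted above.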