Given a static image or a video, the basic goal of pedestrian detection is to automatically and accurately locate the positions of human body instances appeared in the image plane. It has many important applications including surveillance, autonomous driving, social robotics, animations for gaming and the caring for the elderly. Consequently, developing automated vision systems for effective pedestrian detection is attracting ever growing attention in both academic and industrial fields.
Pedestrian detection has been an active research area in computer vision for a few decades. It has served as the playground to explore ideas for generic object detection tasks. Existing methods for pedestrian detection can be grouped into three solution families: (1) Deformable Part Models (DPM) and its invariants; (2) boosting with cascaded decision trees; and (3) Convolutional Neuron Networks (CNN). To get good accuracy, the first two solution families usually use complicated hand-crafted features such as HOG (Histograms of Oriented Gradients) and an ensemble of local classifiers. Furthermore, most of them rely on sliding window detection paradigm in which approximately 1000K candidate windows should be scanned even for a 640×480-pixel image. This places a heavy computational burden on offline trained detectors.
In the most recent three years, deep CNN based methods have demonstrated completely leading performance in many popular computer vision tasks such as object classification, generic object detection, face recognition, image segmentation, and so forth.
Available CNN related methods primarily address pedestrian detection in three different ways; (1) CNN is just used as the feature pool for augmenting feature selection; (2) CNN is employed to refine the output results of traditional pedestrian detectors; and (3) CNN serves as the primary solution to handle pedestrian detection. However, the performance of known such kind of methods does not reach to the state-of-the-art. Furthermore, they still use sliding window detection paradigm. That is, in pedestrian detection, these methods historically underperform the best known traditional methods. The improved performance is only manifested in the works of using CNNs to augment input features for boosting or refining the outputs of traditional pedestrian detectors.