Detecting generic objects in high-resolution images is one of the most valuable pattern recognition tasks, useful for large-scale image labeling, scene understanding, action recognition, self-driving vehicles and robotics. At the same time, accurate detection is a highly challenging task due to cluttered backgrounds, occlusions, and perspective changes. Predominant approaches use deformable template matching with hand-designed features. However, these methods are not flexible when dealing with variable aspect ratios. Regionlets have been used for generic object detection and extends classic cascaded boosting classifiers with a two-layer feature extraction hierarchy which is dedicatedly designed for region based object detection. The innovative framework is capable of dealing with variable aspect ratios, flexible feature sets, and improves upon Deformable Part-based Model in terms of mean average precision. Despite the success of these sophisticated detection methods, the features employed in these frameworks are still traditional features based on low-level cues such as histogram of oriented gradients (HOG), local binary patterns (LBP) or covariance built on image gradients.
As with the success in large scale image classification, object detection using a deep convolutional neural network also shows promising performance. The dramatic improvements from the application of deep neural networks are believed to be attributable to their capability to learn hierarchically more complex features from large data-sets. Despite their excellent performance, the application of deep CNNs has been centered around image classification, which is computationally expensive when transferring to object detection. For example, the approach in [8] needs around 2 minutes to evaluate one image. Furthermore, their formulation of the problem does not take advantage of venerable and successful object detection frameworks such as DPM or Regionlets which are powerful designs for modeling object deformation, sub-categories and multiple aspect ratios.
The regionlets framework in object detection provides accurate generic object detection. Despite its great success, the features fed to this framework are still very low level features populated in previous literatures. On the other hand, the deep convolutional neural network (deep CNN) are well known as a powerful feature learning machine. Conventional methods apply a whole neural network for all possible object locations, leading to unaffordable computational cost. Typically, finding an object in an image costs several minutes or even hours.