The present invention relates to selective max-pooling for object detection.
Generic object detection must handle different degrees of variation across distinct object classes with tractable computation. This demands object representations that are descriptive and flexible, yet efficient to evaluate at many locations.
Despite the success of face detection, where the target objects are roughly rigid, generic object detection remains an open problem, mainly because of the challenge of handling all possible variations with tractable computation. In particular, different object classes exhibit different degrees of deformation in images. This may be due to their nature, e.g., living creatures such as cats are generally more deformable than man-made objects such as vehicles, or due to viewing distance or angle, e.g., deformable objects may appear somewhat rigid at a distance, and even rigid objects may show large variations across viewing angles. These variations pose a fundamental dilemma for object class representations: on one hand, a delicate model describing rigid object appearances can hardly handle deformable objects; on the other hand, a high tolerance of deformation may result in imprecise localization or false positives for rigid objects.
Conventional object detection systems cope with object deformation efficiently using primarily three strategies. First, if the spatial layout of object appearances is roughly rigid, as with faces or pedestrians at a distance, classical AdaBoost detection mainly tackles local variations with an ensemble classifier of efficient features. A sliding window search with cascaded classifiers is then an effective way to achieve precise and efficient localization. Second, the deformable part model (DPM) method inherits window template matching on histogram of oriented gradients (HOG) features but explicitly models deformations by latent variables, where an exhaustive search of possible locations, scales, and aspect ratios is critical to localizing objects. The DPM has since been accelerated by coarse-to-fine search, branch and bound, and cross-talk approaches. Third, object recognition methods using spatial pyramid matching (SPM) of bag-of-words (BoW) models have been adopted for detection, and they inherently tolerate large deformations. These sophisticated detectors are applied to thousands of object-independent candidate regions instead of millions of sliding windows. However, because they model little of the local spatial appearance, these recognition classifiers are unable to localize rigid objects, e.g., bottles, precisely. These successful detection approaches motivate the investigation of a descriptive and flexible object representation that delivers the modeling capacity for both rigid and deformable objects in a unified framework.
Features are arguably the most important part of a recognition system. A feature appearing at position “A” in one detection region may be located at a different position “B” in another. The proposed approach focuses on improving the robustness of features for general object detection. Prior art addresses this problem by using larger feature extraction regions. However, larger regions also include more irrelevant noise, which places a heavy burden on the learning process.
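The trade-off described above can be illustrated with a minimal sketch. It is not the claimed method; it only shows, on made-up one-dimensional feature responses, how taking the maximum over a pooling window tolerates a shift in feature position, and how enlarging the window can admit an unrelated strong response. The function name `max_pool_1d`, the window sizes, and all response values are hypothetical and chosen for illustration only.

```python
def max_pool_1d(responses, start, width):
    """Return the maximum response in the window [start, start + width)."""
    window = responses[start:start + width]
    if not window:
        raise ValueError("pooling window is empty")
    return max(window)


# Hypothetical feature responses for two detection regions. The same
# feature (response 0.9) sits at index 2 in region_a and index 4 in region_b.
region_a = [0.1, 0.0, 0.9, 0.1, 0.0, 0.2, 0.0, 0.1]
region_b = [0.0, 0.1, 0.0, 0.2, 0.9, 0.0, 0.1, 0.0]

# Reading the response at a fixed position misses the shifted feature.
fixed_a = region_a[2]
fixed_b = region_b[2]

# Pooling over indices 2..4 recovers the feature in both regions.
pooled_a = max_pool_1d(region_a, start=2, width=3)
pooled_b = max_pool_1d(region_b, start=2, width=3)

# Hypothetical region where the feature is weak (0.3) and an unrelated
# strong response (0.95) lies far away at index 7.
cluttered = [0.0, 0.1, 0.0, 0.2, 0.3, 0.0, 0.1, 0.95]
small_window = max_pool_1d(cluttered, start=2, width=3)  # stays local
large_window = max_pool_1d(cluttered, start=0, width=8)  # picks up clutter

print(fixed_a, fixed_b)            # 0.9 0.0
print(pooled_a, pooled_b)          # 0.9 0.9
print(small_window, large_window)  # 0.3 0.95
```

In this toy setting, the small window keeps the pooled value tied to the local feature, while the region-wide window reports the distant response, which is the kind of irrelevant noise that a learning process would then have to discount.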