The present invention relates to object detection with improved object deformation handling.
Object detection involves recognizing and localizing objects of a specific category within an image. Deformable objects can appear in diverse poses, which places a heavy burden on the object detector. One of the most popular approaches to handling deformation is the deformable part-based model. However, it fails to demonstrate its capability to solve deformation problems when tested on the car and dog categories. Other approaches employ the bag-of-words (BoW) model for object detection. However, the BoW model completely discards the spatial layout, which results in poor detection performance when it is applied to rigid objects that exhibit little deformation.
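The loss of spatial layout in a BoW representation can be illustrated with a minimal sketch. The patch grids and visual-word indices below are hypothetical values chosen for illustration, and `bow_histogram` is an illustrative helper, not part of any described system. Two images whose patches are arranged differently yield the same histogram, so a classifier operating on the histogram alone cannot tell them apart.

```python
from collections import Counter

def bow_histogram(word_grid):
    """Count visual-word occurrences, discarding where each word occurred."""
    return Counter(word for row in word_grid for word in row)

# Hypothetical visual-word indices assigned to a 2x3 grid of image patches.
upright = [[0, 1, 0],
           [2, 2, 3]]

# The same patches with the two rows swapped, i.e. a different spatial layout.
scrambled = [[2, 2, 3],
             [0, 1, 0]]

# The layouts differ, yet the BoW histograms are identical.
same_histogram = bow_histogram(upright) == bow_histogram(scrambled)
print(same_histogram)
```

This invariance is what lets a BoW model tolerate large deformations, and it is also why such a model cannot localize a rigid object precisely.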
Conventional object detection systems cope with object deformation primarily through three strategies. First, if the spatial layout of object appearances is roughly rigid, as with faces or pedestrians at a distance, classical AdaBoost detection mainly tackles local variations with an ensemble classifier of efficient features. A sliding window search with cascaded classifiers is then an effective way to achieve precise and efficient localization. Second, the deformable part model (DPM) method inherits HOG window template matching but explicitly models deformations with latent variables, where an exhaustive search over possible locations, scales, and aspect ratios is critical to localizing objects. The DPM has since been accelerated by coarse-to-fine search, branch and bound, and cross-talk approaches. Third, object recognition methods using spatial pyramid matching (SPM) of bag-of-words (BoW) models are adopted for detection, and they can inherently tolerate large deformations. These sophisticated detectors are applied to thousands of object-independent candidate regions instead of millions of sliding windows. In return, the limited modeling of local spatial appearance leaves these recognition classifiers unable to localize rigid objects, e.g., bottles, precisely. These successful detection approaches inspire us to investigate a descriptive and flexible object representation that delivers the modeling capacity for both rigid and deformable objects in a unified framework.
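The first strategy, a sliding window search with cascaded classifiers, can be sketched as follows. This is a simplified illustration under stated assumptions, not the classifier of any particular system: the toy image, the two stage functions (`mean_intensity`, `min_intensity`), and their thresholds are hypothetical stand-ins for boosted stages trained on real features. The sketch shows only the control flow, in which cheap early stages reject most windows before later stages are evaluated.

```python
def sliding_windows(width, height, win, stride):
    """Yield the (x, y) top-left corner of every win x win window."""
    for y in range(0, height - win + 1, stride):
        for x in range(0, width - win + 1, stride):
            yield x, y

def cascade_accepts(patch, stages):
    """Evaluate stages in order and reject at the first one that fails."""
    for score_fn, threshold in stages:
        if score_fn(patch) < threshold:
            return False  # early rejection: later stages are never run
    return True

def detect(image, win, stride, stages):
    """Return the top-left corners of all windows accepted by the cascade."""
    height, width = len(image), len(image[0])
    hits = []
    for x, y in sliding_windows(width, height, win, stride):
        patch = [row[x:x + win] for row in image[y:y + win]]
        if cascade_accepts(patch, stages):
            hits.append((x, y))
    return hits

# Hypothetical stage scores standing in for trained weak-classifier ensembles.
def mean_intensity(patch):
    return sum(map(sum, patch)) / (len(patch) * len(patch[0]))

def min_intensity(patch):
    return min(map(min, patch))

# Toy 4x4 image containing one bright 2x2 "object" at x=2, y=1.
image = [
    [0, 0, 0, 0],
    [0, 0, 9, 9],
    [0, 0, 9, 9],
    [0, 0, 0, 0],
]

stages = [(mean_intensity, 5.0), (min_intensity, 9.0)]
detections = detect(image, win=2, stride=1, stages=stages)
print(detections)
```

Even in this toy case every window except the one covering the object is rejected by the first stage, which is the property that makes a cascade efficient when millions of windows must be evaluated.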
Generic object detection must deal with different degrees of variation across distinct object classes using tractable computation, which demands descriptive and flexible object representations that are also efficient to evaluate at many locations. Despite the success of face detection, where the target objects are roughly rigid, generic object detection remains an open problem, mainly because of the challenge of handling all possible variations with tractable computation. In particular, different object classes exhibit varying degrees of deformation in images, either due to their nature, e.g., living creatures such as cats are generally more deformable than man-made objects such as vehicles, or due to viewing distance or angle, e.g., deformable objects may appear somewhat rigid at a distance, and even rigid objects may show larger variations across different view angles. This poses a fundamental dilemma for object class representations: on one hand, a delicate model describing rigid object appearances can hardly handle deformable objects; on the other hand, a high tolerance of deformation may result in imprecise localization or false positives for rigid objects.