A. Technical Field
The present invention relates to computer processing and, more particularly, to systems, devices, and methods for end-to-end object recognition in computer vision applications.
B. Description of the Related Art
Our day-to-day lives abound with instances of object detection. Detecting nearby vehicles while driving a car and localizing a familiar face are both examples of object detection. Object detection is one of the core tasks in computer vision applications. Before the success of convolutional neural networks (CNNs), object detection generally involved sliding-window-based methods that apply classifiers to handcrafted features extracted at all possible locations and at various scales of an image. More recently, fully convolutional neural network (FCN) based methods have revolutionized the field of object detection. While FCN frameworks also use a sliding-window method, their end-to-end approach of learning model parameters and image features from scratch significantly improves detection performance.
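By way of illustration only, the sliding-window approach described above may be sketched as follows. This is a minimal sketch under stated assumptions, not an implementation of any particular prior-art system: the function names `sliding_windows` and `detect`, the scoring callback `score_fn`, and the default window size, stride, scales, and threshold are all hypothetical and chosen only to show how a classifier is evaluated at many locations and scales.

```python
from typing import Callable, Iterator, List, Tuple

# A box is (x, y, width, height) in pixels. Hypothetical convention.
Box = Tuple[int, int, int, int]


def sliding_windows(image_w: int, image_h: int, window: int,
                    stride: int, scales: Tuple[float, ...]) -> Iterator[Box]:
    """Yield square candidate boxes at every stride step and every scale."""
    for scale in scales:
        size = int(round(window * scale))
        # Skip scales whose window does not fit inside the image.
        if size <= 0 or size > image_w or size > image_h:
            continue
        for y in range(0, image_h - size + 1, stride):
            for x in range(0, image_w - size + 1, stride):
                yield (x, y, size, size)


def detect(image_w: int, image_h: int,
           score_fn: Callable[[Box], float],
           threshold: float = 0.5, window: int = 32, stride: int = 16,
           scales: Tuple[float, ...] = (1.0, 2.0)) -> List[Tuple[Box, float]]:
    """Score every window with score_fn and keep those at or above threshold.

    score_fn stands in for a classifier applied to features extracted
    from the window; it is an assumption of this sketch.
    """
    detections = []
    for box in sliding_windows(image_w, image_h, window, stride, scales):
        score = score_fn(box)
        if score >= threshold:
            detections.append((box, score))
    return detections
```

The number of windows grows with the image size, the number of scales, and the inverse of the stride, which is one reason exhaustive sliding-window evaluation is costly.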
Region-based CNN (R-CNN) methods further improve the accuracy of object detection beyond FCN-based methods. Conceptually, R-CNN operates in two phases. In a first phase, region proposal methods generate all potential bounding box candidates in the image. In a second phase, for every proposal, a CNN classifier is applied to distinguish between objects. Although R-CNN is gradually evolving as the new state-of-the-art system for general object detection, it continues to suffer from the inability to detect small objects, such as human faces, and distant objects, such as cars, because the low resolution and lack of context in each candidate box significantly decrease classification accuracy. Moreover, the two different stages in the R-CNN pipeline cannot be optimized jointly, which makes the application of end-to-end training on R-CNN rather problematic. Accordingly, what is needed are systems and methods that overcome the limitations of existing designs.
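For illustration only, the two-phase structure of the R-CNN pipeline discussed above may be sketched as follows. This is a simplified sketch under stated assumptions: the function `two_phase_detect`, the callbacks `propose` and `classify`, the `"background"` label, and the `min_score` parameter are hypothetical placeholders and do not correspond to any particular R-CNN implementation.

```python
from typing import Callable, List, Sequence, Tuple

# A box is (x, y, width, height) in pixels. Hypothetical convention.
Box = Tuple[int, int, int, int]


def two_phase_detect(
    image: object,
    propose: Callable[[object], Sequence[Box]],
    classify: Callable[[object, Box], Tuple[str, float]],
    min_score: float = 0.5,
) -> List[Tuple[Box, str, float]]:
    """Run a proposal stage, then classify each proposed box independently."""
    # Phase 1: a separate region proposal method generates candidate boxes.
    proposals = propose(image)

    # Phase 2: a classifier is applied to every proposal in turn.
    results = []
    for box in proposals:
        label, score = classify(image, box)
        if label != "background" and score >= min_score:
            results.append((box, label, score))
    return results
```

In this sketch `propose` and `classify` are independent callables, which mirrors the observation above that the two stages are not optimized jointly.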