1. Field of Invention
The present patent document is directed towards systems and methods for object detection. More particularly, the present patent document is directed towards systems and methods for generating and using object detection models for recognizing objects in an image (video or still image).
2. Description of the Related Art
Object detection from images can be important to many applications, such as surveillance, robotics, manufacturing, security, medicine, and automotive safety, to name a few areas of application. However, object detection is among the most challenging vision tasks.
Among various approaches to object detection, the sliding window approach dominates due to its good performance, efficiency, parallelizability, and ease of implementation. Sliding-window-based detectors treat object detection as a classification problem. Typically, the whole image is densely scanned from the top left to the bottom right with rectangular scanning windows of different sizes. For each scanned rectangle, certain features, such as edge histograms, texture histograms, shape-based features, pose-invariant features, wavelet coefficients, or combinations thereof, are extracted and supplied to a classifier that has been trained offline using labeled training data. The classifier is trained to classify any rectangle bounding an object of interest as a positive sample and to classify all other rectangles as negative samples.
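By way of illustration only, the scanning procedure described above may be sketched as follows. This is a minimal sketch under stated assumptions, not an implementation of any particular prior detector: the function names (sliding_windows, mean_intensity, detect), the stride parameter, the mean-intensity feature, and the fixed threshold standing in for an offline trained classifier are all hypothetical and chosen only to keep the example self-contained.

```python
from typing import Callable, Iterator, List, Sequence, Tuple

Window = Tuple[int, int, int, int]  # (x, y, width, height)
Image = Sequence[Sequence[float]]   # rows of pixel intensities


def sliding_windows(image_w: int, image_h: int,
                    sizes: List[Tuple[int, int]],
                    stride: int) -> Iterator[Window]:
    """Yield windows from top left to bottom right for each window size."""
    for win_w, win_h in sizes:
        for y in range(0, image_h - win_h + 1, stride):
            for x in range(0, image_w - win_w + 1, stride):
                yield (x, y, win_w, win_h)


def mean_intensity(image: Image, win: Window) -> float:
    """Toy feature (assumption): average pixel value inside the window."""
    x, y, w, h = win
    pixels = [image[r][c] for r in range(y, y + h) for c in range(x, x + w)]
    return sum(pixels) / len(pixels)


def detect(image: Image,
           sizes: List[Tuple[int, int]],
           stride: int,
           extract: Callable[[Image, Window], float],
           classify: Callable[[float], bool]) -> List[Window]:
    """Return every window that the classifier labels as positive."""
    height, width = len(image), len(image[0])
    return [win for win in sliding_windows(width, height, sizes, stride)
            if classify(extract(image, win))]


# Toy 4x4 image with a bright 2x2 "object" in the middle.
image = [
    [0, 0, 0, 0],
    [0, 9, 9, 0],
    [0, 9, 9, 0],
    [0, 0, 0, 0],
]

# A threshold stands in for the offline trained classifier (assumption).
detections = detect(image, sizes=[(2, 2)], stride=1,
                    extract=mean_intensity,
                    classify=lambda feature: feature > 8.0)
```

In this sketch each window is scored independently of the others, which is consistent with the parallelizability noted above. A practical detector would replace the toy feature and threshold with extracted features and a trained classifier of the kinds listed above.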
The performance of sliding-window-based detectors is mainly determined by two factors: the feature and the underlying classification algorithm. Many supervised learning algorithms, such as various boosting algorithms, Support Vector Machines (SVMs) of different kinds (including linear, kernel, multi-kernel, latent, and structured SVMs), and Convolutional Neural Networks (CNNs), have been applied to object detection during the past decade. The selection of the underlying classifier or regressor is determined by various factors, including the feature, the distribution of the training data, and the computational complexity.
To ensure that the detector has enough learning capacity to learn from training data and generalizes well, people frequently resort to the Occam's razor principle to select underlying classifiers; namely, they want to select a classifier that is as simple as possible while still performing well on the training data. A key issue, given a spectrum of classifiers with different model complexities, is whether it is possible to automatically select a classifier with appropriate complexity and to learn the corresponding model parameters. When the distribution of data in the input space is uneven, local learning algorithms can adjust the learning capacity according to the local data distribution to improve the overall performance. Various approaches have been proposed to tackle the problem of high variance of data complexity in the input space. For example, at least one method has been proposed that combines a Support Vector Machine (SVM) with k-Nearest Neighbor (KNN), referred to as SVM-KNN, and that attempts to handle this problem, but at the expense of high computational complexity. Alternatively, when the data distribution can be effectively approximated using a number of clusters, algorithms based on tree or forest models have been used successfully and yield high performance. During training, a hierarchical discriminative tree model is recursively constructed in which each node contains a cluster of data that are then separated by using classifiers learned exclusively from that cluster.
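By way of illustration only, the KNN-guided local learning idea described above may be sketched as follows. This is a simplified sketch under stated assumptions and is not the SVM-KNN method itself: to keep the example self-contained, a nearest-class-mean rule is substituted for the SVM that would be trained on the neighbors, and the names k_nearest and local_predict, the Euclidean distance, and the toy data are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

Sample = Tuple[Sequence[float], int]  # (feature vector, class label)


def k_nearest(train: List[Sample], query: Sequence[float],
              k: int) -> List[Sample]:
    """Probe the locality: the k training samples closest to the query."""
    return sorted(train, key=lambda s: math.dist(s[0], query))[:k]


def local_predict(train: List[Sample], query: Sequence[float],
                  k: int) -> int:
    """Classify the query with a classifier built only from its neighbors."""
    neighbors = k_nearest(train, query, k)
    labels = {label for _, label in neighbors}
    if len(labels) == 1:
        # All neighbors agree, so no local classifier is needed.
        return labels.pop()
    # Stand-in local classifier (assumption): nearest class mean,
    # used here in place of an SVM trained on the neighbors.
    best_label, best_dist = -1, math.inf
    for label in sorted(labels):
        members = [x for x, lab in neighbors if lab == label]
        mean = [sum(col) / len(members) for col in zip(*members)]
        dist = math.dist(mean, query)
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label


# Toy two-cluster training set (hypothetical data).
train = [
    ((0.0, 0.0), 0), ((0.0, 1.0), 0), ((1.0, 0.0), 0),
    ((5.0, 5.0), 1), ((5.0, 6.0), 1), ((6.0, 5.0), 1),
]

near_first_cluster = local_predict(train, (0.2, 0.1), k=3)
near_second_cluster = local_predict(train, (5.5, 5.2), k=3)
between_clusters = local_predict(train, (4.0, 4.0), k=4)
```

In this sketch every query computes its distance to all training samples before any local classifier is built, and the value of k is the same for every query regardless of how the surrounding data are distributed.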
However, three main difficulties still exist in real-world applications. First, probing the local data distribution is computationally prohibitive. For example, some of the prior methods rely on the k-Nearest Neighbor (KNN) algorithm to guide the local classifiers for each testing sample. This probing procedure limits the application of such algorithms in large-scale learning tasks such as object detection. Second, the localities depend on the data distribution. In KNN-based algorithms, a region with a simple distribution should be covered with a relatively small K, whereas a region with a complicated distribution should be covered with a large K. The “K” in the KNN algorithm is a constant and therefore cannot adapt in this way. Third, the performance of an exclusively learned local classifier relies on the population of its cluster and ignores the potential strength it could borrow from data in other clusters. In tree-based methods, for example, complex data distributions may lead to low-population clusters, leaving the exclusively learned local classifiers potentially undertrained.
Accordingly, systems and methods are needed that can address these difficulties and produce better detection results when trying to detect an object or item in an image (still image or video image).