The present embodiments relate to object detection and machine learning of the object detection.
For machine-learnt object detection, input features from the image are used to train and apply the detector. The quality of the features is crucial to the performance of many image analysis tasks. Various features have been proposed by scientists with a deep understanding of the data and the tasks at hand. For example, Haar features are used in organ detection and segmentation due to their inexpensive computation. Local Binary Pattern (LBP) features are good for representing shape or texture and are suitable for human re-identification. None of these features is optimal for all tasks or all types of data. Selecting a good feature for a particular application often requires significant experience.
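As an illustration of why Haar features are inexpensive to compute, the following is a minimal sketch, assuming a summed-area table (integral image) is used so that any rectangle sum costs four lookups. The function names and the toy 4×4 image are hypothetical and chosen only for this example; they are not taken from any particular detector.

```python
import numpy as np

def integral_image(img):
    """Summed-area table with a zero row and column prepended."""
    ii = np.zeros((img.shape[0] + 1, img.shape[1] + 1), dtype=np.int64)
    ii[1:, 1:] = img.cumsum(axis=0).cumsum(axis=1)
    return ii

def box_sum(ii, top, left, height, width):
    """Sum of pixels in a rectangle using four table lookups."""
    bottom, right = top + height, left + width
    return ii[bottom, right] - ii[top, right] - ii[bottom, left] + ii[top, left]

def haar_two_rect_horizontal(ii, top, left, height, width):
    """Left half minus right half of a window; width is assumed even."""
    half = width // 2
    left_sum = box_sum(ii, top, left, height, half)
    right_sum = box_sum(ii, top, left + half, height, half)
    return left_sum - right_sum

# Toy image (an assumption for illustration): bright left half, dark right half.
img = np.zeros((4, 4), dtype=np.int64)
img[:, :2] = 1
ii = integral_image(img)
response = haar_two_rect_horizontal(ii, 0, 0, 4, 4)
print(response)  # 8: left half sums to 8, right half sums to 0
```

Once the table is built, the cost of each feature is independent of the window size, which is one reason such features suit exhaustive scanning.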
The evolution of sensing technology or variance in existing sensing technology occasionally necessitates the design of a new feature. This process is often challenging since the underlying physics of data generation might not be easy to understand. Dictionary learning and sparse coding algorithms have been used to learn features. For example, these algorithms learn Gabor-like patterns when applied to natural images. However, this approach may produce unstable features due to the high coherence of dictionary atoms. In other words, if multiple dictionary atoms look alike, the sparse code may jump between these atoms in response to small changes in the input. Although this is not an issue for image reconstruction or image denoising applications, the instability of features may pose a serious challenge for learning a good classifier or detector.
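The instability caused by coherent atoms can be illustrated with a small sketch. It assumes a two-atom dictionary in two dimensions and a 1-sparse code that keeps only the atom most correlated with the input; this simplified coder, the angle values, and all names are assumptions made for illustration, not a specific published algorithm.

```python
import numpy as np

def unit(angle):
    """Unit vector in the plane at the given angle (radians)."""
    return np.array([np.cos(angle), np.sin(angle)])

def one_atom_code(D, x):
    """1-sparse code: keep only the atom most correlated with x."""
    corr = D.T @ x
    k = int(np.argmax(np.abs(corr)))
    code = np.zeros(D.shape[1])
    code[k] = corr[k]
    return code

theta = 0.02                                   # assumed small angle between atoms
D = np.column_stack([unit(0.0), unit(theta)])  # two look-alike atoms
coherence = abs(D[:, 0] @ D[:, 1])             # close to 1

# Two nearly identical inputs on either side of the bisector of the atoms.
x_a = unit(theta / 2 - 1e-3)
x_b = unit(theta / 2 + 1e-3)

code_a = one_atom_code(D, x_a)
code_b = one_atom_code(D, x_b)

input_gap = np.linalg.norm(x_a - x_b)          # tiny change in the input
code_gap = np.linalg.norm(code_a - code_b)     # large change in the feature
recon_err_a = np.linalg.norm(x_a - D @ code_a)
recon_err_b = np.linalg.norm(x_b - D @ code_b)

print(coherence, input_gap, code_gap, recon_err_a, recon_err_b)
```

In this sketch both inputs are reconstructed with small error, yet their codes activate different atoms, so the codes are far apart as feature vectors. This mirrors the observation that the instability is harmless for reconstruction but problematic for a classifier trained on the codes.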
Another approach to learning features with a machine uses an auto-encoder (AE) or a restricted Boltzmann machine (RBM). Stable features are produced since the decoding step only involves matrix multiplications. In addition, AEs or RBMs are often stacked to create a deep network of multiple layers. This hierarchical structure may capture more abstract features at the later layers. The output from the last layer of the deep network is used as a learned feature. Although the network may be fine-tuned using discriminative training, the output may not capture all relevant information. This can be caused by a lack of labelled samples or by the back-propagation optimization becoming stuck in a poor local minimum. In addition, the fully connected structure of the deep network makes it difficult to learn with large images or volumes. For example, a network would have about 30 million free parameters when learning 1000 features from color images of 100×100 pixels.
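The parameter count in the example above can be checked with a short calculation. The sketch below assumes a single fully connected layer, three color channels per pixel, and that bias terms are excluded from the count; the function name is hypothetical.

```python
def fully_connected_weights(height, width, channels, n_features):
    """Number of weights in one fully connected layer (biases excluded)."""
    n_inputs = height * width * channels  # every pixel value is an input
    return n_inputs * n_features          # one weight per input per feature

# 100 x 100 color image (3 channels assumed), 1000 learned features.
n_weights = fully_connected_weights(100, 100, 3, 1000)
print(n_weights)  # 30000000
```

Including one bias per feature would add only 1000 further parameters, so the total remains about 30 million.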