Object detection and identification/classification are important aspects of many systems. These functions are based on the processing and interpretation of images, and are used in many applications and settings involving image, object, and pattern recognition, typically as part of a decision process. Example applications include security, access control, identification/authentication, machine vision, artificial intelligence, engineering, manufacturing, robotics, systems control, autonomous vehicles, and other situations involving some form of object or pattern recognition, object detection, or automated decision-making based on an image.
A neural network is a system of interconnected artificial “neurons” that exchange messages between each other. The connections have numeric weights that are tuned during the training process, so that a properly trained network will respond correctly when presented with an image or pattern to recognize. The network consists of multiple layers of feature-detecting “neurons”. Each layer has many neurons that respond to different combinations of inputs from the previous layers. Training of a network is performed using a “labeled” dataset of inputs in a wide assortment of representative input patterns that are associated with their intended output response. Training uses general-purpose methods to iteratively determine the weights for intermediate and final feature neurons. In terms of a computational model, each neuron calculates the dot product of inputs and weights, adds the bias, and applies a non-linear trigger function (for example, using a sigmoid response function). Deep neural networks (DNN) have shown significant improvements in several application domains including computer vision and speech recognition. In computer vision, a particular type of DNN, known as a Convolutional Neural Network (CNN), has demonstrated state-of-the-art results in object recognition and detection.
A convolutional neural network (CNN) is a special case of the neural network described above. A CNN consists of one or more convolutional layers, often with a subsampling layer, which are followed by one or more fully connected layers, as in a standard neural network. Convolutional neural networks (CNN) have been used for purposes of image processing, and have shown reliable results in object recognition and detection tasks that are useful in real world applications. Convolutional layers are useful for image processing, as they extract features from images relatively quickly and learn to extract the right features for the problem they are trained on (e.g., convolutional layers trained for classification will learn different filters (i.e., weights) than layers trained for regression, because different aspects or characteristics matter in each of those scenarios).
FIG. 1(a) is a diagram illustrating elements or stages of a conventional convolutional neural network (CNN) 100, showing one or more convolutional stages 102, sub-sampling 104, and fully connected stages 106 leading to the production of an output 108; FIG. 1(b) is a diagram illustrating a conventional image-processing pipeline, which may incorporate a convolutional neural network (CNN). As shown in FIG. 1(a), input data (such as a digitized representation of an image) is provided to one or more convolutional stages 102 (represented as “1st Stage” and “2nd Stage” in the figure). The output of each convolutional stage is provided as an input to the following stage; in some cases, further subsampling operations 104 may be carried out. A final subsampling stage acts as a Classifier, with an output being passed to one or more fully connected stages 106 to produce an output 108. In terms of the image processing pipeline shown in FIG. 1(b), note that the CNN is used primarily for purposes of object detection and object recognition, with other functions or operations being used to (pre)process an image (such as noise reduction, scaling, etc.) and to perform a decision making function.
In the traditional model of pattern/image recognition, a hand-designed feature extractor gathers relevant information from the input and eliminates irrelevant variabilities. A trainable classifier, a standard neural network that classifies feature vectors into classes, follows the extractor. In a CNN, convolution layers play the role of feature extractor, with the convolution filter kernel-weights being determined as part of the training process. Convolutional layers are able to extract the local features because they restrict the receptive fields of the hidden layers to be local.
In CNNs, the weights of the convolutional layer used for feature extraction, as well as the fully connected layer used for classification, are determined during a training process. The improved network structures of CNNs lead to savings in memory requirements and computation complexity requirements and, at the same time, give better performance for applications where the input has local correlation (e.g., images and speech).
By stacking multiple and different layers in a CNN, complex architectures are built for classification problems. Four types of layers are most common: convolution layers, pooling/subsampling layers, non-linear layers, and fully connected layers. The convolution operation extracts different features of the input. The first convolution layer extracts low-level features such as edges, lines, and corners; higher-level layers extract higher-level features. The pooling/subsampling layer operates to reduce the resolution of the features and makes the features more robust against noise and distortion. There are two ways to do pooling: max pooling and average pooling. Neural networks in general (and CNNs in particular) rely on a non-linear “trigger” function to signal distinct identification of likely features on each hidden layer. CNNs may use a variety of specific functions, such as rectified linear units (ReLUs) and continuous trigger (non-linear) functions, to efficiently implement this non-linear triggering function. Fully connected layers are often used as the final layers of a CNN. These layers mathematically sum a weighting of the previous layer of features, indicating the precise mix of factors to determine a specific target output result. In case of a fully connected layer, all of the elements of all the features of the previous layer are used in the calculation of each element of each output feature.
In addition to recent progress in the area of object recognition, advancements have been made in virtual reality, augmented reality, and “smart” wearable devices. These trends suggest that there is a market demand and need for implementing state-of-the-art image processing and object recognition in smart portable devices. However, conventional CNN-based recognition systems typically require relatively large amounts of memory and computational power to implement (as they typically require a large number of floating point calculations). Thus, while such CNN based systems perform well on expensive, GPU-based machines, they are often unsuitable for smaller devices such as cell/smart phones, tablet computing devices, and embedded electronic devices.
Conventional approaches to detecting objects in an image, while capable of yielding useful results, are not well suited for some applications and contexts. These include uses where certain of the image processing and object recognition operations are performed using less computational and data storage resources than conventional processors used for such tasks (such as GPUs). This is particularly true when computationally intensive approaches, such as conventional convolutional neural networks, are used for image processing. Thus, systems and methods are needed to provide for performing image processing and object recognition using CNNs that may be implemented on a resource-constrained device, such as a smart phone, tablet, embedded device, or similar environment.
Embodiments of the invention are directed toward solving these and other problems individually and collectively.