1. Technical Field
The present invention relates to convolutional neural networks (CNNs) and more particularly to a system and method for configuring a coprocessor to accelerate CNNs.
2. Description of the Related Art
Convolutional neural networks (CNNs) have found widespread use in applications that extract intelligence from large amounts of raw data. Applications range from recognition and reasoning (such as handwriting recognition, facial expression recognition and video surveillance) to intelligent text applications (such as semantic text analysis and natural language processing). General-purpose processors do not fully exploit the abundant parallelism inherent in CNNs. As a result, even modern multi-core processors perform poorly because of the threading overheads they incur when exploiting the fine-grained parallelism in CNNs. Since many applications have stringent real-time performance and power constraints, conventional processors fail to adequately handle CNNs.
Artificial neural networks are computational models fashioned after the biological neural networks of the brain. They have found applications in a large number of diverse domains such as video and image processing, medical diagnosis systems and financial forecasting. These computational models serve one of two roles: pattern recognition, which provides a meaningful categorization of input patterns, or functional approximation, where the models find a smooth function that approximates the actual mapping between input and output patterns. These computational models are now embedding intelligence into applications by performing tasks such as recognizing scenes and semantics and interpreting content from unstructured data. Intelligent classifiers with online learning to perform content-based image retrieval, semantic text parsing and object recognition are expected to be key components of future server and embedded system workloads.
A vast majority of these computational models are still implemented in software on general-purpose, embedded processors. However, these processors do not fully exploit the parallelism inherent in these computational models. Consequently, numerous custom hardware implementations have been proposed, and all the methods attempt to parlay the abundant parallelism inherent in these computational models into significantly higher performance.
A special kind of multi-layer artificial neural network, the convolutional neural network (CNN), has found increasing use in several new applications. CNNs were designed to recognize visual patterns directly from pixel images with minimal preprocessing of the image. CNNs can recognize patterns with extreme variability (such as handwritten characters), and their recognition ability is not impaired by distortions or simple geometric transformations of the image. Although CNNs were originally intended to accomplish recognition tasks in documents and images, CNNs are now being successfully used in non-vision applications such as semantic analysis. This trend is dramatically increasing their breadth and applicability.
CNNs are a specific kind of neural network with a unique architecture. Traditionally, neural networks have multiple layers of neurons (an input layer, an output layer and one or more so-called hidden layers) where each neuron computes a weighted sum of all its inputs, followed by a non-linear function (such as a sigmoid) that restricts its output value to a reasonable range. CNNs are neural networks that use a small 1-D or 2-D array of weights that is convolved with the inputs. In contrast to traditional neural networks, the weights are shared across several inputs.
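The contrast between a traditional neuron and a shared-weight kernel can be illustrated with a minimal Python sketch. The function names (`dense_neuron`, `shared_kernel_1d`), the choice of a logistic sigmoid, and the sample values are assumptions made for illustration only and do not come from the description above. The kernel is applied as a sliding weighted sum without flipping, which is a common convention for CNNs.

```python
import math

def sigmoid(x):
    # Logistic sigmoid (an assumed choice of non-linearity); output lies in (0, 1).
    return 1.0 / (1.0 + math.exp(-x))

def dense_neuron(inputs, weights):
    # Traditional neuron: one distinct weight per input, then the non-linearity.
    return sigmoid(sum(w * x for w, x in zip(weights, inputs)))

def shared_kernel_1d(inputs, kernel):
    # CNN-style units: the same small kernel slides across the inputs,
    # so every output unit reuses the same len(kernel) weights.
    k = len(kernel)
    return [
        sigmoid(sum(kernel[j] * inputs[i + j] for j in range(k)))
        for i in range(len(inputs) - k + 1)
    ]

inputs = [0.0, 1.0, 2.0, 3.0, 4.0]          # hypothetical sample inputs
dense_out = dense_neuron(inputs, [0.1] * 5)  # 5 distinct weights for one unit
conv_out = shared_kernel_1d(inputs, [0.5, -0.5])  # 2 shared weights, 4 units
```

In this sketch the dense unit needs as many weights as there are inputs, whereas the four kernel-based units together use only two weights.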
The fact that CNNs use small kernels of weights forces the extraction of local features by restricting the receptive fields of hidden units to be local. In other words, only a limited number of inputs determine the value of the hidden unit. As an example, consider one layer of a multi-layer neural network and the task of connecting 100 inputs (organized as a 10×10 image) in the input layer of the artificial neural network to 36 hidden units (organized as a 6×6 image) that are part of one feature map. A typical neural network would connect every input to every hidden unit, and the network ends up with 100×36=3600 different weights.
In a CNN architecture, each hidden unit depends only on a small number of inputs, say 25 inputs. Then, only 25 weights are necessary to connect the inputs to a hidden unit. The same weights are then re-used for the remaining hidden units, with the receptive field of each hidden unit being restricted to a set of 25 inputs. Receptive fields of neighboring hidden units overlap, and the degree of overlap can be pre-specified. The total number of weighted connections in the CNN is 36×25=900, but because the weights are shared there are only 25 distinct weights. If we represent the 25 distinct weights as a 5×5 matrix (also called a kernel matrix), then the hidden units (6×6 image) can be computed as the convolution of the 10×10 input image with the 5×5 kernel matrix. After the convolution step, the value of every hidden unit is subjected to a squashing function (non-linearity). At this point, we have a 6×6 image in which a feature of interest has been detected, and the exact location of the feature becomes less important. A simple way to reduce the precision with which the distinctive features are encoded in the image is to reduce the spatial resolution of the image by using sub-sampling. A typical CNN has multiple layers of hidden units.
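The 10×10 input, 5×5 kernel and 6×6 feature map described above can be reproduced with a short, dependency-free Python sketch. Several details are assumptions rather than part of the description: a stride of 1 with no padding, a kernel applied without flipping, `tanh` as the squashing function, 2×2 averaging as the sub-sampling step, and the sample image and kernel values. All function and variable names are hypothetical.

```python
import math

def conv2d_valid(image, kernel):
    # Sliding-window weighted sum, stride 1, no padding, kernel not flipped.
    ih, iw = len(image), len(image[0])
    kh, kw = len(kernel), len(kernel[0])
    out = []
    for r in range(ih - kh + 1):
        row = []
        for c in range(iw - kw + 1):
            acc = 0.0
            for i in range(kh):
                for j in range(kw):
                    acc += image[r + i][c + j] * kernel[i][j]
            row.append(acc)
        out.append(row)
    return out

def squash(feature_map):
    # Squashing non-linearity; tanh is an assumed choice.
    return [[math.tanh(v) for v in row] for row in feature_map]

def subsample(feature_map, factor=2):
    # Reduce spatial resolution by averaging non-overlapping blocks (assumed scheme).
    rows, cols = len(feature_map) // factor, len(feature_map[0]) // factor
    return [
        [
            sum(feature_map[r * factor + i][c * factor + j]
                for i in range(factor) for j in range(factor)) / (factor * factor)
            for c in range(cols)
        ]
        for r in range(rows)
    ]

# Hypothetical 10x10 input image and 5x5 kernel of 25 shared weights.
image = [[(r * 10 + c) / 100.0 for c in range(10)] for r in range(10)]
kernel = [[0.04] * 5 for _ in range(5)]

hidden = squash(conv2d_valid(image, kernel))  # 6x6 feature map
pooled = subsample(hidden)                    # 3x3 after 2x2 sub-sampling

# Weight counts from the text.
fully_connected_weights = 100 * 36                        # 3600
cnn_connections = len(hidden) * len(hidden[0]) * 25       # 900
distinct_cnn_weights = len(kernel) * len(kernel[0])       # 25
```

The output size follows from 10 − 5 + 1 = 6 in each dimension, which is why a 5×5 kernel on a 10×10 image yields the 36 hidden units of the example.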
Evaluating a trained CNN (also known as a feed-forward network) involves performing several convolutions with considerable data movement. Convolutions constitute the performance bottleneck, and orchestration of data movement is extremely difficult. The performance bottleneck can be severe in typical scenarios where a CNN may be employed. Consider a simple face recognition application that is used on relatively high-resolution streaming images. With a 320×240 (QVGA) image, a feed-forward CNN that can be used to identify faces within all possible 32×32 windows in the image runs at approximately 6.5 frames per second on a 2.5 GHz Intel Xeon processor when optimized using BLAS (Intel MKL v11). Multi-threading this to 4 and 8 cores on quad-core and dual quad-core machines improves the speed by only a little over 2×, because of synchronization overheads and the fact that different threads share common inputs. Therefore, the most optimized software implementation on state-of-the-art processors achieves about 13 frames per second for a QVGA image, and this speed will decrease by a factor of 4 (to about 3 frames per second) when VGA (640×480) images, which contain four times as many pixels, are analyzed. VGA (or larger) images are more realistic in practical use-case scenarios such as security cameras.
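The frame-rate figures above can be checked with a few lines of Python arithmetic. This is only a back-of-the-envelope sketch: it assumes that throughput scales inversely with pixel count, that the multi-threaded speedup is exactly 2×, and that "all possible 32×32 windows" means a stride of one pixel. The variable and function names are hypothetical.

```python
def sliding_windows(width, height, window=32):
    # Number of window x window positions at a stride of 1 pixel (assumed).
    return (width - window + 1) * (height - window + 1)

qvga = (320, 240)
vga = (640, 480)

single_thread_fps = 6.5        # QVGA figure quoted in the text
multithread_speedup = 2.0      # "a little over 2x", rounded down here
qvga_fps = single_thread_fps * multithread_speedup        # about 13 fps

pixel_ratio = (vga[0] * vga[1]) / (qvga[0] * qvga[1])     # 4.0
vga_fps = qvga_fps / pixel_ratio                          # about 3 fps

qvga_windows = sliding_windows(*qvga)
vga_windows = sliding_windows(*vga)
```

Under these assumptions the estimate for VGA is 3.25 frames per second, consistent with the "about 3 frames per second" stated above. The window count grows slightly faster than the pixel count, so a pixel-based estimate is, if anything, mildly optimistic.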