Referring to FIG. 1, convolutional neural networks (CNNs) 10 usually comprise multiple layers, including convolutional layers 12-1, 12-2 and fully connected layers 14-1, 14-2, typically accompanied by pooling layers 16-1, 16-2, 16-3 and/or regularization tasks:
A Convolutional Layer convolves, for example, an input image or map “I” (in general nD) with a kernel “W” (in general (n+1)D) and adds a bias term “b” (in general nD) to it. The output is given by: P = I*W + b, where the * operator denotes (n+1)D convolution in general. Typically, n=3, but for time series applications, n could be 4. The convolution output P is then typically passed through an activation function. During training, the kernel and bias parameters are selected to optimize an error function of the network output. Convolution layers are used to create feature maps that, later in the processing chain, are interpreted by fully connected layers. Typically, multiple convolution layers are employed to generate in turn multiple feature maps.
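By way of illustration, the operation P = I*W + b can be sketched in Python/NumPy for the simplified case of a single-channel 2-D input and a 2-D kernel (a minimal sketch only; real layers operate on multi-channel nD maps, and deep-learning "convolution" is typically implemented as cross-correlation, i.e. without kernel flipping, as here):

```python
import numpy as np

def conv2d(I, W, b):
    """Naive 'valid' 2-D convolution of input map I with kernel W, plus bias b.

    Single-channel sketch of P = I*W + b; actual CNN layers use nD inputs
    and (n+1)D kernel stacks, and the kernel is not flipped here
    (cross-correlation, as is conventional in deep learning).
    """
    kh, kw = W.shape
    oh, ow = I.shape[0] - kh + 1, I.shape[1] - kw + 1
    P = np.empty((oh, ow))
    for y in range(oh):
        for x in range(ow):
            # Multiply-accumulate over the kernel window, then add the bias
            P[y, x] = np.sum(I[y:y+kh, x:x+kw] * W) + b
    return P

def relu(P):
    # Example activation function applied to the convolution output
    return np.maximum(P, 0.0)
```

The inner multiply-accumulate loop is the operation that dominates the computational cost discussed later in this section.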
A Fully Connected Layer is similar to a classical Neural Network (NN) layer, in which every neuron in a layer is connected to all the neurons in the subsequent layer. Each neuron computes the weighted sum of its inputs, and this is then passed through its activation function.
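The weighted-sum-plus-activation behaviour of a fully connected layer can be sketched as follows (a minimal sketch; tanh is used purely as an example activation, and the weight matrix W holds one row of weights per output neuron):

```python
import numpy as np

def fully_connected(x, W, b, activation=np.tanh):
    """Fully connected layer: each output neuron forms the weighted sum of
    all inputs x plus a bias, then applies an activation function."""
    return activation(W @ x + b)
```

For example, a layer mapping a flattened feature map of length N to 3 outputs (as in the last layer 14-2 of FIG. 1) would use a 3×N weight matrix W and a length-3 bias vector b.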
In the example of FIG. 1, the last fully connected layer 14-2 comprises 3 output neurons, each providing a value for each category of object the network is attempting to identify in an input image (or image window).
Note that fully connected layers can be regarded as a form of convolution layer in which the kernel size (width×height×channels) is equal to the size of the (multi-channel) map to which it is applied; for the purposes of simplicity, the term CNN is employed even when implementing such layers.
A Pooling Layer applies a (usually) non-linear transform (note that “average pooling” is a linear transform, but the more popular “max-pooling” operation is non-linear) to an input map to reduce the size of the data representation after a previous operation. It is common to place a pooling layer between two consecutive convolutional layers. Reducing the spatial size lowers the computational load, helps prevent over-fitting and adds a certain amount of translation invariance to a problem. These layers are particularly important for reducing the number of parameters, the computation and the size of the intermediate maps that need to be saved in temporary memory between layers.
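The max-pooling operation described above can be sketched as follows (a minimal 2-D sketch with non-overlapping windows; any trailing rows/columns that do not fill a complete window are simply dropped):

```python
import numpy as np

def max_pool2d(I, size=2):
    """Non-overlapping 2-D max-pooling: keeps the maximum of each
    size x size window, dividing each spatial dimension by `size`.
    Unlike average pooling, taking the maximum is non-linear."""
    h = I.shape[0] // size * size
    w = I.shape[1] // size * size
    # Reshape so each pooling window becomes its own pair of axes,
    # then take the maximum over those axes
    return I[:h, :w].reshape(h // size, size, w // size, size).max(axis=(1, 3))
```

With size=2 this quarters the number of map elements, which is the reduction in intermediate-layer memory and computation referred to above.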
Regularization (not illustrated in FIG. 1) prevents over-fitting inside a network. With regularization, one can train a more complex network (using more parameters) without over-fitting, whereas the same network would over-fit without it. Different kinds of regularization have been proposed, including weight regularization, the drop-out technique and batch normalization. Each has its own advantages and drawbacks which make it more suitable for specific applications.
As convolution layers require a large number of multiply-accumulate instructions, running CNNs on general purpose processing architectures (e.g. CPU/DSP) requires a large amount of processing power.
In order for modern application processors to support CNNs, there would be a need to: upgrade the memory subsystem (to provide much larger bandwidth); allow for larger power consumption (as a consequence); and upgrade processing power (to TFLOPs, in order to run a decent network in real-time).
To run the lightest typical object detector network architecture, a given system would need to have roughly one extra LPDDR4 (low power double data rate) 32 bit channel and an upgraded application processor capable of running anywhere from 100 GMACs (the equivalent of several quad-core ARM Cortex A75s) to several TMACs (several tens of ARM Cortex A75 cores).
Referring to FIG. 2, PCT Publication No. WO 2017/129325 (Ref: FN-481-PCT) discloses a programmable CNN (PCNN) engine 30′ providing a general purpose neural processor, capable of loading its networks (the equivalent of the program in a traditional CPU) through a memory channel 38 separate from the input image data 36 and results 39. The PCNN is an iterative processor that executes the networks layer by layer. The configuration of the layers, whether specifically convolutional, fully-connected, pooling or un-pooling, the weights types and values and the neurons' activation functions are all programmable via the network definition.
Use of such a processor 30′ in conjunction with a separate host processor 50 can reduce the power requirements for a system while enabling the network to operate in real-time on streams of images. However, the engine 30′ needs to be integrated within a system at manufacture.
Separately, it has been proposed to move all the logic and temporary memory required by a CNN to a high speed peripheral and build a co-processor with dedicated CNN accelerators and very large local memory bandwidth, for example, the Deep Learning Inference Accelerator (DLIA) from Intel Corporation; or the neural network accelerator on a USB stick provided by Movidius. However, these solutions require host kernel drivers and so are not readily deployed with existing systems.