Artificial neural networks (ANNs), and convolutional neural networks (CNNs) in particular, have achieved great success in various fields. For example, in the field of computer vision (CV), CNNs are widely used and are regarded as most promising.
State-of-the-Art CNN Models
In ILSVRC 2012, the SuperVision team won first place in the image classification task using AlexNet, achieving 84.7% top-5 accuracy. CaffeNet is a replication of AlexNet with minor changes. Both AlexNet and CaffeNet consist of 5 convolutional (CONV) layers and 3 fully connected (FC) layers.
The Zeiler-and-Fergus (ZF) network achieved 88.8% top-5 accuracy and won first place in the image classification task of ILSVRC 2013. The ZF network also has 5 CONV layers and 3 FC layers.
FIG. 1 shows a typical convolutional neural network model.
As shown in FIG. 1, a typical CNN consists of a number of layers that run in sequence.
The parameters of a CNN model are called “weights”. The first layer of a CNN reads an input image and outputs a series of feature maps. The following layers read the feature maps generated by the previous layers and output new feature maps. Finally, a classifier outputs the probability of each category to which the input image might belong.
CONV layers and FC layers are the two essential types of layer in a CNN. CONV layers are usually followed by pooling layers.
For a CNN layer, $f_j^{in}$ denotes its j-th input feature map, $f_i^{out}$ denotes the i-th output feature map, and $b_i$ denotes the bias term to the i-th output map.
For CONV layers, $n_{in}$ and $n_{out}$ represent the number of input and output feature maps respectively.
For FC layers, $n_{in}$ and $n_{out}$ are the lengths of the input and output feature vectors.
A CONV layer takes a series of feature maps as input and convolves them with convolutional kernels to obtain the output feature maps.
A nonlinear layer, which applies a nonlinear activation function to each element of the output feature maps, is often attached to CONV layers.
The CONV layer can be expressed with Equation 1:
$$f_i^{out} = \sum_{j=1}^{n_{in}} f_j^{in} \otimes g_{i,j} + b_i \quad (1 \le i \le n_{out}) \qquad (1)$$
where $g_{i,j}$ is the convolutional kernel applied to the j-th input feature map and the i-th output feature map.
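To make Equation 1 concrete, the following is a minimal NumPy sketch of a CONV layer. It is illustrative only and rests on several assumptions not stated above: the operator ⊗ is implemented as the sliding-window product commonly used in CNNs (no kernel flip), there is no padding, and the stride is 1. The function names `conv2d_valid` and `conv_layer` and the array layouts are hypothetical.

```python
import numpy as np

def conv2d_valid(x, k):
    """Slide kernel k over feature map x (assumed: no padding, stride 1, no flip)."""
    kh, kw = k.shape
    oh, ow = x.shape[0] - kh + 1, x.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    for r in range(oh):
        for c in range(ow):
            out[r, c] = np.sum(x[r:r + kh, c:c + kw] * k)
    return out

def conv_layer(f_in, g, b):
    """Equation 1: f_out[i] = sum_j f_in[j] (x) g[i, j] + b[i].

    f_in: (n_in, H, W) input feature maps
    g:    (n_out, n_in, kh, kw) convolutional kernels
    b:    (n_out,) bias terms
    """
    n_out, n_in = g.shape[:2]
    f_out = []
    for i in range(n_out):
        # Accumulate the contribution of every input map to output map i.
        acc = sum(conv2d_valid(f_in[j], g[i, j]) for j in range(n_in))
        f_out.append(acc + b[i])
    return np.stack(f_out)
```

With two 3×3 input maps of ones, a single output map, 2×2 kernels of ones, and a bias of 0.5, every output element is 2 × 4 + 0.5 = 8.5.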
An FC layer applies a linear transformation to the input feature vector:
$$f^{out} = W f^{in} + b \qquad (2)$$
where W is an $n_{out} \times n_{in}$ transformation matrix and b is the bias term. It should be noted that, for the FC layer, the input is not a combination of several 2-D feature maps but just a feature vector. Consequently, in Equation 2, the parameters $n_{in}$ and $n_{out}$ actually correspond to the lengths of the input and output feature vectors.
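Equation 2 can be sketched in a few lines of NumPy. The function name `fc_layer` and the example values are hypothetical and chosen only for illustration.

```python
import numpy as np

def fc_layer(f_in, W, b):
    """Equation 2: f_out = W f_in + b.

    f_in: (n_in,) input feature vector
    W:    (n_out, n_in) transformation matrix
    b:    (n_out,) bias term
    """
    return W @ f_in + b

# Example with n_in = 3 and n_out = 2.
W = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0]])
f_out = fc_layer(np.ones(3), W, np.array([0.5, -0.5]))
```

Here the first output element is 1 + 2 + 3 + 0.5 = 6.5 and the second is 4 + 5 + 6 − 0.5 = 14.5.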
A pooling layer, which outputs the maximum or average value of each subarea in each feature map, is often attached to the CONV layer. Max-pooling can be expressed as Equation 3:
$$f_{i,j}^{out} = \max_{p \times p} \begin{pmatrix} f_{m,n}^{in} & \cdots & f_{m,n+p-1}^{in} \\ \vdots & & \vdots \\ f_{m+p-1,n}^{in} & \cdots & f_{m+p-1,n+p-1}^{in} \end{pmatrix} \qquad (3)$$
where p is the pooling kernel size. This non-linear “down sampling” not only reduces the feature map size and the computation for later layers, but also provides a form of translation invariance.
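Max-pooling as in Equation 3 can be sketched as follows. The text does not state how the output indices (i, j) map to the window origin (m, n); this sketch assumes non-overlapping windows by default, that is, a stride equal to p, so that m = i·p and n = j·p. The function name `max_pool` and the `stride` parameter are hypothetical.

```python
import numpy as np

def max_pool(f_in, p, stride=None):
    """Equation 3 on a single 2-D feature map.

    Assumption: stride defaults to p (non-overlapping p x p windows).
    """
    stride = p if stride is None else stride
    h, w = f_in.shape
    oh = (h - p) // stride + 1
    ow = (w - p) // stride + 1
    f_out = np.empty((oh, ow), dtype=f_in.dtype)
    for i in range(oh):
        for j in range(ow):
            m, n = i * stride, j * stride  # top-left corner of the window
            f_out[i, j] = f_in[m:m + p, n:n + p].max()
    return f_out

# A 4x4 map holding 0..15 is reduced to 2x2 with p = 2.
pooled = max_pool(np.arange(16).reshape(4, 4), p=2)
```

For this input the result is [[5, 7], [13, 15]], which shows the reduction from a 4×4 map to a 2×2 map described above.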
Compared to conventional algorithms, a CNN requires a larger amount of computation and higher bandwidth. In the prior art, a CNN is typically implemented on a CPU (central processing unit) or a GPU (graphics processing unit). However, neither a CPU nor a GPU is fully adapted to the characteristics of CNNs, which leads to lower computational efficiency and higher power consumption and cost.
Therefore, it is desirable to develop an accelerator for neural networks that addresses the above-mentioned problems.