Introduction to RNN
A Recurrent Neural Network (RNN) is a class of artificial neural network in which connections between units form a directed cycle. This creates an internal state of the network, which allows it to exhibit dynamic temporal behavior. RNNs can handle variable-length sequences by maintaining a recurrent hidden state whose activation at each time step depends on that of the previous time step.
Traditionally, the standard RNN computes the hidden layer at the next step as:
$$h_t = f\left(W^{(hh)} h_{t-1} + W^{(hx)} x_t\right)$$
where $f$ is a smooth, bounded function, such as a logistic sigmoid function or a hyperbolic tangent function. $W^{(hh)}$ is the state-to-state recurrent weight matrix, and $W^{(hx)}$ is the input-to-hidden weight matrix.
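The recurrence above can be sketched in a few lines of plain Python. This is a minimal illustration rather than a reference implementation: the helper names (`matvec`, `rnn_step`, `rnn_forward`), the choice of tanh for f, the zero initial state, and the weight values are all assumptions made for the example.

```python
import math

def matvec(W, v):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def rnn_step(W_hh, W_hx, h_prev, x_t, f=math.tanh):
    """One recurrence step: h_t = f(W_hh h_{t-1} + W_hx x_t)."""
    rec = matvec(W_hh, h_prev)   # contribution of the previous hidden state
    inp = matvec(W_hx, x_t)      # contribution of the current input
    return [f(r + i) for r, i in zip(rec, inp)]

def rnn_forward(W_hh, W_hx, xs, h0):
    """Apply the recurrence over a sequence xs of arbitrary length."""
    h, states = h0, []
    for x_t in xs:
        h = rnn_step(W_hh, W_hx, h, x_t)
        states.append(h)
    return states

# Made-up weights: 2 hidden units, 1 input feature.
W_hh = [[0.5, 0.0], [0.0, 0.5]]
W_hx = [[1.0], [-1.0]]
states = rnn_forward(W_hh, W_hx, xs=[[1.0], [0.0], [2.0]], h0=[0.0, 0.0])
```

Because the same weights are reused at every step, the same function handles a sequence of any length; only the number of loop iterations changes.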
The input sequence is $x = (x_1, \ldots, x_T)$. The probability of a sequence of arbitrary length can be factorized as:
$$p(x_1, \ldots, x_T) = p(x_1)\, p(x_2 \mid x_1)\, p(x_3 \mid x_1, x_2) \cdots p(x_T \mid x_1, \ldots, x_{T-1})$$
Then, as shown in FIG. 1, we can train an RNN to model this probability distribution and predict the probability of the next symbol $x_{t+1}$, given the hidden state $h_t$, which is a function of all the previous symbols $x_1, x_2, \ldots, x_t$:
$$p(x_{t+1} \mid x_1, \ldots, x_t) = f(h_t)$$
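As a small illustration of this factorization, the sketch below scores a sequence by summing the log-probabilities of each symbol given its history. The function `next_symbol_probs` is a hypothetical stand-in for f(h_t): a trained RNN would compute it from its hidden state, whereas here a fixed toy rule over a two-symbol vocabulary is assumed.

```python
import itertools
import math

VOCAB = ("a", "b")

def next_symbol_probs(history):
    """Hypothetical stand-in for f(h_t): returns p(next symbol | history).

    Toy rule (an assumption for this example): the first symbol is uniform,
    and afterwards the last symbol is repeated with probability 0.8.
    """
    if not history:
        return {"a": 0.5, "b": 0.5}
    last = history[-1]
    return {s: (0.8 if s == last else 0.2) for s in VOCAB}

def sequence_log_prob(seq):
    """log p(x_1..x_T) = sum over t of log p(x_t | x_1..x_{t-1})."""
    total = 0.0
    for t, symbol in enumerate(seq):
        total += math.log(next_symbol_probs(seq[:t])[symbol])
    return total

# p(a, a, b) = p(a) * p(a | a) * p(b | a, a) = 0.5 * 0.8 * 0.2
p_aab = math.exp(sequence_log_prob(("a", "a", "b")))

# Summing over every length-3 sequence checks that the factorization
# defines a proper probability distribution.
total_mass = sum(math.exp(sequence_log_prob(s))
                 for s in itertools.product(VOCAB, repeat=3))
```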
The hidden layer activations are computed by iterating the following equations from t=1 to T and from n=2 to N:
$$h_t^1 = \mathcal{H}\left(W_{ih^1} x_t + W_{h^1h^1} h_{t-1}^1 + b_h^1\right)$$
$$h_t^n = \mathcal{H}\left(W_{ih^n} x_t + W_{h^{n-1}h^n} h_t^{n-1} + W_{h^nh^n} h_{t-1}^n + b_h^n\right)$$
where the $W$ terms denote weight matrices (e.g. $W_{ih^n}$ is the weight matrix applied to the inputs to the nth hidden layer, $W_{h^1h^1}$ is the recurrent weight matrix at the first hidden layer, and so on), the $b$ terms denote bias vectors (e.g. $b_y$ is the output bias vector) and $\mathcal{H}$ is the hidden layer function.
Given the hidden sequences, the output sequence is computed as follows:
$$\hat{y}_t = b_y + \sum_{n=1}^{N} W_{h^n y} h_t^n$$
$$y_t = \mathcal{Y}(\hat{y}_t)$$
where $\mathcal{Y}$ is the output layer function. The complete network therefore defines a function, parameterized by the weight matrices, from input histories $x_{1:t}$ to output vectors $y_t$.
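The stacked hidden-layer and output computations can be sketched as follows. This is an illustrative sketch under stated assumptions: the dictionary keys (`W_ih`, `W_hh`, `W_below`, `b_h`) are hypothetical names, the hidden layer function is taken to be tanh, the output layer function is taken to be the identity, and all hidden states are assumed to start at zero.

```python
import math

def matvec(W, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def add(*vectors):
    """Element-wise sum of equal-length vectors."""
    return [sum(vals) for vals in zip(*vectors)]

def deep_rnn_forward(xs, layers, W_hy, b_y, H=math.tanh, Y=lambda v: v):
    """Stacked RNN forward pass.

    layers[n] is a dict with W_ih (input to layer), W_hh (recurrent),
    b_h (bias) and, for every layer after the first, W_below (weights
    from the layer beneath). W_hy[n] maps layer n to the output.
    """
    h_prev = [[0.0] * len(layer["b_h"]) for layer in layers]  # assumed zero init
    outputs = []
    for x_t in xs:
        h_t = []
        for n, layer in enumerate(layers):
            terms = [matvec(layer["W_ih"], x_t),
                     matvec(layer["W_hh"], h_prev[n]),
                     layer["b_h"]]
            if n > 0:  # layers above the first also see the layer below
                terms.append(matvec(layer["W_below"], h_t[n - 1]))
            h_t.append([H(a) for a in add(*terms)])
        # Output pre-activation: bias plus contributions from every layer.
        y_hat = add(b_y, *[matvec(W_hy[n], h_t[n]) for n in range(len(layers))])
        outputs.append(Y(y_hat))
        h_prev = h_t
    return outputs

# Made-up two-layer network with one unit per layer and one input feature.
layers = [
    {"W_ih": [[1.0]], "W_hh": [[0.0]], "b_h": [0.0]},
    {"W_ih": [[0.0]], "W_hh": [[0.0]], "W_below": [[1.0]], "b_h": [0.0]},
]
outputs = deep_rnn_forward([[0.5]], layers, W_hy=[[[1.0]], [[1.0]]], b_y=[0.0])
```

With these weights the first layer yields tanh(0.5), the second yields tanh(tanh(0.5)), and the output is their sum, which makes the data flow between layers easy to trace by hand.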
FIG. 2 shows a simplified basic network frame of RNNs, wherein the output of the hidden layer at the previous time step is an input of the hidden layer at the present time step. That is, the output at the present time step depends on both the hidden state of the previous time step and the input of the present time step.
Compression of Neural Networks
In recent years, the scale of neural networks has grown rapidly. Advanced neural network models may have billions of connections, and their implementation is both computation-intensive and memory-intensive.
The conventional solutions typically use a general-purpose CPU or GPU (graphics processing unit) to realize the related algorithms. However, it is unclear how much further the processing capabilities of conventional chips such as CPUs and GPUs can be improved, as Moore's Law may eventually cease to hold. Thus, it is of critical importance to compress neural networks into smaller-scale neural networks so as to reduce computation and memory consumption.
On the other hand, a customized circuit can address the above-mentioned problem, since a customized processor can achieve a better acceleration ratio when implementing a sparse neural network model.
One purpose of the present invention is to provide a customized hardware accelerator with a parallelized pipeline design. The hardware accelerator is especially suitable for sparse neural networks, and can achieve better computation efficiency while reducing processing delay.
CRS and CCS
A sparse matrix, which is typically obtained through compression, is desirably encoded in order to further reduce memory requirements. It has been proposed to encode and store a sparse matrix in Compressed Row Storage (CRS) or Compressed Column Storage (CCS) format.
In the present application, in order to exploit the sparsity of compressed neural networks, the sparse weight matrix W may be encoded and stored in a variation of the compressed column storage (CCS) format.
For each column $W_j$ of matrix W, the format stores a vector v that contains the non-zero weights, and a second, equal-length vector z. Vector z encodes the number of zeros before the corresponding entry in v. Each entry of v and z is represented by a four-bit value. If more than 15 zeros appear before a non-zero entry, a padding zero is added to vector v.
For example, the column [0,0,1,2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,3] is encoded as v=[1,2,0,3], z=[2,0,15,2].
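The encoding rule can be sketched in Python as follows. The function names `encode_column` and `decode_column` are hypothetical, and the sketch makes two assumptions not stated in the text: trailing zeros after the last non-zero entry are not stored (the column length is known separately), and the four-bit limit on the values in v is not enforced.

```python
def encode_column(column, max_run=15):
    """Encode one column as (v, z), where z holds 4-bit zero-run counts."""
    v, z = [], []
    zeros = 0  # zeros seen since the last stored entry
    for value in column:
        if value == 0:
            if zeros == max_run:
                # A run longer than 15 cannot fit in four bits, so this
                # zero is stored explicitly in v as a padding entry.
                v.append(0)
                z.append(max_run)
                zeros = 0
            else:
                zeros += 1
        else:
            v.append(value)
            z.append(zeros)
            zeros = 0
    return v, z

def decode_column(v, z, length):
    """Rebuild the dense column from (v, z) and the known column length."""
    column = [0] * length
    pos = 0
    for value, run in zip(v, z):
        pos += run            # skip the encoded run of zeros
        column[pos] = value   # padding entries simply write a zero
        pos += 1
    return column

# The column from the example: two zeros, 1, 2, eighteen zeros, 3.
column = [0, 0, 1, 2] + [0] * 18 + [3]
v, z = encode_column(column)
restored = decode_column(v, z, len(column))
```

Here the run of eighteen zeros is split into a count of 15, one padding zero stored in v, and a remaining count of 2, which reproduces the v and z given in the example.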
The v and z vectors of all columns are stored in one large pair of arrays, with a pointer vector p pointing to the beginning of the vector for each column. A final entry in p points one beyond the last vector element, so that the number of non-zeros in column j (including padded zeros) is given by $p_{j+1} - p_j$.
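A minimal sketch of how the pointer vector might be built is given below. The names `encode_column` and `encode_matrix` are hypothetical, the matrix is assumed to be supplied as a list of columns, and the small 4x3 matrix is made up for illustration.

```python
def encode_column(column, max_run=15):
    """Encode one column as (v, z) with zero-run counts capped at max_run."""
    v, z, zeros = [], [], 0
    for value in column:
        if value == 0 and zeros < max_run:
            zeros += 1
        else:
            # Either a non-zero weight, or a padding zero for a long run.
            v.append(value)
            z.append(zeros)
            zeros = 0
    return v, z

def encode_matrix(columns):
    """Concatenate per-column (v, z) pairs and record column starts in p."""
    v_all, z_all, p = [], [], [0]
    for column in columns:
        v, z = encode_column(column)
        v_all.extend(v)
        z_all.extend(z)
        p.append(len(v_all))  # one past the last entry of this column
    return v_all, z_all, p

# Made-up 4x3 matrix, given column by column.
columns = [
    [0, 5, 0, 0],
    [0, 0, 0, 0],   # an empty column stores nothing
    [7, 0, 0, 9],
]
v_all, z_all, p = encode_matrix(columns)
entries_per_column = [p[j + 1] - p[j] for j in range(len(columns))]
```

For this matrix the arrays come out as v_all = [5, 7, 9], z_all = [1, 0, 2] and p = [0, 1, 1, 3], so the empty middle column is represented only by two equal consecutive pointers.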
Storing the sparse matrix in CCS format makes it easy to exploit matrix sparsity. The computation simply multiplies each non-zero activation by all of the non-zero elements in its corresponding column.
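The multiplication described here can be sketched as a software loop over the stored arrays. This is only an illustration of the access pattern, not the accelerator's design: the function name `ccs_matvec` is hypothetical, and the v, z and p arrays are hand-encoded from a made-up 4x3 matrix.

```python
def ccs_matvec(v, z, p, a, num_rows):
    """Compute y = W a, with W stored as CCS-style (v, z, p) arrays."""
    y = [0.0] * num_rows
    for j, a_j in enumerate(a):
        if a_j == 0:
            continue  # zero activations are skipped entirely
        row = 0
        for k in range(p[j], p[j + 1]):  # stored entries of column j
            row += z[k]                  # skip the run of zeros
            y[row] += v[k] * a_j         # padding zeros add nothing
            row += 1
    return y

# Hand-encoded arrays for the made-up matrix
#   [[0, 0, 7],
#    [5, 0, 0],
#    [0, 0, 0],
#    [0, 0, 9]]
v = [5, 7, 9]
z = [1, 0, 2]
p = [0, 1, 1, 3]

y = ccs_matvec(v, z, p, a=[2.0, 0.0, 1.0], num_rows=4)
```

Only three multiplications are performed for this input, one per stored weight in the columns whose activation is non-zero, instead of the twelve a dense product would need.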
For further details regarding the storage of a sparse matrix, please refer to U.S. Pat. No. 9,317,482, UNIVERSAL FPGA/ASIC MATRIX-VECTOR MULTIPLICATION ARCHITECTURE. This patent proposes a sparse matrix representation, referred to as the Compressed Variable Length Bit Vector (CVBV) format, which takes advantage of the capabilities of FPGAs and reduces storage and bandwidth requirements across the matrices. It also discloses a class of sparse matrix formats that are better suited for FPGA implementations in reducing storage and bandwidth requirements. A partitioned CVBV format is described to enable parallel decoding.