Compression of Artificial Neural Networks
Artificial Neural Networks (ANNs), also called NNs, are distributed parallel information processing models that imitate the behavioral characteristics of animal neural networks. In recent years, research on ANNs has advanced rapidly, and ANNs have been widely applied in fields such as image recognition, speech recognition, natural language processing, weather forecasting, gene expression, and content pushing.
A neural network contains a large number of interconnected nodes (also called neurons). Neural networks have two features: 1) each neuron computes its output from the weighted input values received from adjacent neurons via a certain output function (also called an activation function); 2) the strength of information transmission between neurons is measured by so-called weights, which can be adjusted through self-learning by certain algorithms.
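The first feature can be illustrated with a minimal sketch of a single neuron. The logistic sigmoid is used here only as one possible activation function, and the input values, weights, and function names are hypothetical.

```python
import numpy as np

def sigmoid(z):
    """Logistic function, used here only as an example activation function."""
    return 1.0 / (1.0 + np.exp(-z))

def neuron_output(inputs, weights, bias=0.0):
    """Weighted sum of the inputs from adjacent neurons, passed through the activation."""
    return sigmoid(np.dot(weights, inputs) + bias)

x = np.array([1.0, 0.0, -1.0])   # hypothetical outputs of three adjacent neurons
w = np.array([0.5, 0.3, 0.5])    # hypothetical connection weights
y = neuron_output(x, w)          # the weighted sum is 0.0, so the sigmoid gives 0.5
```

Adjusting the entries of `w` during training is what the second feature refers to.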
Early neural networks have only two layers: the input layer and the output layer. Thus, these neural networks cannot process complex logic, limiting their practical use.
As shown in FIG. 1, Deep Neural Networks (DNNs) address this defect by adding a hidden intermediate layer between the input layer and the output layer.
Moreover, Recurrent Neural Networks (RNNs) are commonly used DNN models. They differ from conventional feed-forward neural networks in that they introduce directed loops and can therefore process forward-backward correlations between inputs. In speech recognition in particular, there are strong forward-backward correlations between input signals. For example, one word in a series of voice signals is closely related to its preceding word. Thus, RNNs have been widely applied in the speech recognition domain.
However, the scale of neural networks is exploding due to rapid developments in recent years. Some advanced neural network models have hundreds of layers and billions of connections, and their implementation is both computation-intensive and memory-intensive. As neural networks become larger, it is critical to compress neural network models to a smaller scale.
For example, in DNNs, the connection relations between neurons can be expressed mathematically as a series of matrices. Although a well-trained neural network is accurate in prediction, its matrices are dense. That is, the matrices are filled with non-zero elements, which consumes extensive storage and computation resources, reduces computational speed, and increases costs. Deploying DNNs on mobile terminals therefore faces huge challenges, which significantly restricts the practical use and development of neural networks.
FIG. 2 shows a compression method which was proposed by one of the inventors in earlier works.
As shown in FIG. 2, the compression method comprises three steps: learning connectivity, pruning, and retraining. In the first step, the network learns which connections are important through training. In the second step, the low-weight connections are pruned. In the third step, the network is retrained to fine-tune the weights of the remaining connections. Studies in recent years show that, in the matrix of a trained neural network model, elements with larger weights represent important connections, while elements with smaller weights have relatively little impact and can be removed (e.g., set to zero). Pruning the low-weight connections thus converts a dense network into a sparse network.
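The pruning step can be sketched as a magnitude threshold applied to a weight matrix. This is a minimal illustration, not the method of FIG. 2 itself: the function name `prune_by_magnitude`, the `sparsity` parameter, and the example matrix are assumptions introduced here.

```python
import numpy as np

def prune_by_magnitude(weights, sparsity):
    """Zero out the fraction `sparsity` of entries with the smallest absolute value.

    Returns the pruned matrix and a boolean mask of the surviving connections.
    Entries tied with the threshold are pruned as well.
    """
    flat = np.sort(np.abs(weights).ravel())
    k = int(sparsity * flat.size)          # number of entries to remove
    if k == 0:
        mask = np.ones(weights.shape, dtype=bool)
    else:
        threshold = flat[k - 1]            # largest magnitude that gets pruned
        mask = np.abs(weights) > threshold
    return weights * mask, mask

# Hypothetical 2x3 weight matrix of a trained layer.
W = np.array([[0.9, -0.05, 0.4],
              [0.01, -0.7, 0.02]])
W_pruned, mask = prune_by_magnitude(W, sparsity=0.5)
# The three smallest-magnitude weights (0.01, 0.02, -0.05) are set to zero.
```

The returned mask records which connections survive, which is the information a later retraining step needs in order to keep the pruned weights at zero.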
FIG. 3 shows synapses and neurons before and after pruning according to the method proposed in FIG. 2.
The final step of FIG. 2 involves retraining the sparse network to learn the final weights for the remaining sparse connections. By retraining the sparse network, the remaining weights in the matrix can be adjusted, ensuring that the accuracy of the network will not be compromised.
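A minimal sketch of retraining under a fixed sparsity pattern follows. It assumes a toy linear model trained with plain gradient descent on a mean squared error; the mask, data, learning rate, and function names are hypothetical and only illustrate that the pruned weights stay at zero while the remaining weights are adjusted.

```python
import numpy as np

def masked_sgd_step(weights, grad, mask, lr=0.1):
    """One gradient step that updates only the surviving (unpruned) weights."""
    return (weights - lr * grad) * mask

rng = np.random.default_rng(0)
X = rng.normal(size=(64, 4))                     # toy inputs
W_true = np.array([[1.0, 0.0, -2.0, 0.0]])       # toy target weights
Y = X @ W_true.T

mask = np.array([[True, False, True, False]])    # hypothetical pruning mask
W = np.array([[0.5, 0.0, -0.5, 0.0]])            # weights right after pruning

def loss(W):
    """Mean squared error of the toy linear model."""
    return float(np.mean((X @ W.T - Y) ** 2))

loss_before = loss(W)
for _ in range(200):
    grad = 2.0 * (X @ W.T - Y).T @ X / len(X)    # gradient of the mean squared error
    W = masked_sgd_step(W, grad, mask)
loss_after = loss(W)                             # lower than loss_before
```

Multiplying by the mask after every update is one simple way to keep the sparse structure fixed; other schemes are possible.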
By compressing a dense neural network into a sparse neural network, the amount of computation and storage can be effectively reduced, which accelerates the running of an ANN while maintaining its accuracy. Compression of neural network models is especially important for specialized sparse neural network accelerators.
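The saving can be made concrete with a small sketch that stores a pruned matrix in compressed sparse row (CSR) form. CSR is an assumption chosen for illustration and is not named in this document; the matrix size and the 90% sparsity level are likewise hypothetical.

```python
import numpy as np

def to_csr(dense):
    """Convert a dense matrix to compressed sparse row (CSR) arrays."""
    values, col_idx, row_ptr = [], [], [0]
    for row in dense:
        nz = np.nonzero(row)[0]            # column positions of non-zero weights
        values.extend(row[nz])
        col_idx.extend(nz)
        row_ptr.append(len(values))        # where the next row starts
    return (np.array(values, dtype=np.float32),
            np.array(col_idx, dtype=np.int32),
            np.array(row_ptr, dtype=np.int32))

def csr_matvec(values, col_idx, row_ptr, x):
    """Multiply a CSR matrix by a vector, touching only the non-zero weights."""
    y = np.zeros(len(row_ptr) - 1, dtype=np.float32)
    for r in range(len(y)):
        start, end = row_ptr[r], row_ptr[r + 1]
        y[r] = np.dot(values[start:end], x[col_idx[start:end]])
    return y

rng = np.random.default_rng(0)
dense = rng.normal(size=(256, 256)).astype(np.float32)
dense[rng.random(dense.shape) < 0.9] = 0.0       # hypothetical 90% sparsity after pruning
x = rng.normal(size=256).astype(np.float32)

values, col_idx, row_ptr = to_csr(dense)
dense_bytes = dense.nbytes
sparse_bytes = values.nbytes + col_idx.nbytes + row_ptr.nbytes
y_sparse = csr_matvec(values, col_idx, row_ptr, x)   # same result as dense @ x
```

The number of multiply-accumulate operations drops from `dense.size` to `len(values)`, and `sparse_bytes` is smaller than `dense_bytes` despite the extra index arrays.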
Speech Recognition Engine
Speech recognition is a field in which ANNs are widely applied. Speech recognition sequentially maps the analogue signals of a language to a specific set of words. In recent years, methods applying ANNs have achieved much better results than conventional methods in the speech recognition domain and have become the mainstream in the industry. In particular, DNNs have been widely applied in the speech recognition domain.
As a practical example of using DNNs, a general framework of a speech recognition engine is shown in FIG. 4.
The model shown in FIG. 4 involves computing the acoustic output probability using a deep learning model, that is, predicting the similarity between a series of input speech signals and various possible candidates. Running the DNN in FIG. 4 can be accelerated via an FPGA, for example.
FIG. 5 shows a deep learning model applied in the speech recognition engine of FIG. 4.
More specifically, FIG. 5(a) shows a deep learning model including a CNN (Convolutional Neural Network) module, an LSTM (Long Short-Term Memory) module, a DNN (Deep Neural Network) module, a Softmax module, etc.
FIG. 5(b) is a deep learning model where the present invention can be applied, which uses multi-layer LSTM.
In the network model shown in FIG. 5(b), the input of the network is a section of voice. For example, a voice of about 1 second is cut into about 100 frames in sequence, and the characteristics of each frame are represented by a float type vector.
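The framing arithmetic can be sketched as follows. About 100 frames per second corresponds to a frame shift of about 10 ms; the 16 kHz sample rate and 25 ms frame length used here are assumptions, not values given in this document, and the feature extraction that turns each frame into its characteristic vector is not shown.

```python
import numpy as np

def split_into_frames(signal, sample_rate=16000, frame_ms=25, hop_ms=10):
    """Cut a 1-D signal into overlapping frames (all parameter values are assumptions)."""
    frame_len = sample_rate * frame_ms // 1000   # samples per frame
    hop_len = sample_rate * hop_ms // 1000       # samples between frame starts
    n_frames = 1 + (len(signal) - frame_len) // hop_len
    return np.stack([signal[i * hop_len : i * hop_len + frame_len]
                     for i in range(n_frames)])

signal = np.zeros(16000, dtype=np.float32)       # placeholder for 1 second of audio
frames = split_into_frames(signal)               # shape (98, 400), i.e. about 100 frames
```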
LSTM
Further, in order to solve the problem of long-term information storage, Hochreiter & Schmidhuber proposed the Long Short-Term Memory (LSTM) model in 1997.
FIG. 6 shows an LSTM network model applied in speech recognition. The LSTM neural network is one type of RNN, in which the simple repetitive neural network modules of a normal RNN are replaced with modules having complex interconnecting relations. LSTM neural networks have achieved very good results in speech recognition.
For more details on LSTM, reference can be made to the following two published papers: Sak H, Senior A W, Beaufays F. Long short-term memory recurrent neural network architectures for large scale acoustic modeling[C]//INTERSPEECH. 2014: 338-342; Sak H, Senior A, Beaufays F. Long short-term memory based recurrent neural network architectures for large vocabulary speech recognition[J]. arXiv preprint arXiv:1402.1128, 2014.
As mentioned above, LSTM is one type of RNN. The main difference between RNNs and DNNs lies in that RNNs are time-dependent. More specifically, for RNNs, the input at time T depends on the output at time T-1. That is, calculation of the current frame depends on the calculated result of the previous frame.
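This time dependence can be illustrated with a minimal sketch of a plain (non-LSTM) recurrent layer. The tanh activation, the weight names `W_x` and `W_h`, and the layer sizes are assumptions made for illustration.

```python
import numpy as np

def rnn_forward(xs, W_x, W_h, b):
    """Process frames in order; each step reuses the output of the previous frame."""
    h = np.zeros(W_h.shape[0])                   # output before the first frame
    outputs = []
    for x_t in xs:                               # frames cannot be computed independently
        h = np.tanh(W_x @ x_t + W_h @ h + b)     # current frame plus previous output
        outputs.append(h)
    return np.stack(outputs)

rng = np.random.default_rng(0)
W_x = rng.normal(scale=0.5, size=(3, 2))         # hypothetical input weights
W_h = rng.normal(scale=0.5, size=(3, 3))         # hypothetical recurrent weights
b = np.zeros(3)
xs = rng.normal(size=(4, 2))                     # four frames with two features each
out = rnn_forward(xs, W_x, W_h, b)               # shape (4, 3)
```

Because of the loop-carried state `h`, changing the first frame changes the outputs of all later frames, which is why the frames of one utterance have to be processed sequentially.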
In the LSTM architecture of FIG. 6:
Symbol i represents the input gate i which controls the flow of input activations into the memory cell;
Symbol o represents the output gate o which controls the output flow of cell activations into the rest of the network;
Symbol f represents the forget gate which scales the internal state of the cell before adding it as input to the cell, therefore adaptively forgetting or resetting the cell's memory;
Symbol g represents the characteristic input of the cell;
The bold lines represent the output of the previous frame;
Each gate has a weight matrix, and the computation on the input at time T and the output at time T-1 at the gates is relatively intensive;
The dashed lines represent peephole connections. The operations corresponding to the peephole connections and to the three cross-product signs are element-wise operations, which require relatively little computation.
FIG. 7 shows an improved LSTM network model.
As shown in FIG. 7, in order to reduce the computation amount of the LSTM layer, an additional projection layer is introduced to reduce the dimension of the model.
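The LSTM network accepts an input sequence x = (x_1, ..., x_T) and computes an output sequence y = (y_1, ..., y_T) by using the following equations iteratively from t = 1 to T:
$$
\begin{aligned}
i_t &= \sigma(W_{ix} x_t + W_{ir} y_{t-1} + W_{ic} c_{t-1} + b_i) \\
f_t &= \sigma(W_{fx} x_t + W_{fr} y_{t-1} + W_{fc} c_{t-1} + b_f) \\
c_t &= f_t \odot c_{t-1} + i_t \odot g(W_{cx} x_t + W_{cr} y_{t-1} + b_c) \\
o_t &= \sigma(W_{ox} x_t + W_{or} y_{t-1} + W_{oc} c_t + b_o) \\
m_t &= o_t \odot h(c_t) \\
y_t &= W_{ym} m_t
\end{aligned}
$$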
Here, the W terms denote weight matrices (e.g., W_ix is the matrix of weights from the input gate to the input), and W_ic, W_fc, W_oc are diagonal weight matrices for the peephole connections, which correspond to the three dashed lines in FIG. 7. The b terms denote bias vectors (b_i is the input gate bias vector), and σ is the logistic sigmoid function. The symbols i, f, o, c are respectively the input gate, forget gate, output gate and cell activation vectors, all of which are the same size as the cell output activation vector m. ⊙ is the element-wise product of vectors, and g and h are the cell input and cell output activation functions, generally tanh.
Since the structure of the above LSTM neural network is rather complicated, it is difficult to achieve the desired compression ratio with a single compression pass. Therefore, the inventors propose an improved multi-iteration compression method for deep neural networks (e.g., LSTM networks used in speech recognition), so as to reduce storage resources, accelerate computation, reduce power consumption and maintain network accuracy.
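One time step of the LSTM equations above can be sketched as follows. This is a minimal illustration under stated assumptions: the recurrent matrices are named W_ir, W_fr, W_cr, W_or, the output gate peephole is applied to c_t with bias b_o (following the cited Sak et al. papers), g and h are both tanh, and the diagonal peephole matrices are stored as vectors. The dictionary layout, the helper `init_params`, and the layer sizes are hypothetical.

```python
import numpy as np

def sigmoid(z):
    """Logistic sigmoid."""
    return 1.0 / (1.0 + np.exp(-z))

def init_params(n_in, n_cell, n_proj, rng):
    """Random parameters; names mirror the symbols in the equations."""
    p = {}
    for gate in "ifco":
        p[f"W_{gate}x"] = rng.normal(scale=0.1, size=(n_cell, n_in))
        p[f"W_{gate}r"] = rng.normal(scale=0.1, size=(n_cell, n_proj))
        p[f"b_{gate}"] = np.zeros(n_cell)
    for gate in "ifo":
        # Diagonal peephole matrices W_ic, W_fc, W_oc, kept as vectors.
        p[f"w_{gate}c"] = rng.normal(scale=0.1, size=n_cell)
    p["W_ym"] = rng.normal(scale=0.1, size=(n_proj, n_cell))   # projection layer
    return p

def lstm_step(x_t, y_prev, c_prev, p):
    """Compute y_t and c_t from x_t, y_{t-1} and c_{t-1}."""
    i_t = sigmoid(p["W_ix"] @ x_t + p["W_ir"] @ y_prev + p["w_ic"] * c_prev + p["b_i"])
    f_t = sigmoid(p["W_fx"] @ x_t + p["W_fr"] @ y_prev + p["w_fc"] * c_prev + p["b_f"])
    c_t = f_t * c_prev + i_t * np.tanh(p["W_cx"] @ x_t + p["W_cr"] @ y_prev + p["b_c"])
    o_t = sigmoid(p["W_ox"] @ x_t + p["W_or"] @ y_prev + p["w_oc"] * c_t + p["b_o"])
    m_t = o_t * np.tanh(c_t)
    y_t = p["W_ym"] @ m_t                  # projection reduces n_cell to n_proj
    return y_t, c_t

rng = np.random.default_rng(0)
n_in, n_cell, n_proj = 4, 6, 3             # hypothetical layer sizes
params = init_params(n_in, n_cell, n_proj, rng)
y, c = np.zeros(n_proj), np.zeros(n_cell)
for x_t in rng.normal(size=(5, n_in)):     # iterate from t = 1 to T = 5
    y, c = lstm_step(x_t, y, c, params)
```

The eight dense matrix-vector products per step dominate the cost, while the peephole terms and gate products are element-wise, which is consistent with the observation that the gate weight matrices are the natural target for compression.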