Multilayer neural networks (MNNs) are widely applied in fields such as pattern recognition, image processing, function approximation, and optimization. In recent years, owing to their higher recognition accuracy and better parallelizability, multilayer artificial neural networks have received increasing attention from the academic and industrial communities. The two main MNN computing processes are forward propagation and backpropagation. The output data of the forward propagation process may be expressed as y=f(wx+b), in which w is the weight matrix that includes multiple weight values, x is the input data stored in the form of a matrix, b is a bias value, and f( ) is an activation function. In the forward propagation process, the multiplication of the weight matrix w and the input data matrix x may involve higher computational complexity than adding the bias value and applying the activation function.
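As an illustration only, the forward propagation expression y=f(wx+b) for a single layer may be sketched as follows. This is a minimal sketch under stated assumptions, not an implementation described in this document: the input x is taken to be a single column (a vector of n values), the bias b is taken to hold one value per output, and the names forward, op_counts, and sigmoid, as well as the choice of a sigmoid activation, are hypothetical. The op_counts helper simply tallies operations to show why the multiplication of w and x tends to dominate: it needs m*n multiply-add operations, whereas the bias addition and the activation each need only m.

```python
import math


def forward(w, x, b, f):
    """Compute y = f(w x + b) for one layer.

    w: weight matrix as a list of m rows, each with n values
    x: input as a list of n values (assumption: a single input column)
    b: list of m bias values (assumption: one bias per output)
    f: scalar activation function applied to each output
    """
    y = []
    for row, bias in zip(w, b):
        acc = 0.0
        # Matrix-vector product: n multiply-add operations per output row.
        for w_ij, x_j in zip(row, x):
            acc += w_ij * x_j
        # Bias addition and activation: one operation each per output.
        y.append(f(acc + bias))
    return y


def op_counts(m, n):
    """Count operations for an m-by-n weight matrix and an n-value input."""
    return {"multiply_add": m * n, "bias_add": m, "activation": m}


def sigmoid(z):
    # Hypothetical choice of activation function f( ).
    return 1.0 / (1.0 + math.exp(-z))


# Small illustrative values (not taken from this document).
w = [[1.0, 2.0], [0.0, -1.0]]
x = [0.5, 0.25]
b = [0.0, 0.25]
y = forward(w, x, b, sigmoid)
counts = op_counts(m=1024, n=1024)
```

With the hypothetical sizes m = n = 1024 used above, the count is 1,048,576 multiply-add operations against 1,024 bias additions and 1,024 activation evaluations.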
A known method of performing the matrix multiplication of a multilayer artificial neural network is to use a general-purpose processor. Such a method uses a general-purpose register file and a general-purpose functional unit to execute general-purpose instructions to support the algorithms in MNNs. However, one drawback of this method is the low operational performance of a single general-purpose processor, which cannot meet the performance requirements of typical multilayer neural network operations. When multiple general-purpose processors execute concurrently, the communication among them also becomes a performance bottleneck.
Another known method of performing the matrix multiplication of the multilayer artificial neural network is to use a graphics processing unit (GPU). Such a method uses a general-purpose register file and a general-purpose stream processing unit to execute general-purpose single-instruction-multiple-data (SIMD) instructions to support the algorithms in MNNs. However, since a GPU contains only a rather small on-chip cache, the model data (weight values) of a multilayer artificial neural network may have to be moved repeatedly from off-chip memory. As a result, off-chip bandwidth becomes a main performance bottleneck, and the repeated data movement causes considerable power consumption.