With the widespread use of electronic devices in the era of big data, more and more devices are required to perform increasingly complex processing on real-time input from the real world, for example industrial robots, self-driving unmanned vehicles, and mobile devices. Most of these tasks fall within the field of machine learning, where most operations are vector operations or matrix operations and therefore have a high degree of parallelism. Compared with the traditional general-purpose GPU/CPU acceleration schemes, the hardware ASIC accelerator is currently the most popular acceleration scheme. On the one hand, it provides a high degree of parallelism and can achieve high performance; on the other hand, it has high energy efficiency.
However, bandwidth becomes a bottleneck that limits the performance of the accelerator, and the common solution is to compensate for the imbalance in bandwidth through a cache located on the chip. Such common solutions do not optimize the reading and writing of data and do not make good use of the characteristics of the data, so that the on-chip storage overhead is excessive and the data read/write overhead is excessive. In current common machine learning algorithms, most of the data is reusable, i.e., the same data is used many times, so that the data shares identical parts, for example a weight in a neural network.
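By way of illustration only, the effect of data reuse on off-chip traffic can be sketched as a simple count. The following is a minimal sketch, not a description of any particular accelerator: the function name `count_offchip_reads`, its parameters, and the assumption that the on-chip buffer is large enough to hold every weight are all hypothetical. It counts how many weight fetches go off chip when the same weight matrix is applied to a batch of input vectors, with and without keeping the weights on chip.

```python
def count_offchip_reads(rows, cols, batch, buffer_weights):
    """Count off-chip weight fetches for `batch` matrix-vector products.

    Hypothetical model: each weight is identified by a flat address, and
    the on-chip buffer (when enabled) is assumed large enough for all of them.
    """
    offchip_reads = 0
    on_chip = set()  # addresses of weights currently held on chip
    for _ in range(batch):
        for r in range(rows):
            for c in range(cols):
                addr = r * cols + c
                if addr in on_chip:
                    continue  # reused from on-chip storage, no off-chip access
                offchip_reads += 1
                if buffer_weights:
                    on_chip.add(addr)
    return offchip_reads


# Same 4 x 8 weight matrix applied to 16 input vectors.
without_reuse = count_offchip_reads(4, 8, 16, buffer_weights=False)
with_reuse = count_offchip_reads(4, 8, 16, buffer_weights=True)
```

Under these assumptions, the count without reuse grows with the batch size, whereas with reuse each weight is fetched from off chip only once.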
In conclusion, the prior art clearly has inconveniences and defects in practical use, so improvement is necessary.