In the era of big data, more and more devices are required to perform increasingly complex processing on real-time, real-world input; examples include industrial robots, self-driving cars, and mobile devices. These tasks mostly belong to the field of machine learning, where most operations are vector or matrix operations with a high degree of parallelism. Compared with the conventional GPU/CPU acceleration scheme, the hardware ASIC accelerator is currently the most popular acceleration scheme. On the one hand, it provides a high degree of parallelism and can therefore achieve high performance; on the other hand, it has high energy efficiency.
However, memory bandwidth becomes a bottleneck that limits the performance of the accelerator, and the common solution is to compensate for the bandwidth imbalance with a cache located on the chip. These common solutions do not optimize data reading and writing and do not exploit the characteristics of the data, so both the on-chip storage overhead and the data read/write overhead are excessive. In current common machine learning algorithms, most of the data is reusable, i.e., the same data is used many times, so the data is addressed repeatedly; a weight in a neural network is one example.
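The repetitive addressing described above can be illustrated with a minimal sketch. It is only an illustration under stated assumptions, not a scheme taken from the prior art or from this disclosure: the function names `weight_address_trace` and `count_offchip_reads`, the row-major address layout, and the simple on-chip buffer that retains addresses without eviction are all hypothetical.

```python
def weight_address_trace(rows, cols, num_inputs):
    """Yield the weight addresses touched when one rows x cols weight matrix
    is applied to num_inputs input vectors (hypothetical row-major layout)."""
    for _ in range(num_inputs):
        for r in range(rows):
            for c in range(cols):
                yield r * cols + c


def count_offchip_reads(trace, on_chip_capacity):
    """Count off-chip reads for an address trace, assuming a simple on-chip
    buffer that keeps the first on_chip_capacity distinct addresses it sees
    and never evicts them (an assumed policy, chosen for simplicity)."""
    resident = set()
    reads = 0
    for addr in trace:
        if addr in resident:
            continue  # served from the on-chip buffer, no off-chip traffic
        reads += 1  # the address must be fetched from off-chip memory
        if len(resident) < on_chip_capacity:
            resident.add(addr)
    return reads


if __name__ == "__main__":
    rows, cols, num_inputs = 4, 8, 10
    for capacity in (0, 16, 32):
        trace = weight_address_trace(rows, cols, num_inputs)
        reads = count_offchip_reads(trace, capacity)
        print(f"on-chip capacity {capacity:2d}: {reads} off-chip reads")
```

In this sketch every weight address is visited once per input vector, so the off-chip traffic falls as more of the repeatedly addressed weights are kept on the chip. The sketch does not model bandwidth, latency, or any particular cache organization.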
In conclusion, the prior art clearly has drawbacks and deficiencies in practical use, and improvement is therefore needed.