Artificial intelligence (AI) processing has recently attracted considerable attention. It is both computationally and memory intensive, and it demands high performance-power efficiency. Accelerating such workloads on conventional devices such as central processing units (CPUs) and graphics processing units (GPUs) is difficult, and many solutions, such as GPU+TensorCore, the tensor processing unit (TPU), CPU+field programmable gate array (FPGA), and AI application-specific integrated circuits (ASICs), have been proposed to address these problems. GPU+TensorCore tends to focus on computationally intensive problems, the TPU tends to focus on computation and data reuse, and CPU+FPGA and AI ASICs focus on improving performance-power efficiency.
In AI processing, much of the data is zero because of neuron activation and weight pruning. To exploit this sparsity, a compression method is needed that achieves one or more of the following benefits: it may save computation, it may save power by skipping zero neurons or convolution weights, it may reduce the required buffer storage space, and it may increase effective DRAM bandwidth by not transmitting zero data.
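The computation and transfer savings described here can be illustrated with a minimal sketch. It assumes small integer activation and weight vectors invented for the example; the names `sparse_dot` and `nonzero_payload` are hypothetical and do not come from any particular accelerator design.

```python
def sparse_dot(activations, weights):
    """Dot product that skips every pair containing a zero operand.

    Returns (result, macs_done), where macs_done counts the
    multiply-accumulate operations actually performed.
    """
    assert len(activations) == len(weights)
    total = 0
    macs_done = 0
    for a, w in zip(activations, weights):
        if a == 0 or w == 0:
            continue  # a zero operand contributes nothing, so skip the MAC
        total += a * w
        macs_done += 1
    return total, macs_done


def nonzero_payload(values):
    """Values that would still need to be stored or transmitted
    once zeros are dropped (the cost of any mask is not counted here)."""
    return [v for v in values if v != 0]


# Illustrative data only: sparse activations and pruned weights.
acts = [0, 3, 0, 0, 2, 0, 1, 0]
wts = [5, 0, 0, 4, 2, 7, 0, 0]

result, macs = sparse_dot(acts, wts)
dense = sum(a * w for a, w in zip(acts, wts))
print(result, dense, macs, len(acts))
print(nonzero_payload(acts))
```

In this toy case the skipped and dense results agree while only one of eight multiply-accumulates is executed, and only three of eight activation values remain to be buffered or transferred.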
Although many similar solutions exist, they use only a single-layer compression scheme, which offers no obvious advantage. With a bit mask of two or more layers, if a bit in the higher-level mask is 0, the corresponding high-level data can be removed as a whole, because all of its branches are zero. Conventional single-layer mask compression cannot achieve this.
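The difference between a single-layer and a two-layer mask can be sketched as follows. This is an illustrative model only, not the encoding of any specific design: the block size of 4, the list-based layout, and the names `compress_two_level`, `decompress_two_level`, and `mask_bits` are all assumptions made for the example.

```python
def compress_two_level(data, block_size=4):
    """Compress `data` with a two-level bit mask.

    top_mask  : one bit per block, 0 if the whole block is zero
    low_masks : one per-element mask for each block whose top bit is 1
    values    : the nonzero values, in order
    """
    assert len(data) % block_size == 0
    top_mask, low_masks, values = [], [], []
    for start in range(0, len(data), block_size):
        block = data[start:start + block_size]
        if all(v == 0 for v in block):
            top_mask.append(0)  # whole branch is zero: store nothing below it
            continue
        top_mask.append(1)
        low_masks.append([1 if v != 0 else 0 for v in block])
        values.extend(v for v in block if v != 0)
    return top_mask, low_masks, values


def decompress_two_level(top_mask, low_masks, values, block_size=4):
    """Rebuild the original list from the two-level representation."""
    out = []
    masks = iter(low_masks)
    vals = iter(values)
    for bit in top_mask:
        if bit == 0:
            out.extend([0] * block_size)  # expand an all-zero block
            continue
        for m in next(masks):
            out.append(next(vals) if m else 0)
    return out


def mask_bits(top_mask, low_masks):
    """Total number of mask bits stored by the two-level scheme."""
    return len(top_mask) + sum(len(m) for m in low_masks)


# Illustrative data only: two of the four blocks are entirely zero.
data = [0, 0, 0, 0,  0, 5, 0, 0,  0, 0, 0, 0,  7, 0, 0, 9]
top, lows, vals = compress_two_level(data)
print(top, lows, vals)
print(mask_bits(top, lows), "mask bits versus", len(data), "for a single-layer mask")
```

For this input the two all-zero blocks are each represented by a single 0 bit in the top-level mask, so 12 mask bits are stored instead of the 16 a one-bit-per-element mask would need. The saving depends on zeros being clustered; with scattered nonzero values the top-level mask adds overhead instead.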