Deep neural network algorithms involve a large number of matrix calculations, which generally leads to hardware architectures built around very wide single-instruction multiple-data (SIMD) processing units and large on-chip storage. Due to the nature of deep learning, different SIMD lanes need to exchange data from time to time. A number of memory architectures exist that provide cross-lane data processing and computing, but these architectures are deficient for several reasons, such as unacceptable increases in memory access latency, bank conflicts, and degraded performance.