Central processing units (CPUs) are generally not configured to efficiently multiply sparse arrays (e.g., sparse vectors and sparse matrices) by dense vectors. CPUs with advanced vector extension (AVX) units generally perform sparse-matrix-times-dense-vector and sparse-vector-times-dense-vector multiplication operations by gathering from and scattering to a dense vector data structure. However, existing architectures include only two read ports and one write port in the level 1 (L1) cache. Accordingly, when the gathered data is spread across more than two cache lines, the gather throughput for the L1 cache is two 4-byte reads per clock cycle. The level 2 (L2) cache has one read port, and that read port may be shared by multiple cores. As a result, the gather throughput in many processors is 0.5 to 1 4-byte word per clock cycle. Similarly, with just one write port in each of the L1 and L2 caches, the scatter throughput is the same or lower. Existing CPUs therefore have a hardware bottleneck for performing gathers and scatters.
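For illustration only, the gather and scatter access patterns referred to above can be sketched in scalar form as follows. This is a minimal sketch, not an implementation taken from this document: the compressed sparse row (CSR) layout, the function names `csr_spmv` and `scatter_axpy`, and the parameter names are assumptions chosen for the example. It shows where the data-dependent reads (gathers) and writes (scatters) to the dense vector arise; it does not model cache ports or throughput.

```python
from typing import List, Sequence


def csr_spmv(row_ptr: Sequence[int], col_idx: Sequence[int],
             values: Sequence[float], x: Sequence[float]) -> List[float]:
    """Sparse matrix (assumed CSR layout) times dense vector x."""
    num_rows = len(row_ptr) - 1
    y = [0.0] * num_rows
    for row in range(num_rows):
        acc = 0.0
        for k in range(row_ptr[row], row_ptr[row + 1]):
            # Gather: x is read at a data-dependent index, so consecutive
            # reads may land on different cache lines.
            acc += values[k] * x[col_idx[k]]
        y[row] = acc
    return y


def scatter_axpy(indices: Sequence[int], values: Sequence[float],
                 alpha: float, y: List[float]) -> None:
    """In place: y[indices[k]] += alpha * values[k] for a sparse vector."""
    for idx, val in zip(indices, values):
        # Scatter: y is written at a data-dependent index.
        y[idx] += alpha * val


# Hypothetical 3x3 example matrix:
# [[1, 0, 2],
#  [0, 0, 3],
#  [4, 5, 0]]
row_ptr = [0, 2, 3, 5]
col_idx = [0, 2, 2, 0, 1]
values = [1.0, 2.0, 3.0, 4.0, 5.0]
x = [1.0, 2.0, 3.0]
print(csr_spmv(row_ptr, col_idx, values, x))  # [7.0, 9.0, 14.0]
```

In a vectorized version of the inner loop, each element read through `col_idx[k]` would become one lane of a gather, which is why the number of cache read ports, rather than arithmetic width, can bound the loop's throughput.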