Artificial intelligence applications, such as applications that train neural networks and/or utilize neural networks to make inferences (e.g., identifying an object in an image, performing voice recognition, etc.), typically utilize relatively large amounts of compute capacity to perform tensor operations (e.g., matrix calculations, such as matrix multiplication) on matrix data. In some compute devices, the compute operations to support an artificial intelligence application may be offloaded from the general-purpose processor to an accelerator device, such as a graphics processing unit (GPU). However, while a GPU may be capable of performing tensor operations faster than the processor, the efficiency (e.g., energy usage and speed) with which the compute device is able to perform the operations is still hampered by the fact that the data to be operated on (e.g., matrix data) resides in memory and must be sent through a bus from the memory to the device performing the compute operations (e.g., the GPU), which consumes time and energy. As the complexity and amount of data to be operated on increase (e.g., with increasingly complex artificial intelligence applications), the inefficiencies of existing systems, in terms of energy usage and speed, may increase correspondingly.