The performance of many applications is dominated by multiply and accumulate (MAC) operations. For instance, the performance of neuromorphic computing and machine learning applications depends largely on the efficiency with which MAC operations are performed. Accordingly, several hardware solutions have been explored and developed to increase that efficiency.
Graphics processing units (GPUs) are commonly utilized to perform MAC operations because the highly parallelized architecture of GPUs offers developers the ability to perform many multiplications in parallel. Accordingly, GPUs are generally capable of outperforming central processing units (CPUs) at performing MAC operations.
Recently, dedicated digital neuromorphic ASICs (e.g., tensor processing units (TPUs)) have been developed that are capable of outperforming GPUs because the architecture of these dedicated digital neuromorphic ASICs has been optimized for MAC operations. Additionally, neuromorphic applications can commonly tolerate lower precision (e.g., 8-bit or lower) than is typically required from GPUs, and therefore neuromorphic ASICs may achieve increased performance compared to GPUs by performing reduced-precision multiplication operations.
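As an illustration of the reduced-precision multiplication described above, the following is a minimal Python sketch of an 8-bit MAC. It assumes a simple symmetric quantization scheme with one scale factor per vector. That scheme, the function names quantize_int8 and int8_dot, and the scale values are hypothetical choices made for illustration and do not describe any particular ASIC.

```python
def quantize_int8(values, scale):
    """Map real values to signed 8-bit integers.

    Assumption: symmetric quantization, q = round(v / scale), clamped to
    the int8 range [-128, 127]. Python's round() rounds ties to even.
    """
    return [max(-128, min(127, round(v / scale))) for v in values]


def int8_dot(xq, wq, x_scale, w_scale):
    """Dot product of two int8 vectors using integer MAC steps.

    Products are summed in a wide integer accumulator, then rescaled
    once at the end to recover an approximate real-valued result.
    """
    acc = 0
    for a, b in zip(xq, wq):
        acc += a * b  # one 8-bit multiply and one accumulate
    return acc * x_scale * w_scale


if __name__ == "__main__":
    x = [0.5, -1.0, 2.0]
    w = [0.5, 0.5, 0.5]
    xq = quantize_int8(x, 0.5)    # hypothetical input scale
    wq = quantize_int8(w, 0.25)   # hypothetical weight scale
    print(int8_dot(xq, wq, 0.5, 0.25))
```

In this sketch the inner loop operates only on small integers, and the two scale factors are applied once per output rather than once per multiplication.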
However, performing MAC operations digitally is relatively expensive compared to analog implementations, particularly when the MAC operation is a vector multiplied by a matrix, as in the case of neural networks.
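To make the cost of the vector-by-matrix case concrete, the following minimal Python sketch writes the product out as explicit MAC steps and counts them. The function name vector_matrix_mac and the example values are hypothetical and serve only as illustration. For an input vector of length n and an n-by-m weight matrix, the sketch performs n times m MAC operations.

```python
def vector_matrix_mac(x, w):
    """Compute y = x * W with explicit multiply-accumulate steps.

    x is a list of n inputs and w is a list of n rows, each holding
    m weights. Returns the m outputs and the number of MACs performed.
    """
    n, m = len(x), len(w[0])
    y = [0] * m
    mac_count = 0
    for j in range(m):          # one output per weight column
        acc = 0
        for i in range(n):      # one MAC per input element
            acc += x[i] * w[i][j]
            mac_count += 1
        y[j] = acc
    return y, mac_count


if __name__ == "__main__":
    x = [1, 2, 3]
    w = [[1, 0],
         [0, 1],
         [2, 2]]
    print(vector_matrix_mac(x, w))
```

Each output element depends on every input element, so the number of MAC operations grows with the product of the input and output dimensions of a layer.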
Additionally, for applications that require large neural networks, a substantial latency and power penalty can be incurred when transferring the weights to and from memory due to memory bottlenecks. These bottlenecks, which make weight transfers expensive, may be reduced by increasing the amount of on-board cache/memory.