Field of the Invention
The invention relates to matrix multipliers and in particular to a residue number matrix multiplier and methods therefor.
Related Art
The use of Convolutional Neural Networks (CNN's) has exploded due to emerging technologies such as autonomous vehicles and cloud-based AI. Unfortunately, the intense numerical processing demands of CNN's place a heavy workload on servers using general purpose CPU and GPU technology; this translates to high power consumption and cost. Factors such as the slowing of Moore's law, the need to save power, and the ever-increasing demand for compute capability create opportunities for hardware accelerators that are streamlined to solve specific problems.
One type of circuit for AI acceleration is a so-called hardware matrix multiplier, i.e., a systolic array of multiplier-accumulators coupled to perform matrix multiplication. The advantage of the matrix multiplier is derived from the massive parallelism afforded by a two-dimensional array of processing elements and is also due to the streamlined flow of matrix data to the many processing elements.
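As an illustration only, the data flow of such an array can be sketched in software. The sketch below assumes an output-stationary arrangement, in which each processing element keeps one accumulator and operands arrive skewed in time so that element (i, j) sees operand pair k at step t = i + j + k. The function name `systolic_matmul` and this particular arrangement are assumptions made for the example; they are not taken from any specific prior art device.

```python
def systolic_matmul(a, b):
    """Simulate an output-stationary systolic array computing a @ b.

    a is an n x p matrix and b is a p x m matrix, both given as lists of rows.
    Processing element (i, j) owns accumulator acc[i][j]. Operands are skewed
    in time, so element (i, j) receives a[i][k] and b[k][j] at step t = i + j + k.
    """
    n, p, m = len(a), len(b), len(b[0])
    acc = [[0] * m for _ in range(n)]

    # The last operand pair reaches the far corner at step (n-1) + (m-1) + (p-1).
    total_steps = n + m + p - 2
    for t in range(total_steps):
        # In hardware, every element performs this step in parallel.
        for i in range(n):
            for j in range(m):
                k = t - i - j
                if 0 <= k < p:
                    acc[i][j] += a[i][k] * b[k][j]  # one multiply-accumulate
    return acc


result = systolic_matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]])
```

In this sketch the two inner loops stand in for the parallel hardware: an n × m array completes the product in n + m + p − 2 steps, rather than the n·m·p steps a single multiplier-accumulator would need.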
The mapping of neural network algorithms to systolic array architectures was proposed and analyzed by S. Y. Kung and others in the early 1990's. S. Y. Kung re-formulates the retrieving phase of neural networks by mapping it to consecutive matrix multiplication interleaved with a non-linear activation function. In another adaptation, 2D-convolution used in AI pattern recognition is mapped to matrix multiplication by re-ordering input data flow.
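The re-ordering of input data that maps 2D convolution to matrix multiplication can be sketched as follows. This is a minimal example, assuming unit stride, no padding, and the unflipped kernel convention common in CNN's (cross-correlation). The helper names `im2col` and `conv2d_as_matmul` are hypothetical and chosen for illustration only.

```python
def im2col(image, kh, kw):
    """Re-order an image so that each row is one flattened kh x kw patch."""
    h, w = len(image), len(image[0])
    rows = []
    for r in range(h - kh + 1):
        for c in range(w - kw + 1):
            rows.append([image[r + dr][c + dc]
                         for dr in range(kh) for dc in range(kw)])
    return rows


def conv2d_as_matmul(image, kernel):
    """Compute a 2D convolution as (patch matrix) x (flattened kernel)."""
    kh, kw = len(kernel), len(kernel[0])
    patches = im2col(image, kh, kw)
    flat_kernel = [v for row in kernel for v in row]

    # Matrix-vector product: one dot product per output pixel.
    flat_out = [sum(p * q for p, q in zip(patch, flat_kernel))
                for patch in patches]

    # Reshape the flat result back into a 2D output map.
    out_w = len(image[0]) - kw + 1
    return [flat_out[i:i + out_w] for i in range(0, len(flat_out), out_w)]


image = [[1, 2, 3],
         [4, 5, 6],
         [7, 8, 9]]
kernel = [[1, 0],
          [0, 1]]
output = conv2d_as_matmul(image, kernel)
```

With several kernels, the flattened kernels form the columns of a second matrix, and the whole convolution layer becomes a single matrix multiplication suitable for a systolic array.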
Recently, a systolic architecture for processing CNN's called the Tensor Processing Unit (TPU) was developed by Google Inc. The TPU uses a 256×256 element matrix multiplier coupled to circuits enabling data pooling, normalization, and application of a non-linear activation function. The TPU significantly accelerates the inference phase of CNN's by supporting a minimal operand precision, but it does not support the precision required for training phases. The problem is exacerbated when developing neural network weights during training phases of the CNN's, since the same TPU hardware cannot be used to train the network.
Moreover, convolution algorithms have been found to be sensitive to limited numerical precision.
From the discussion that follows, it will become apparent that the present invention addresses the deficiencies associated with the prior art while providing numerous additional advantages and benefits not contemplated or possible with prior art constructions.