Single-instruction, multiple-data (SIMD) systems process multiple digital data streams by executing the same operation in parallel on each data stream. One commonly used operation is the multiply-accumulate operation, which computes the product of two numbers supplied in a given format and adds that product to an accumulator. An addend may also be supplied explicitly as an input. Thus, the multiply-accumulate operation generally evaluates the expression (A*B)+C.
A group of multiply-accumulate operations may be applied concurrently to a vector of inputs representing a set of data streams, resulting in a vector of outputs. In a conventional C-language representation, such a group of multiply-accumulate operations over sixteen data streams may be described as:
for (int i = 0; i < 16; i++)
{
    result.val[i] = a.val[i] * b.val[i] + c.val[i];
}
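The loop above leaves the operand types unstated. The following is a minimal self-contained sketch of the same sixteen-lane operation under assumed types: the names `vec_s16`, `vec_s32`, `vec_s64`, `LANES`, and `vec_mac` are hypothetical, and the choice of 16-bit multiplicands, 32-bit addends, and 64-bit results is an assumption made only so that no lane can overflow in this illustration.

```c
#include <stdint.h>

#define LANES 16  /* assumed lane count, matching the loop bound above */

/* Hypothetical vector types; the element widths are illustrative assumptions. */
typedef struct { int16_t val[LANES]; } vec_s16;  /* multiplicands a and b */
typedef struct { int32_t val[LANES]; } vec_s32;  /* addend c */
typedef struct { int64_t val[LANES]; } vec_s64;  /* result */

/* Computes result[i] = a[i] * b[i] + c[i] for every lane.
   Widening to 64 bits before multiplying keeps each lane free of overflow. */
vec_s64 vec_mac(const vec_s16 *a, const vec_s16 *b, const vec_s32 *c)
{
    vec_s64 result;
    for (int i = 0; i < LANES; i++) {
        result.val[i] = (int64_t)a->val[i] * b->val[i] + c->val[i];
    }
    return result;
}
```

On a general-purpose processor the lanes are evaluated one after another; a single-instruction multiple-data unit would evaluate them concurrently.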
Multiplication may be understood simply as repeated addition, which any general-purpose processor can perform, but in practice this approach to multiplication can be quite time-consuming. Many computing situations require high-throughput multiply-accumulate operations, so these operations are often performed by dedicated hardware instead of by regular processors. Hardware designers must not only minimize power consumption and integrated circuit area, but also balance the use of dedicated hardware against the workload removed from regular processors.
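To illustrate why the repeated-addition view is slow, the following minimal sketch multiplies two unsigned 16-bit values using additions only and counts how many additions were needed. The function name `mul_by_repeated_add` and the `add_count` parameter are hypothetical and serve only this illustration.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: multiplies a by b using additions only.
   The loop body runs b times, so the work grows with the value of b.
   If add_count is not NULL, it receives the number of additions performed. */
uint32_t mul_by_repeated_add(uint16_t a, uint16_t b, uint32_t *add_count)
{
    uint32_t product = 0;  /* 32 bits hold any product of two 16-bit values */
    uint32_t adds = 0;
    for (uint16_t i = 0; i < b; i++) {
        product += a;
        adds++;
    }
    if (add_count != NULL) {
        *add_count = adds;
    }
    return product;
}
```

In this sketch a multiplier of 65535 costs 65535 additions for a single product, which suggests why a dedicated multiplier is attractive when throughput matters.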
Multiply-accumulate operations are often needed where incoming data sets are of variable precision and may be signed (i.e., positive or negative in value). For example, nearly every digital signal processor uses digital hardware multipliers for demanding multimedia applications such as desktop video conferencing, which requires audio, image, and video processing, perhaps including 3-D graphics, speech recognition, and wireless communications. Multiply-accumulator units must therefore be configurable to handle the variety of incoming data formats (e.g., 8-bit bytes, 16-bit integers, 24-bit words, and 32-bit doubles) used in each aspect of the overall computing task. This universality or reconfigurability constraint complicates the hardware design problem.
Further, because multiply-accumulate operations may be repeated many times during the processing of a given data stream, accumulator overflow is not uncommon. Widening the accumulator to avoid overflow, unfortunately, usually requires extra hardware and adds delay.
Accordingly, the inventors have developed a novel apparatus and method for flexibly performing multiply-accumulate operations for incoming operands provided in a variety of formats, at high speed, and with low power consumption.