Binary adder circuits are well known in the art as fundamental components of computer processing architectures. Many adder implementations exist, each offering advantages with respect to processing speed, design efficiency (i.e., minimizing the size of the resulting circuitry), or other architectural parameters.
In the oldest (and still most common) type of processing, a processor executes a single instruction stream that operates upon a single data stream. This type of processing is commonly referred to as “Single Instruction Single Data,” or “SISD,” and is utilized by all personal computers, and by nearly all other computers designed and fabricated until fairly recently. SISD processing is useful for a large number of applications, but can be slow and awkward for certain types of numerical operations, such as array and matrix operations. For example, suppose a processor must evaluate the sum of a 10-element array A and a constant k, i.e., k+A. With SISD processing, the processor must evaluate each element of the array A serially, i.e., k+A(0), k+A(1) . . . k+A(8), and k+A(9), in order to produce the resulting sum.
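By way of illustration only, the serial evaluation described above can be sketched in software as follows. The function name add_constant_serial and the sample values of A and k are assumptions chosen for the example; they do not appear in the description above.

```python
def add_constant_serial(a, k):
    """Add the constant k to every element of a, one element per step."""
    result = []
    for element in a:
        # Each pass through the loop models one SISD add instruction.
        result.append(k + element)
    return result


# Hypothetical 10-element array A(0)..A(9) and constant k.
A = list(range(10))
k = 5
sums = add_constant_serial(A, k)
```

In this sketch the ten additions are issued one after another, so the time taken grows with the number of elements in the array.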
A more efficient processing scheme for handling such array structures is commonly referred to as “Single Instruction Multiple Data,” or “SIMD.” In SIMD processing, a processor executes a single instruction stream that operates on multiple data streams simultaneously. Typically, source and destination registers hold data in the form of bit sequences. An SIMD processor can treat the data as one or a plurality of data elements, each defining a separate SIMD lane. The size of the SIMD lane employed is defined in the instruction and can change from one instruction to the next. The processor may include multiple CPUs operating in parallel on the different elements of the array. Such processing architectures typically operate synchronously, i.e., each CPU operates in lock-step with the other CPUs, such that all of the CPUs execute a particular instruction at the same time. Using the example above, in an SIMD processor the first CPU would perform the sum for the first element of the array A (i.e., A(0)+k), the second CPU would perform the sum for the second element of the array A (i.e., A(1)+k), etc., such that all 10 elements of the array A would be evaluated in one instruction period.
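The lane concept may be illustrated by the following software sketch, which models a register as a single integer and splits it into lanes whose width is supplied with each operation. The 64-bit register width, the helper names pack_lanes, unpack_lanes and simd_add, and the sample operands are all assumptions made for the example. The loop over lanes stands in for hardware that would process every lane at once.

```python
REGISTER_BITS = 64  # assumed register width for this sketch


def pack_lanes(values, lane_bits):
    """Pack a list of lane values into one register-sized integer."""
    if len(values) != REGISTER_BITS // lane_bits:
        raise ValueError("wrong number of lanes for this lane width")
    mask = (1 << lane_bits) - 1
    word = 0
    for i, value in enumerate(values):
        # Lane 0 occupies the least significant bits.
        word |= (value & mask) << (i * lane_bits)
    return word


def unpack_lanes(word, lane_bits):
    """Split a register-sized integer back into its lane values."""
    mask = (1 << lane_bits) - 1
    lanes = REGISTER_BITS // lane_bits
    return [(word >> (i * lane_bits)) & mask for i in range(lanes)]


def simd_add(a_word, b_word, lane_bits):
    """Add two registers lane by lane; a carry never leaves its lane."""
    a_lanes = unpack_lanes(a_word, lane_bits)
    b_lanes = unpack_lanes(b_word, lane_bits)
    return pack_lanes([x + y for x, y in zip(a_lanes, b_lanes)], lane_bits)


# One "instruction" adds k = 10 to four 16-bit elements.
a16 = pack_lanes([1, 2, 3, 4], 16)
k16 = pack_lanes([10] * 4, 16)
sum16 = unpack_lanes(simd_add(a16, k16, 16), 16)

# The next "instruction" uses eight 8-bit lanes instead.
a8 = pack_lanes([250, 1, 2, 3, 4, 5, 6, 7], 8)
k8 = pack_lanes([10] * 8, 8)
sum8 = unpack_lanes(simd_add(a8, k8, 8), 8)
```

In the second operation the first lane holds 250+10, which does not fit in eight bits; the sketch discards the overflow rather than letting it spill into the neighboring lane, which is the behavior that makes the lanes independent.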
Generally, prior art adders are characterized by a fixed data width. With an SISD processing architecture, the processor handles one data element at a time, so a fixed-width adder is typically selected to match the expected width of the data element. If the data element is narrower than the selected adder, the data element can be padded with zeros to match the width of the adder.
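A fixed-width adder of this kind may be modeled, purely as a sketch, by a ripple-carry chain of full adders. Ripple-carry is chosen here only because it is the simplest adder to express; the description above does not say which adder structure is used. The names full_adder and ripple_add and the default width of 16 bits are assumptions.

```python
def full_adder(a, b, carry_in):
    """One-bit full adder: returns (sum_bit, carry_out)."""
    sum_bit = a ^ b ^ carry_in
    carry_out = (a & b) | (carry_in & (a ^ b))
    return sum_bit, carry_out


def ripple_add(x, y, width=16):
    """Add x and y on a fixed-width ripple-carry adder.

    Returns (sum, carry_out). Operands narrower than the adder simply
    have zeros in their upper bit positions, i.e., they are zero padded.
    """
    mask = (1 << width) - 1
    x &= mask
    y &= mask
    carry = 0
    result = 0
    for i in range(width):
        # The carry ripples from bit position i to bit position i + 1.
        bit, carry = full_adder((x >> i) & 1, (y >> i) & 1, carry)
        result |= bit << i
    return result, carry


# Two 8-bit operands on the 16-bit adder: the upper eight bits are zero.
narrow_sum = ripple_add(0x12, 0x34)
# Full-width operands that overflow the adder produce a carry out.
wide_sum = ripple_add(0xFFFF, 0x0001)
```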
Recent SIMD architectures have included instructions that request a target adder to operate in multiple summing modes, depending on the size of the data elements being processed. For example, an SIMD processing architecture may include a 64-bit adder, and a particular SIMD instruction may require the adder to produce the sum of two 64-bit numbers. A subsequent SIMD instruction may require the same adder to simultaneously perform four 16-bit additions on numbers from an array. Yet another SIMD instruction may require the adder to simultaneously perform eight 8-bit additions. Thus, in this example, the same adder would be required to implement a single 64-bit summing lane, four 16-bit summing lanes, and eight 8-bit summing lanes. Prior art SIMD architectures have implemented such multi-mode summers by adding logic beyond the point where the data signals cross the summing lanes to simultaneously produce multiple summing mode results, then routing the different results via a multiplexor to the output of the adder. The particular summing mode result the multiplexor selects thus depends upon the summing mode requested by the SIMD instruction.
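The prior art arrangement described above, in which every summing mode result is produced and a multiplexor then selects one, may be sketched behaviorally as follows. This is a software model under stated assumptions, not a circuit description: the names lane_add and multimode_add, the 64-bit register width, and the use of a dictionary lookup to stand in for the multiplexor are all introduced for the example.

```python
REGISTER_BITS = 64  # assumed adder width, matching the example above


def lane_add(a, b, lane_bits):
    """Add a and b as independent lanes of lane_bits each."""
    mask = (1 << lane_bits) - 1
    result = 0
    for shift in range(0, REGISTER_BITS, lane_bits):
        lane_sum = ((a >> shift) & mask) + ((b >> shift) & mask)
        # Dropping bits above the mask keeps the carry inside the lane.
        result |= (lane_sum & mask) << shift
    return result


def multimode_add(a, b, mode):
    """Model of a multi-mode summer: compute every mode, then select.

    mode is the lane width requested by the instruction: 64, 16 or 8.
    """
    # All three results are produced regardless of the requested mode,
    # which models the additional logic the prior art approach requires.
    candidates = {
        64: lane_add(a, b, 64),
        16: lane_add(a, b, 16),
        8: lane_add(a, b, 8),
    }
    # The lookup plays the role of the output multiplexor.
    return candidates[mode]


# The same operands give a different result in each summing mode.
sum_64 = multimode_add(0xFFFF, 0x0001, 64)
sum_16 = multimode_add(0xFFFF, 0x0001, 16)
sum_8 = multimode_add(0xFFFF, 0x0001, 8)
```

With these operands the 64-bit mode lets the carry propagate into bit 16, the 16-bit mode discards it at the lane boundary, and the 8-bit mode discards it after the lowest byte, so the three modes differ only in where carries are allowed to cross.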
One disadvantage of this configuration is that the additional circuitry needed to simultaneously produce the multiple summing mode results and multiplex them adds size and complexity to the adder.
Another disadvantage of this configuration is that the additional circuitry adds to the propagation delay through the adder.
It is an object of the present invention to substantially overcome the above-identified disadvantages and drawbacks of the prior art.