The present application relates generally to an improved data processing apparatus and method and more specifically to mechanisms for performing a floating point collect and operate for a summation across a vector for a dot product operation.
The dot product, also known as the scalar product, is an operation that takes two vectors over the real numbers R and returns a real-valued scalar quantity. The dot product is the standard inner product of the orthonormal Euclidean space. It contrasts with the cross product, which produces a vector result. The dot product of vectors a=[a1, a2, ..., an] and b=[b1, b2, ..., bn] is defined as follows:
$$a \cdot b = \sum_{i=1}^{n} a_i b_i = a_1 b_1 + a_2 b_2 + \dots + a_n b_n$$
where Σ denotes summation and n is the dimension of the vectors. Thus, the dot product represents a mathematical operation that requires computing a summation across a vector.
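As an illustration only, the definition above can be sketched as a scalar computation in Python. The function name `dot` is a hypothetical helper chosen for this sketch and is not taken from the application.

```python
def dot(a, b):
    """Return the dot product of two equal-length sequences of numbers.

    Hypothetical helper for illustration: multiplies corresponding
    elements and sums the n products, following the definition above.
    """
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    # Element-wise products a_i * b_i, then a summation across the vector.
    return sum(x * y for x, y in zip(a, b))
```

For example, `dot([1, 2, 3], [4, 5, 6])` evaluates 1·4 + 2·5 + 3·6. The element-wise products are independent of one another, whereas the final summation combines every element of the vector.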
Multimedia extensions (MMEs) have become one of the most popular additions to general-purpose microprocessors. Existing multimedia extensions can be characterized as single instruction multiple data (SIMD) path units that support packed fixed-length vectors. When an operation requiring a summation across a vector takes advantage of a SIMD instruction set, the operation must operate across the SIMD elements. Some operations, such as floating point addition, are very expensive, particularly in terms of latency.
An addition of four floating point values from a SIMD operand requires two adders in a first stage and a single adder in a second stage. Thus, the pipeline required to execute the floating point summation of a SIMD operand is twice as long as that required to perform an addition of two SIMD words. Furthermore, the summation yields on average one result per cycle, whereas the SIMD addition can produce four results per cycle.
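The contrast described above can be sketched in software as follows. This is a minimal model under stated assumptions, not a description of any particular hardware: a SIMD word is modeled as a four-element Python list, and the names `simd_add` and `sum_across` are hypothetical.

```python
def simd_add(x, y):
    """Element-wise addition of two four-element SIMD words (hypothetical model).

    All four additions are independent, so a single adder stage
    yields four results.
    """
    return [xi + yi for xi, yi in zip(x, y)]


def sum_across(v):
    """Sum the four elements of one SIMD word with a two-stage adder tree.

    Stage 1 uses two adders; stage 2 uses one adder that depends on both
    stage-1 results, so the dependency chain is two additions deep and
    only one result is produced.
    """
    v0, v1, v2, v3 = v
    s01 = v0 + v1      # stage 1, first adder
    s23 = v2 + v3      # stage 1, second adder
    return s01 + s23   # stage 2, single adder
```

In this model, `simd_add` returns four sums after one level of addition, while `sum_across` returns a single sum after two dependent levels, which mirrors the latency and throughput difference described above.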