1. Field of the Invention
The present invention generally relates to achieving near-peak performance on a SIMD (Single Instruction Multiple Data) vector machine. Reflecting implementation constraints, many, if not all, current machines of this kind do not implement, as co-processor unit instructions for their Fused Multiply Add (FMA) units, a vector-times-scalar fused multiply add or a cross load with or without negation. Such instructions would provide ideal loading of data for executing the complex or real matrix operation _GEMM, namely CGEMM/ZGEMM or SGEMM/DGEMM. More specifically, the present invention achieves the effect of these instructions for efficient matrix multiplication of complex data by using a plurality of other available instructions to preprocess data in the co-processor unit data registers before the data is input into pipelined FMAs. The preprocessing comprises at least one of selectively duplicating real and imaginary parts and selectively reversing real and imaginary components, or just selectively duplicating real or imaginary parts.
2. Description of the Related Art
The present invention is generally directed to any computer architecture that lacks one or more hardware instructions, such as a vector-times-scalar fused multiply add or a cross load with or without negation, for use in a co-processing unit loading operation for its fused multiply add (FMA) units. It is also directed to any computer architecture in which there is no negative butterfly load instruction, that is, an instruction that receives operand (a,b) as input and provides the result (−b,a). These operations are important for loading data for matrix processing in _GEMM when the matrix data consists of complex numbers, since i*i=−1 produces a sign change in the real component of the product. That effect can be addressed by loading data into the FMA using one or more of these instructions, which are typically not provided in many modern SIMD instruction sets.
General Matrix Multiply (_GEMM) is a subroutine in the Basic Linear Algebra Subprograms (BLAS) which performs matrix multiplication, that is, the multiplication of two matrices with update C=C±A·B. This includes: SGEMM for single precision, DGEMM for double-precision, CGEMM for complex single precision, and ZGEMM for complex double precision.
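As a point of reference for the update C=C±A·B, the following is a minimal, unoptimized sketch in Python. The function name `gemm_update` and the list-of-rows matrix layout are assumptions made for illustration only; they are not part of the BLAS interface, which additionally takes scaling factors and transpose options.

```python
def gemm_update(C, A, B, sign=1):
    """Return C + sign * (A . B) for matrices stored as lists of rows.

    Hypothetical helper for illustration; not the BLAS _GEMM signature.
    Works for real entries (S/DGEMM-like) and Python complex entries
    (C/ZGEMM-like) alike.
    """
    m, k, n = len(A), len(B), len(B[0])
    out = [row[:] for row in C]          # leave the caller's C untouched
    for i in range(m):
        for j in range(n):
            acc = 0
            for p in range(k):           # inner product of row i and column j
                acc += A[i][p] * B[p][j]
            out[i][j] += sign * acc      # sign=+1 gives C+A.B, sign=-1 gives C-A.B
    return out
```

For example, `gemm_update([[0, 0], [0, 0]], [[1, 2], [3, 4]], [[5, 6], [7, 8]])` yields `[[19, 22], [43, 50]]`, and passing `sign=-1` subtracts the product instead.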
Ideally, for optimal efficiency in terms of minimizing the number of instructions, data for vector multiplication using complex numbers would be loaded into vector FMAs using a hardware loading instruction that places the data into the vector FMA correctly to achieve the negation occurring in the real component, so that the vector FMA can then routinely perform vector multiplication that provides this sign change effect for the real component. Without such a data loading hardware instruction for complex numbers, including a negation, the straightforward method to deal with the sign change effect would involve using additional stages of processing, thereby possibly making it less efficient overall.
FIG. 1 exemplarily illustrates a computer architecture 100 suitable for matrix processing. The central processing unit (CPU) 101 executes logical and, possibly, arithmetic operations, but, more typically, in more recent computer architectures, at least one co-processor 102 is attached to the CPU function, as dedicated to specific forms of processing, such as the vector fused multiply add (FMA) unit (e.g., a co-processing unit) exemplarily indicated in FIG. 1. It is noted that co-processor 102 might also be a floating-point unit (FPU) when the vector length is one. Both the CPU 101 and FMA 102 have respective register sets 103, 104 for exchanging data and serving as a workspace, or, more typically, the register set is commonly shared between the CPU and FMA. Typically, in recent SIMD machines, to which the present invention is particularly addressed, there will be more than one CPU 101 and more than one vector FMA unit 102, where the vector FMAs comprise scalar FMAs that are controlled in a pipelined manner to provide processing for each vector component. The present invention is also directed toward machines expected to be selectively able to perform vector multiplication on complex data, thereby adding the complication of achieving the effect of multiplying i·i=−1.
Typical architectures will also include one or more levels of cache, such as the L1 cache 106 and L2 cache 107, as well as possibly higher level caches 108, and a main memory 105. It is noted that the present inventors also refer to the working register set 103, 104 for the CPU/FMA as the “L0 cache.”
The precise details of the exemplary architecture are less important to understanding the present invention than the fact that the present invention is directed to modern machines in which the matrix processing uses a co-processor 102. The present inventors refer to such a machine as a “SIMD machine”, which has various forms of “strictness.” The architectures to which the present invention is directed include the Power ISA with either a VMX capability or Vector Scalar Extension (VSX) capability, which will be used in examples herein and will be used to provide specific aspects of exemplary embodiments. Conceptual extensions to these architectures are also considered, such as a theoretical VMX extension supporting double-precision (DP) arithmetic.
In another exemplary embodiment, this invention is directed at more generalized SIMD architectures based upon VMX or VSX features, but containing more data elements within a wide vector, and having appropriately adapted instruction sets to compute and rearrange data on such a wide vector exceeding the current VMX or VSX vector width of 128 bits. For purpose of this exposition, these conceptually extended instruction sets may be similarly referred to as VMX or VSX instruction sets without limitation to the applicability of this invention to a wide range of SIMD instruction sets having like features.
In yet other exemplary embodiments, the inventions described herein could be applied to Intel and AMD architecture systems with MMX and/or SSE capabilities, or other processors with data-parallel SIMD processing capabilities.
More precisely, “strict SIMD machine” in the context of the present invention means that the co-processor function used to perform vector FMAs has neither vector-scalar FMA instructions nor cross load (with or without negation) instructions. That is, there are no vector FMA instructions such that, given received operands (a,b) and (c,d) as inputs, the operand results (−b,a), (c,c) and (d,d) are provided, as would be required for complex matrix multiplication, since multiplication of two complex numbers requires that (a+bi)(c+di)=(ac−bd)+(bc+ad)i. Likewise, for real vector-scalar multiplication where V=(a,b) and s=c, there is no instruction that produces (ac,bc).
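To make the role of the operand forms (−b,a), (c,c) and (d,d) concrete, the following Python sketch emulates two-element vector registers with tuples. All helper names (`splat`, `cross_negate`, `vec_fma`, `complex_fma`) are hypothetical and chosen only for illustration; they do not correspond to VMX or VSX mnemonics, and the arithmetic is ordinary multiply-then-add rather than a hardware fused operation.

```python
def splat(v, lane):
    """Duplicate one lane: (c, d) -> (c, c) for lane 0, (d, d) for lane 1."""
    return (v[lane], v[lane])

def cross_negate(v):
    """Emulate a cross load with negation: (a, b) -> (-b, a)."""
    a, b = v
    return (-b, a)

def vec_fma(x, y, acc):
    """Elementwise x*y + acc on 2-lane vectors (not truly fused in Python)."""
    return tuple(xi * yi + ai for xi, yi, ai in zip(x, y, acc))

def complex_fma(ab, cd, acc):
    """Accumulate (a+bi)(c+di) into acc = (re, im) using two vector FMAs."""
    acc = vec_fma(ab, splat(cd, 0), acc)                 # adds (a*c,  b*c)
    acc = vec_fma(cross_negate(ab), splat(cd, 1), acc)   # adds (-b*d, a*d)
    return acc                                           # (ac-bd, bc+ad) + acc
```

With (a,b)=(1,2) and (c,d)=(3,4), `complex_fma((1, 2), (3, 4), (0, 0))` gives `(-5, 10)`, matching (1+2i)(3+4i)=−5+10i. On a strict SIMD machine, the work done here by `splat` and `cross_negate` is what must be assembled from other available instructions.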
That is, given complex matrices, A and B, respectively, having complex elements aij=a+bi and bij=c+di, double precision complex matrix multiplication using ZGEMM requires:
$$
C = C \pm A \cdot B = c_{ij} \pm a_{ik} \cdot b_{jk} = c_{ij} \pm \begin{pmatrix} ac - bd \\ ad + bc \end{pmatrix}.
$$
From the above ZGEMM equation, it can be seen that, as exemplarily demonstrated, a vector FMA consists of two scalar FMAs, one that calculates the real component ac−bd and the other that calculates the imaginary component ad+bc. Each scalar FMA is capable of receiving operands (a,b) and (c,d). It should be clear that there may be additional FMAs, so that vector multiplication on vectors having different vector components can be occurring in parallel.
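The per-element update can be sketched as two independent scalar accumulations, one for the real component and one for the imaginary component. The function name `zdot_update` and the representation of each complex operand as a `(real, imag)` pair are assumptions for illustration; the sketch says nothing about how the operands are laid out in registers.

```python
def zdot_update(c, a_terms, b_terms, sign=1):
    """Update one complex element c = (re, im) by +/- the sum of a*b products.

    a_terms and b_terms are equal-length sequences of (real, imag) pairs.
    Hypothetical helper; each lane mirrors one scalar FMA chain.
    """
    re, im = c
    for (a, b), (cc, d) in zip(a_terms, b_terms):
        re += sign * (a * cc - b * d)    # real lane: ac - bd
        im += sign * (a * d + b * cc)    # imaginary lane: ad + bc
    return (re, im)
```

For instance, `zdot_update((0, 0), [(1, 2), (0, 1)], [(3, 4), (2, 0)])` returns `(-5, 12)`, which agrees with (1+2i)(3+4i)+(i)(2)=−5+12i.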
The advantage of having, for example, the cross load with negation instruction is that it can be used to selectively load the operands for the scalar FMAs to provide the appropriate inputs necessary for the above ZGEMM processing to achieve the effect of the negation of the real component.
The present inventors have recognized that there are possible improvements to further optimize efficient processing of this multiplication as it occurs in a “strict SIMD machine”, that is, a machine in which there is no hardware vector FMA loading instruction that would automatically load operand data into the vector FMA so that the scalar FMA would provide the negation that results from i·i=−1 during complex matrix multiplication. Without a hardware loading instruction that provides some form of negation during loading, additional processing of the multiplication result is necessary to achieve this negation effect.
Thus, a need exists in such machines lacking such hardware loading instructions for a mechanism to improve efficiency of matrix multiplication using complex data.