The present application relates generally to an improved data processing apparatus and method, and more specifically to mechanisms for implementing matrix multiplication operations with data pre-conditioning in a high-performance computing architecture.
In many prior art data-parallel Single Instruction Multiple Data (SIMD) vector architectures, algorithms have had to rely either on data re-arrangement in the core, at the cost of increased instruction bandwidth, or on special data memory layouts, possibly including data duplication. Such layouts require both increased data memory bandwidth (to load the duplicated values) and increased instruction bandwidth (to put the initial data values into the duplicated format).
In modern SIMD vector architectures, data bandwidth is at a premium and often limits total performance. It is therefore desirable to reduce the data bandwidth required to achieve the full performance potential of a microprocessor implementing a specific algorithm. Furthermore, in many modern architectures, instruction issue capability is also at a premium, such that when an instruction of one type is issued, an instruction of another type often cannot be issued. For example, in one implementation, either a data reorganization instruction (such as a splat or permute) or a compute instruction (such as a Floating Point Multiply Add (FMA)) can be issued, but not both. Consequently, whenever data reorganization instructions are necessary, the microprocessor cannot achieve its full peak performance potential as expressed in floating point operations per second (FLOPS).
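By way of illustration only, the issue-slot contention described above may be sketched in portable C that emulates a 4-wide SIMD unit. The types `vec4` and `issue_counts` and the functions `vec_splat`, `vec_load`, `vec_fma`, and `matmul4_splat_fma` are hypothetical names introduced for this sketch and do not correspond to any particular instruction set; the sketch assumes row-major 4x4 matrices and counts only splat and FMA operations.

```c
#include <assert.h>

/* Hypothetical 4-wide SIMD register, emulated in portable C. */
typedef struct { float lane[4]; } vec4;

/* Tally of emulated instructions (illustrative only; loads are not counted). */
typedef struct { unsigned splat; unsigned fma; } issue_counts;

/* Data reorganization: replicate one scalar into every lane. */
static vec4 vec_splat(float s, issue_counts *n) {
    vec4 r = {{s, s, s, s}};
    n->splat++;
    return r;
}

/* Load four consecutive floats into a register. */
static vec4 vec_load(const float *p) {
    vec4 r = {{p[0], p[1], p[2], p[3]}};
    return r;
}

/* Compute: element-wise a * b + acc. */
static vec4 vec_fma(vec4 a, vec4 b, vec4 acc, issue_counts *n) {
    vec4 r;
    for (int i = 0; i < 4; i++)
        r.lane[i] = a.lane[i] * b.lane[i] + acc.lane[i];
    n->fma++;
    return r;
}

/* C = A * B for row-major 4x4 matrices stored as 16 floats each.
   Every FMA is preceded by a splat of one element of A. */
static issue_counts matmul4_splat_fma(const float A[16], const float B[16],
                                      float C[16]) {
    issue_counts n = {0, 0};
    assert(A && B && C);
    for (int i = 0; i < 4; i++) {
        vec4 acc = {{0.0f, 0.0f, 0.0f, 0.0f}};
        for (int k = 0; k < 4; k++) {
            vec4 a = vec_splat(A[4 * i + k], &n); /* reorganization slot */
            vec4 b = vec_load(&B[4 * k]);         /* row k of B */
            acc = vec_fma(a, b, acc, &n);         /* compute slot */
        }
        for (int j = 0; j < 4; j++)
            C[4 * i + j] = acc.lane[j];
    }
    return n;
}
```

For a 4x4 product, this sketch tallies 16 splats and 16 FMAs. If a splat and an FMA cannot issue together, half of the issue slots are therefore spent on data reorganization rather than on computation.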
Furthermore, because of these significant limitations, and in particular because of the limited data layout and element-wise computation nature of SIMD vector architectures, SIMD instruction set architectures have not, in practice, realized their full performance potential for complex arithmetic. While some architectures have attempted to remedy this with a paired floating point instruction set, such instruction sets have required all-to-all communication between the floating point units, which has severely limited their performance. The fastest paired floating point design has not been able to exceed an operational frequency of 1 GHz. Moreover, the limitations inherent in this type of architecture make it impractical to scale to more than two arithmetic units. In contrast, architectures with true SIMD implementations, such as the Cell Synergistic Processing Element (SPE), available from International Business Machines (IBM) Corporation of Armonk, N.Y., have operated at frequencies well above 3 GHz.
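By way of illustration only, the difficulty that element-wise SIMD operations pose for complex arithmetic may be sketched as follows. The sketch assumes an interleaved layout in which one 4-lane register holds two complex values as {re0, im0, re1, im1}. The names `cvec4`, `op_counts`, `vpermute`, `vmul`, `vfma`, and `complex_mul2` are hypothetical and are introduced only for this example. The sequence shown is one possible formulation, not necessarily the most economical one.

```c
#include <assert.h>

/* Hypothetical 4-lane register holding two interleaved complex values:
   lane = { re0, im0, re1, im1 }. */
typedef struct { float lane[4]; } cvec4;

/* Tally of emulated operations (illustrative only). */
typedef struct { unsigned reorg; unsigned arith; } op_counts;

/* Data reorganization: r.lane[i] = v.lane[idx[i]]. */
static cvec4 vpermute(cvec4 v, const int idx[4], op_counts *n) {
    cvec4 r;
    for (int i = 0; i < 4; i++) {
        assert(idx[i] >= 0 && idx[i] < 4);
        r.lane[i] = v.lane[idx[i]];
    }
    n->reorg++;
    return r;
}

/* Compute: element-wise a * b. */
static cvec4 vmul(cvec4 a, cvec4 b, op_counts *n) {
    cvec4 r;
    for (int i = 0; i < 4; i++)
        r.lane[i] = a.lane[i] * b.lane[i];
    n->arith++;
    return r;
}

/* Compute: element-wise a * b + c. */
static cvec4 vfma(cvec4 a, cvec4 b, cvec4 c, op_counts *n) {
    cvec4 r;
    for (int i = 0; i < 4; i++)
        r.lane[i] = a.lane[i] * b.lane[i] + c.lane[i];
    n->arith++;
    return r;
}

/* Multiply two pairs of complex numbers, x = a + bi and y = c + di:
   real part = ac - bd, imaginary part = ad + bc. */
static cvec4 complex_mul2(cvec4 x, cvec4 y, op_counts *n) {
    static const int dup_re[4] = {0, 0, 2, 2};
    static const int dup_im[4] = {1, 1, 3, 3};
    static const int swap[4]   = {1, 0, 3, 2};
    const cvec4 sign = {{-1.0f, 1.0f, -1.0f, 1.0f}};

    cvec4 xr  = vpermute(x, dup_re, n); /* { a0,  a0,  a1,  a1} */
    cvec4 xi  = vpermute(x, dup_im, n); /* { b0,  b0,  b1,  b1} */
    cvec4 ys  = vpermute(y, swap, n);   /* { d0,  c0,  d1,  c1} */
    cvec4 xis = vmul(xi, sign, n);      /* {-b0,  b0, -b1,  b1} */
    cvec4 p   = vmul(xr, y, n);         /* {a0c0, a0d0, a1c1, a1d1} */
    return vfma(xis, ys, p, n);         /* {ac - bd, ad + bc} per pair */
}
```

In this formulation, three reorganization operations accompany three arithmetic operations for each pair of complex products, which indicates how much of the instruction stream is consumed by moving data between lanes rather than by arithmetic.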