1. Field of the Invention
The present application relates generally to an improved data processing apparatus and method and, more specifically, to a mechanism to optimize corner turns for local storage and bandwidth reduction.
2. Background of the Invention
The Cell Broadband Engine (Cell/B.E.) architecture contains a hierarchical memory subsystem consisting of generalized system memory and specialized synergistic processor element (SPE) local storage (LS). Data is transferred between these two memory domains via direct memory access (DMA) operations serviced by the SPE's memory flow controller (MFC). When block matrix multiplication is performed on the Cell/B.E., the SPEs use double buffering techniques to hide the latency of the data transfers.
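By way of illustration only, the double buffering pattern mentioned above may be sketched in plain C as follows. The sketch is not Cell/B.E. code: the names fetch_chunk, sum_double_buffered, and CHUNK are hypothetical, and a synchronous memcpy stands in for an asynchronous DMA transfer serviced by the MFC. It therefore shows only how two local buffers alternate roles, not an actual overlap of transfer and computation.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define CHUNK 4 /* elements per transfer; a hypothetical size */

/* Stand-in for an asynchronous DMA "get" into local storage.
 * On real hardware the request would return immediately and
 * complete in the background; here it is a synchronous copy. */
static void fetch_chunk(double *local, const double *remote, size_t count)
{
    memcpy(local, remote, count * sizeof *local);
}

/* Sum n doubles held in "system memory" using two local buffers. */
static double sum_double_buffered(const double *remote, size_t n)
{
    double buf[2][CHUNK];
    size_t nchunks = (n + CHUNK - 1) / CHUNK;
    double total = 0.0;
    int cur = 0;

    assert(remote != NULL || n == 0);
    if (nchunks == 0)
        return 0.0;

    /* Prime the pipeline: fetch the first chunk. */
    fetch_chunk(buf[cur], remote, n < CHUNK ? n : CHUNK);

    for (size_t c = 0; c < nchunks; ++c) {
        size_t off = c * CHUNK;
        size_t count = (off + CHUNK <= n) ? CHUNK : n - off;
        int next = cur ^ 1;

        /* Request chunk c+1 into the other buffer before
         * computing on chunk c. */
        if (c + 1 < nchunks) {
            size_t noff = off + CHUNK;
            size_t ncount = (noff + CHUNK <= n) ? CHUNK : n - noff;
            fetch_chunk(buf[next], remote + noff, ncount);
        }

        /* With real DMA, a wait for chunk c's transfer would go here. */
        for (size_t i = 0; i < count; ++i)
            total += buf[cur][i];

        cur = next; /* swap buffer roles */
    }
    return total;
}
```

With a truly asynchronous transfer, the request for chunk c+1 would proceed while the loop computes on chunk c, which is how the transfer latency is hidden.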
In the mathematical discipline of matrix theory, a block matrix or a partitioned matrix is a partition of a matrix into smaller rectangular matrices called blocks. Looking at it another way, the matrix is written in terms of smaller matrices that are horizontally and vertically adjacent. A block matrix must conform to a consistent way of splitting up the rows and the columns. The partition is into the rectangles described by one group of adjacent rows crossing one group of adjacent columns. In other words, the matrix is split up by some horizontal and vertical lines that go all the way across.
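As a minimal sketch of such a partition, the following C fragment addresses one block of a row-major matrix that is cut by horizontal lines every br rows and vertical lines every bc columns. The uniform spacing, the row-major layout, and the names block_view, get_block, and block_at are assumptions made for illustration; a block matrix in general may use unevenly spaced cut lines.

```c
#include <assert.h>
#include <stddef.h>

/* A view of one rectangular block inside a row-major matrix. */
typedef struct {
    const double *data; /* top-left element of the block */
    size_t rows;        /* rows in this block */
    size_t cols;        /* columns in this block */
    size_t stride;      /* elements between consecutive parent rows */
} block_view;

/* Return block (bi, bj) of an m x n row-major matrix partitioned
 * every br rows and every bc columns. Blocks on the bottom or
 * right edge may be smaller than br x bc. */
static block_view get_block(const double *a, size_t m, size_t n,
                            size_t br, size_t bc, size_t bi, size_t bj)
{
    size_t r0 = bi * br; /* first row of the block */
    size_t c0 = bj * bc; /* first column of the block */
    block_view v;

    assert(br > 0 && bc > 0 && r0 < m && c0 < n);
    v.data = a + r0 * n + c0;
    v.rows = (r0 + br <= m) ? br : m - r0;
    v.cols = (c0 + bc <= n) ? bc : n - c0;
    v.stride = n;
    return v;
}

/* Element (i, j) of a block, using block-local indices. */
static double block_at(block_view v, size_t i, size_t j)
{
    assert(i < v.rows && j < v.cols);
    return v.data[i * v.stride + j];
}
```

Because every block in a given block row shares the same group of adjacent rows, and every block in a given block column shares the same group of adjacent columns, the cut lines run all the way across the matrix as described above.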
The general matrix multiply (GEMM) is a subroutine in the basic linear algebra subprograms (BLAS) that performs matrix multiplication, that is, the multiplication of two matrices. Double precision is a computer numbering format that occupies two adjacent storage locations in computer memory. A double precision number, sometimes simply called a double, may be defined to be an integer, fixed point, or floating point. For example, computers with 32-bit stores (single precision) may provide 64-bit double precision. A double precision general matrix multiply (DGEMM) is often tuned by high performance computing (HPC) vendors to run as fast as possible, because it is the building block for so many other routines. It is also the most important routine in the LINPACK benchmark. For this reason, implementations of a fast BLAS library may focus first on DGEMM performance.
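For illustration, a blocked double precision matrix multiply may be sketched in C as follows. This is a simplified sketch and not the BLAS DGEMM interface: it computes only C = C + A*B on row-major arrays and omits the scaling factors and transpose options of the BLAS routine. The name dgemm_blocked and the tile size parameter bs are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

static size_t min_sz(size_t a, size_t b) { return a < b ? a : b; }

/* C (m x n) += A (m x k) * B (k x n), all row-major.
 * The three outer loops walk bs x bs tiles; the three inner
 * loops multiply one tile of A by one tile of B. */
static void dgemm_blocked(size_t m, size_t n, size_t k,
                          const double *a, const double *b,
                          double *c, size_t bs)
{
    assert(bs > 0);
    for (size_t i0 = 0; i0 < m; i0 += bs) {
        size_t i1 = min_sz(i0 + bs, m);
        for (size_t j0 = 0; j0 < n; j0 += bs) {
            size_t j1 = min_sz(j0 + bs, n);
            for (size_t p0 = 0; p0 < k; p0 += bs) {
                size_t p1 = min_sz(p0 + bs, k);
                for (size_t i = i0; i < i1; ++i) {
                    for (size_t p = p0; p < p1; ++p) {
                        double aip = a[i * k + p];
                        for (size_t j = j0; j < j1; ++j)
                            c[i * n + j] += aip * b[p * n + j];
                    }
                }
            }
        }
    }
}
```

Working tile by tile keeps the operands of the inner loops small, which is the property that makes a blocked formulation suitable for a limited local storage.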