1. Field of the Invention
The present invention generally relates to improving efficiency of in-place data transformations such as a matrix transposition. More specifically, part of the data to be transformed is pre-arranged, if necessary, to first be contiguously arranged in memory as contiguous blocks of contiguous data, which data is then available to be retrieved from memory into cache in units of the blocks of contiguous data, for application of a transformation on the data such as a matrix transposition, and then replaced in memory. The part of the data not transformed is saved in a buffer and later placed out-of-place back into holes of the transformed data.
2. Description of the Related Art
As an example of the type of data transformations that the present invention can make more efficient, there are in-place algorithms for matrix transposition that work on individual matrix elements. Because the individual matrix elements must be referenced in an essentially random order, these codes run very slowly for an M by N matrix when M and N are large. U.S. Pat. No. 7,031,994 to Lao et al. partially addresses this problem.
However, as explained in more detail below, the results of Lao have quite limited scope. In many instances where the technique works, a fair amount of extra storage is used. Lao et al. assume that the underlying permutation is known but give no indication of how they find this structure or of the amount of extra storage required.
Thus, as one possible application demonstrating the method of the present invention, there exists a need for a more efficient method of in-place matrix transposition, one that has generic capability regardless of the size and shape of the given input matrix.
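The element-by-element approach described above can be illustrated with a minimal sketch of generic cycle-following transposition. This is neither the method of Lao nor the method of the present invention. It assumes a column-major M by N matrix held in a flat Python list, and the function name `transpose_in_place` and the `visited` flag list are hypothetical. The flag list is itself extra storage, used here only to keep the sketch short.

```python
def transpose_in_place(a, m, n):
    """Transpose an m-by-n column-major matrix stored in the flat list a.

    Illustrative assumption: the element at flat index k moves to
    (k * n) mod (m*n - 1); the first and last elements never move.
    """
    size = m * n
    last = size - 1
    # One flag per element: extra storage, used only for simplicity.
    visited = [False] * size
    for start in range(1, last):
        if visited[start]:
            continue
        # Follow one cycle of the permutation, moving a single word per step.
        k = start
        value = a[start]
        while True:
            dest = (k * n) % last
            a[dest], value = value, a[dest]
            visited[dest] = True
            k = dest
            if k == start:
                break
    return a


# 2-by-3 matrix [[1, 2, 3], [4, 5, 6]] in column-major order.
example = transpose_in_place([1, 4, 2, 5, 3, 6], 2, 3)
```

Each step of the inner loop jumps to an index that is generally far from the previous one, which is the essentially random reference pattern noted above.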
However, transposition of matrix data is only one possible application of the method of the present invention. The method is directed to the more generic problem of performing an in-place transformation of a matrix represented in one of the two standard matrix formats into the same matrix data in another matrix format, such as new data structures (NDS), in which the transformation can be applied in a fast manner.
An example is the matrix transposition discussed in Lao et al. Several other examples will now be given. NDS for the two standard formats of matrices, full and packed, are given in “High-performance linear algebra algorithms using new generalized data structures” by Gustavson. For full format, the square block (SB) format of order NB is defined as an example of an NDS. In the present invention, we generalize NDS to include rectangular block (RB) format of size MB by NB.
Thus, returning to examples that could utilize the method of the present invention, another example is to transpose an RB matrix of size M by N in-place. Another example is to transform a lower SBPF matrix to an upper SBPF matrix. A third example is to transform a packed matrix to an RFP matrix, as described in the second above-identified co-pending application. All three examples here admit inverse transformations, so there are really six examples.
In the context of the present invention, the term “transformation” means that N data points map into each other as a permutation of the originally-stored N data points. The term “in-place” means that the permutation of N data points is returned to the same memory location as that used for the original data. A typical transformation will be accomplished by using a CPU of the computer to perform the transformation mapping algorithm.
The term NDS, standing for new data structures, is a term describing novel ways of storing the elements of full and packed matrices in a computer. The idea behind NDS is to represent the storage layout of a matrix as a collection of contiguous sub-matrices each of size MB by NB. MB and NB are chosen so that MB*NB is about the size of a cache.
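The storage layout just described can be sketched as follows. The sketch is out-of-place and for illustration only, whereas the present invention concerns in-place transformation. It assumes that MB divides M and NB divides N, that the blocks are ordered column by column, and that each block is itself stored in column-major order. The function name `to_rb_format` is hypothetical.

```python
def to_rb_format(a, m, n, mb, nb):
    """Copy an m-by-n column-major matrix into contiguous mb-by-nb blocks.

    Illustrative assumptions: mb divides m, nb divides n, blocks are
    ordered column by column, and each block is stored column-major.
    """
    assert m % mb == 0 and n % nb == 0
    out = []
    for bj in range(n // nb):          # block column
        for bi in range(m // mb):      # block row
            for j in range(bj * nb, (bj + 1) * nb):
                for i in range(bi * mb, (bi + 1) * mb):
                    out.append(a[i + j * m])
    return out


# 4-by-4 matrix whose column-major storage is 0..15, split into 2-by-2 blocks.
blocks = to_rb_format(list(range(16)), 4, 4, 2, 2)
```

In the result, every run of MB*NB consecutive entries holds one complete sub-matrix, so a whole block can be brought into cache as contiguous data.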
The present invention addresses the problem noted by the inventors that a data transformation, including an in-place data transformation, involves a mapping between the original data stored in memory and the relocated data as desired by the transformation. The problem more specifically addressed is the inefficiency that results from conventional transformation methods applied to matrices in standard formats, wherein data to be moved must be moved as single words, randomly, to other single words in the memory space. This happens almost all the time when the data in matrices is represented in standard matrix formats. When data is retrieved from memory and moved to a cache for purposes of the transformation processing, the desired data word is located in memory and retrieved as one data word of a chunk of memory words called a memory line of size LS, which will, in general, contain LS−1 other words that are not applicable to the processing or at least not immediately needed for the processing. A typical value of LS is 128 bytes or 16 double words.
Therefore, in recognizing the random nature of data movement governed by the matrix transformations for matrices stored in standard formats, the present inventors recognized that data transformations of single words of data can be quite inefficient, because each line retrieved for executing the transformation algorithm by the CPU will, in general, contain only one of LS words that will be consumed by the transformation algorithm.
Thus, a need exists to improve the efficiency of data transformations by ensuring that all data currently needed for the execution of the transformation algorithm is retrieved from memory as contiguous lines of size LS, thereby precluding the needless handling of data that is neither currently needed nor wanted.
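The one-in-LS figure can be checked with a small counting sketch. It assumes LS is 16 words (reading the typical value above as 128 bytes of 8-byte words), that the matrix starts on a line boundary, and that each distinct line touched is fetched once. The helper name `lines_touched` and the 1024 by 1024 matrix size are hypothetical.

```python
LS = 16  # assumed words per memory line (128 bytes of 8-byte words)


def lines_touched(indices, ls=LS):
    """Count the distinct memory lines covering the given word indices."""
    return len({k // ls for k in indices})


m, n = 1024, 1024  # hypothetical column-major matrix dimensions

# Walking along row 0 of a column-major matrix: a stride of m words.
row_walk = [j * m for j in range(n)]
# Reading the same number of words from one contiguous run.
contiguous = list(range(n))

strided_lines = lines_touched(row_walk)
contiguous_lines = lines_touched(contiguous)

# Fraction of the fetched words that the algorithm actually consumes.
strided_use = len(row_walk) / (strided_lines * LS)
contiguous_use = len(contiguous) / (contiguous_lines * LS)
```

Under these assumptions the strided walk touches 1024 lines and uses 1/16 of the words fetched, while the contiguous read touches 64 lines and uses every word.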