The processing of two-dimensional sets of data is growing in importance as the use of computers continues to grow. Two-dimensional sets of data are frequently used to represent, e.g., images.
In the digital processing of two-dimensional signals, e.g., data sets, it is possible, for example when performing some two-dimensional filtering such as a low pass filtering operation or some two-dimensional transformation such as an inverse discrete cosine transform (IDCT) operation, to treat a two-dimensional operation as a series of two, one-dimensional operations. This is possible due to a mathematical property called separability. This separability property allows a complex two-dimensional process to be implemented as a series of two, one-dimensional processes.
Sequential one-dimensional processes tend to be far less complicated to implement than a corresponding two-dimensional process. For this reason, the property of separability is frequently used to implement two-dimensional data processing operations. In implementing a two-dimensional operation as two one-dimensional operations, the one-dimensional operations are applied sequentially in the horizontal and vertical directions of the data being processed. This is illustrated in FIG. 1 where the two-dimensional operation HV is implemented as two sequential processing operations H, V on the data set A 100 to produce the two-dimensional data set HV(A) 104. The intermediate data set H(A) 102 is produced as the result of the application of the horizontal function H to the data set A 100.
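The separability property described above can be sketched in a few lines of code. The following is a minimal illustration only, assuming a hypothetical three-tap low pass filter with taps [1, 2, 1] and zero padding at the array edges; the names apply_h, apply_v, and apply_hv_direct are illustrative and do not appear in the figures. The sketch applies the horizontal function H, then the vertical function V, and compares the result with a direct two-dimensional evaluation.

```python
def apply_h(a, taps):
    """Horizontal pass: filter each row, treating samples outside the array as zero."""
    half = len(taps) // 2
    width = len(a[0])
    return [[sum(t * row[x + k - half]
                 for k, t in enumerate(taps)
                 if 0 <= x + k - half < width)
             for x in range(width)]
            for row in a]


def apply_v(a, taps):
    """Vertical pass: filter each column, treating samples outside the array as zero."""
    half = len(taps) // 2
    height, width = len(a), len(a[0])
    return [[sum(t * a[y + k - half][x]
                 for k, t in enumerate(taps)
                 if 0 <= y + k - half < height)
             for x in range(width)]
            for y in range(height)]


def apply_hv_direct(a, taps):
    """Direct two-dimensional filter whose kernel is the outer product of taps with itself."""
    half = len(taps) // 2
    height, width = len(a), len(a[0])
    out = []
    for y in range(height):
        row = []
        for x in range(width):
            acc = 0
            for i, tv in enumerate(taps):
                for j, th in enumerate(taps):
                    yy, xx = y + i - half, x + j - half
                    if 0 <= yy < height and 0 <= xx < width:
                        acc += tv * th * a[yy][xx]
            row.append(acc)
        out.append(row)
    return out


# Assumed example data and taps (not taken from the figures).
A = [[1, 2, 3],
     [4, 5, 6],
     [7, 8, 9]]
TAPS = [1, 2, 1]

H_A = apply_h(A, TAPS)      # intermediate data set H(A)
HV_A = apply_v(H_A, TAPS)   # final data set HV(A)
```

In this sketch each one-dimensional pass needs a single loop over the filter taps, whereas the direct form needs a nested loop over the full two-dimensional kernel; both produce the same output for a separable kernel.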
Suppose that data words, each represented by a separate box, are arranged in a memory in "raster-scan" order as illustrated in FIG. 2. In such an arrangement, data words beginning at the top left of a two-dimensional data array 200, proceeding to the right along each row and then down to the bottom right data element, are stored at sequential locations in memory as illustrated by the row of blocks 202 representing sequential memory locations. In processing the two-dimensional data in the horizontal direction, the arrangement of the samples in the one-dimensional structure is convenient because each data sample immediately follows the previous one in memory. In order to process the data in the vertical direction, it is clear from the first two shaded squares in FIG. 2 that access to the data is not as straightforward because there is a jump between the consecutive samples as represented by the arrow 203.
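The difference between horizontal and vertical access in a raster-scan arrangement can be made concrete with a short sketch. The array dimensions (4 words wide, 3 rows high) and the helper name address are assumptions chosen for illustration, not values taken from FIG. 2.

```python
WIDTH, HEIGHT = 4, 3  # assumed dimensions, for illustration only

# Two-dimensional array whose entries are simply numbered in raster-scan order.
array_2d = [[r * WIDTH + c for c in range(WIDTH)] for r in range(HEIGHT)]

# Raster-scan storage: rows are laid out one after another in linear memory.
memory = [word for row in array_2d for word in row]


def address(row, col, width=WIDTH):
    """Linear memory location of the word at (row, col) in raster-scan order."""
    return row * width + col


# Horizontal neighbours sit at consecutive addresses.
row_addresses = [address(1, c) for c in range(WIDTH)]

# Vertical neighbours are separated by a jump of WIDTH locations.
col_addresses = [address(r, 0) for r in range(HEIGHT)]

# Reading a column therefore means stepping a pointer by WIDTH each time.
column_0 = [memory[addr] for addr in col_addresses]
```

Here row_addresses advances by one location per sample, while col_addresses advances by WIDTH locations per sample, which corresponds to the jump between consecutive vertical samples described above.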
One known method of solving the problem of accessing the vertical columns of data for performing the vertical processing operation is to store the results from the horizontal processing operation in transposed order. This is shown in FIG. 3 wherein the shaded blocks representing a vertical column of data are now arranged horizontally.
As a result of the mathematical transpose, accessing the vertical information is simple. At the end of the processing for the vertical direction, the transpose of the resulting data must normally be performed to restore the arrangement to the natural order for use in subsequent operations, e.g., the generation of video images for display.
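The transposed-storage technique can be sketched as follows. This is an illustration under stated assumptions: a running sum stands in for the one-dimensional operation, and the function names transpose, running_sum, and process_rows are hypothetical. The horizontal results are stored in transposed order, the vertical pass then reads contiguous rows, and a final transpose restores the natural order.

```python
def transpose(a):
    """Return the transpose of a list-of-rows array."""
    return [list(col) for col in zip(*a)]


def running_sum(samples):
    """Stand-in one-dimensional operation (assumed for illustration)."""
    out, total = [], 0
    for s in samples:
        total += s
        out.append(total)
    return out


def process_rows(a, op):
    """Apply a one-dimensional operation to each row, i.e. to contiguous samples."""
    return [op(row) for row in a]


A = [[1, 2, 3],
     [4, 5, 6]]

# Horizontal pass, with the result stored in transposed order.
h_a_t = transpose(process_rows(A, running_sum))

# Vertical pass: columns of H(A) are now rows, so access is contiguous.
hv_a_t = process_rows(h_a_t, running_sum)

# Final transpose restores the natural ordering for subsequent operations.
hv_a = transpose(hv_a_t)
```

The extra transpose at the end is the cost of this approach that the pointer-based addressing method avoids.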
Another method of accessing data to perform sequential horizontal and vertical data processing operations involves addressing the data that is stored in memory using a pointer that jumps to the next desired data sample. This method has the advantage, as compared to the transpose technique discussed above, that it does not require that the data undergo an additional transposition step in order to restore the natural data ordering for use in subsequent operations.
In high-performance implementations of digital signal processing algorithms, which may include various real time image processing applications, it is good practice to keep data that is being processed in hardware registers close to the main computational unit in order to minimize processing delays due to data transfer operations. The computational unit may be, e.g., a programmable signal processing core or some fixed function hardware. As a result of the "closeness" of the data registers to the computational unit, the computational unit can operate directly on the registers.
In cases where the data is not located in registers coupled closely to the computational unit, the data has to be fetched from cache or other memory and this results in reduced system performance. By keeping data which is frequently used in data registers which are directly accessible to a computational unit, a high level of computational speed can be maintained throughout the lifetime of a computation without having the computational unit stall due to data being in lower speed storage such as a cache or main memory.
Single-Instruction Multiple Data (SIMD) architecture systems allow multiple data elements to be processed simultaneously in response to a single instruction. The multiple data elements may be stored in a single register. Well-designed SIMD architectures can provide considerable performance advantages over more traditional Single-Instruction Single Data (SISD) architecture systems because of the simultaneous processing of multiple pieces of data made possible by the SIMD architecture. MMX technology from Intel Corporation currently in use in computer CPUs is one example of a SIMD architecture.
Unfortunately, the above-described techniques of performing sequential horizontal and vertical processing operations are not straightforward when the data is stored in registers in a format that is used by SIMD architectures. In such a situation, the manipulations that are required to obtain the desired data arrangement are relatively difficult to implement.
Consider, for example, a SIMD architecture that operates on two data samples at the same time. In such a SIMD architecture the data samples have to be presented to the processing unit in the arrangement shown in the diagram of FIG. 4A. Here, one word 400 that is n-bits in length contains two sub-words 402, 404, each n/2-bits in length. Even though one n-bit word 400 is presented to the processor, there are actually two pieces of data, sub-words b, a, 402, 404, that are embedded in that word 400. When presented to the SIMD processing unit, each of these halves is handled separately. This is one of the primary features of SIMD processing.
As an example of a SIMD processing operation, suppose that it is desired to add two sets of numbers, {a, b} and {c, d} to produce {a+c} and {b+d}. In the SIMD architecture, it is possible to set up two data elements 406, 408 similar to the one shown in FIG. 4A. One of these 406 would contain the set {a, b} and the other 408 would contain the set {c, d}. They may be presented to the SIMD processing unit for the desired addition. The processing unit treats the two halves of the input data words as independent quantities during the computation. An important consequence of this is that if the addition for the lower half overflows, the overflow will not affect the upper half. It can be seen from this example that the SIMD architecture is extremely beneficial for processing multiple pieces of data in parallel.
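The independence of the two halves during a SIMD addition can be modeled in software. The sketch below is illustrative only and makes several assumptions not stated in the text: n is taken to be 32 bits, sub-word a is placed in the lower half and b in the upper half, and an overflowing half wraps around rather than saturating. The names pack, unpack, and simd_add are hypothetical.

```python
HALF_BITS = 16                      # assumed n/2, i.e. n = 32
HALF_MASK = (1 << HALF_BITS) - 1
WORD_MASK = (1 << (2 * HALF_BITS)) - 1


def pack(low, high):
    """Pack two sub-words into one word; 'low' occupies the lower half."""
    return ((high & HALF_MASK) << HALF_BITS) | (low & HALF_MASK)


def unpack(word):
    """Return the (low, high) sub-words of a packed word."""
    return word & HALF_MASK, (word >> HALF_BITS) & HALF_MASK


def simd_add(x, y):
    """Add the halves independently; a carry out of the low half is discarded."""
    x_lo, x_hi = unpack(x)
    y_lo, y_hi = unpack(y)
    return pack(x_lo + y_lo, x_hi + y_hi)  # pack() masks each half (wraparound)


# {a, b} = {0xFFFF, 1} and {c, d} = {1, 2}; a + c overflows the lower half.
word_ab = pack(0xFFFF, 1)
word_cd = pack(1, 2)

packed_sum = simd_add(word_ab, word_cd)       # halves kept independent
plain_sum = (word_ab + word_cd) & WORD_MASK   # ordinary full-word addition
```

With these values, the upper half of packed_sum holds b + d = 3, whereas the ordinary full-word addition lets the carry from the lower half leak into the upper half, yielding 4 there.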
The inventors of the present application have discovered that various problems are encountered when one attempts to implement two-dimensional signal processing algorithms on SIMD architecture using local registers to provide high-performance signal processing implementations. For example, when processing two-dimensional signals, the SIMD architecture poses the following problem when data is to be transposed. Suppose that it is desired to obtain the transpose of the matrix: ##EQU1##
where the data is arranged in registers 0 and 1 as shown in FIG. 5. Note that the little-endian data scheme is used for the examples in this application; however, this is simply for purposes of explanation of the invention and in no way limits the present invention to use only with little-endian data schemes. The transposed matrix will have the arrangement shown in FIG. 6.
Unfortunately, when two items of data, e.g., sub-words a and b, are packed into a conventional long register, the individual elements cannot be accessed efficiently. That is, direct data access is limited to the full word (ba) and not one of the sub-words (b) or (a). This register access limitation, which exists in conventional registers, makes it relatively difficult to transform the data arrangement of FIG. 5 into the transposed arrangement of FIG. 6, because the individual data sub-words of a conventional register cannot be accessed directly.
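To illustrate why the lack of direct sub-word access makes transposition awkward, the following sketch transposes a two-by-two array held in two packed registers using only full-word mask, shift, and OR operations. It is a software model under assumptions that are not taken from FIG. 5 or FIG. 6: 16-bit sub-words, each register holding one row, and the first element of a row in the lower half. The names pack, unpack, and transpose_2x2 are hypothetical.

```python
HALF_BITS = 16                        # assumed sub-word width
LOW_MASK = (1 << HALF_BITS) - 1
HIGH_MASK = LOW_MASK << HALF_BITS


def pack(low, high):
    """Pack two sub-words into one register word; 'low' occupies the lower half."""
    return ((high & LOW_MASK) << HALF_BITS) | (low & LOW_MASK)


def unpack(word):
    """Return the (low, high) sub-words of a register word."""
    return word & LOW_MASK, (word >> HALF_BITS) & LOW_MASK


def transpose_2x2(reg0, reg1):
    """Transpose using only full-word operations on the two registers."""
    # New row 0: lower half of reg0 stays, lower half of reg1 moves up.
    new_reg0 = (reg0 & LOW_MASK) | ((reg1 & LOW_MASK) << HALF_BITS)
    # New row 1: upper half of reg0 moves down, upper half of reg1 stays.
    new_reg1 = ((reg0 & HIGH_MASK) >> HALF_BITS) | (reg1 & HIGH_MASK)
    return new_reg0, new_reg1


# Assumed example values: rows (1, 2) and (3, 4).
reg0 = pack(1, 2)
reg1 = pack(3, 4)
t0, t1 = transpose_2x2(reg0, reg1)
```

Even for this smallest case, each output register requires several masking and shifting steps along with intermediate values, which hints at the cycle cost of a purely software transpose on conventional registers.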
Various known approaches to transposing data stored in registers include the use of software or the use of special transposition hardware. Software has the advantage of being flexible in that minor modifications to a program can allow it to transpose arrays of different shapes and sizes. Unfortunately, software approaches have the major disadvantage of being relatively slow because of the relatively large number of clock cycles required and the need to transfer and store the contents of the registers in, e.g., memory, while the register contents are being processed according to the software instructions.
Known special transposition hardware also suffers several disadvantages. These include the need to use sequential logic, e.g., logic which includes buffers or delay elements, or logic which is limited in terms of the size and/or shape of an array which can be transposed. The use of sequential logic introduces undesirable time delays while constraints on the size and shape of arrays which can be transposed limit the utility of special transposition hardware to specific applications.
In view of the above discussion, it becomes apparent that there is a need for new and improved methods and apparatus for accessing and transposing two-dimensional sets of data stored in hardware registers. It is desirable that such improved methods and apparatus be compatible with SIMD architectures and the data access requirements of such architectures. In particular, it is desirable that any new methods or apparatus allow the contents of a register to be accessed as a single unit or as a plurality of sub-units.
From a performance perspective, it is also desirable that any new methods and apparatus be capable of being implemented without the need for buffering or other temporary storage of register contents which can cause performance delays.
In addition to supporting transpose operations it is desirable that new and improved methods and apparatus for manipulating the contents of registers be capable of supporting data processing operations, other than transpose operations, which may require the manipulation of data in data units which are smaller than the full size of a utilized data register.
New SIMD instructions capable of taking advantage of the processing capabilities of any new methods and apparatus are also desirable.