This invention relates to processor architecture and image processing applications, and more particularly to the register file(s) and instructions of a mediaprocessor.
Different computing applications place different demands on a processor. As a result, various microprocessor architectures have evolved to best handle specific types of applications. As it is not economical to have too many differing architectures, it also is desirable that a processor be able to serve many types of processing needs.
Accordingly, there are at least two common types of architectures for handling complex data applications such as image processing applications. There are single instruction multiple data (SIMD) processor architectures and very long instruction word (VLIW) processor architectures. Mediaprocessors with instructions adapted for complex image processing tasks can utilize both the SIMD architecture and VLIW architecture.
When executing image processing tasks it is common to perform two-dimensional convolutions and two-dimensional transforms, such as wavelet transforms, discrete cosine transforms and fast Fourier transforms. One strategy for reducing the computation requirements of such algorithms is to implement the algorithm in two separable passes of one-dimensional processing (e.g., row-wise processing followed by column-wise processing). For example, the number of multiplications of a direct N×N two-dimensional discrete cosine transform is N⁴, while it is 2N³ when separable one-dimensional discrete cosine transforms are executed.
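The multiplication counts above can be checked with a short calculation. The following is a minimal sketch, assuming a straightforward (non-fast) evaluation in which each output sample of an N-point one-dimensional transform costs N multiplications; the function names are hypothetical and chosen only for illustration.

```python
def direct_2d_multiplications(n):
    """Direct N x N transform: each of the N*N outputs sums over N*N inputs."""
    outputs = n * n
    multiplications_per_output = n * n
    return outputs * multiplications_per_output  # N^4


def separable_2d_multiplications(n):
    """Two passes of N one-dimensional N-point transforms (rows, then columns)."""
    multiplications_per_1d_transform = n * n  # N outputs, N multiplications each
    transforms_per_pass = n                   # one per row (or per column)
    passes = 2
    return passes * transforms_per_pass * multiplications_per_1d_transform  # 2N^3


if __name__ == "__main__":
    n = 8
    print(direct_2d_multiplications(n))     # 4096
    print(separable_2d_multiplications(n))  # 1024
```

For an 8×8 block, the separable form under these assumptions needs one quarter of the multiplications of the direct form.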
Many image and video processing algorithms handle data elements that are small in size. Mediaprocessors take advantage of this property by employing partitioned instructions that simultaneously process multiple data elements packed into one register.
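As an illustration of a partitioned instruction, the following minimal sketch models a partitioned add in software. It assumes a 64-bit register holding eight 8-bit elements, with element 0 in the least significant byte and each lane wrapping modulo 256 on overflow; the lane ordering, the wraparound behavior and the names `pack`, `unpack` and `partitioned_add` are assumptions made for this example, not details taken from any particular mediaprocessor.

```python
LANES = 8        # assumed number of partitions per register
LANE_BITS = 8    # assumed width of each partition in bits
LANE_MASK = (1 << LANE_BITS) - 1


def pack(elements):
    """Pack a list of 8-bit values into one 64-bit word (element 0 in the low byte)."""
    word = 0
    for i, element in enumerate(elements):
        word |= (element & LANE_MASK) << (i * LANE_BITS)
    return word


def unpack(word):
    """Split a 64-bit word back into its eight 8-bit elements."""
    return [(word >> (i * LANE_BITS)) & LANE_MASK for i in range(LANES)]


def partitioned_add(a, b):
    """Add two packed words lane by lane; carries never cross lane boundaries."""
    result = 0
    for i in range(LANES):
        shift = i * LANE_BITS
        lane_sum = ((a >> shift) & LANE_MASK) + ((b >> shift) & LANE_MASK)
        result |= (lane_sum & LANE_MASK) << shift  # assumed wraparound per lane
    return result


if __name__ == "__main__":
    a = pack([1, 2, 3, 4, 5, 6, 7, 250])
    b = pack([10] * 8)
    print(unpack(partitioned_add(a, b)))  # [11, 12, 13, 14, 15, 16, 17, 4]
```

In hardware the eight lane additions occur simultaneously; the loop here only models the result, not the timing.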
Referring to FIG. 1, an exemplary SIMD partitioned-add instruction is implemented as eight 8-bit additions 11-18 executed in parallel using a 64-bit data path. In a discrete cosine transform algorithm, partitioned instructions are used for an array of image data as a first pass process in one dimension. The process then is to be repeated in a second, orthogonal dimension. Referring to FIG. 2, a first data set 20 with rows 'j' and columns 'i' is processed in row-wise fashion to achieve a resulting data set 22. To process the data set 22 along the orthogonal dimension the array is transposed using a series of processing instructions to achieve data set 22′. The end result of the instructions is to reload the data so that the data as stored in the processor registers correspond to columns of the data set 22, rather than to rows. The data is processed for the second dimension to achieve data set 24. The result then is transposed again to achieve data set 24′. The initial data set 20 now has been processed in row-wise and column-wise fashion to perform the two-dimensional processing. For a typical RISC architecture it takes approximately 112 instructions (56 loads and 56 stores) to transpose an 8×8 eight-bit data array. The MAP1000 mediaprocessor includes instructions allowing an 8×8 eight-bit block to be transposed in 40 instructions using one cluster. The transposition of an array occurs often and consumes many clock cycles. Accordingly, an efficient manner of transposing an array of data is needed.
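The sequence of row-wise pass, transposition, second row-wise pass and final transposition can be sketched as follows. This is a minimal illustration only: a running sum stands in for the one-dimensional transform (an assumption made so that the result is easy to verify by hand), and the function names are hypothetical. The variable names follow the reference numerals of FIG. 2.

```python
def transpose(block):
    """Swap rows and columns of a square block stored as a list of rows."""
    return [list(column) for column in zip(*block)]


def process_rows(block, row_op):
    """Apply a one-dimensional operation to every row of the block."""
    return [row_op(row) for row in block]


def running_sum(row):
    """Stand-in 1-D operation (assumed for illustration, not a DCT)."""
    total = 0
    out = []
    for value in row:
        total += value
        out.append(total)
    return out


def two_pass(data_set_20, row_op):
    """Row pass, transpose, row pass on the former columns, transpose back."""
    data_set_22 = process_rows(data_set_20, row_op)         # first dimension
    data_set_22_t = transpose(data_set_22)                  # columns become rows
    data_set_24 = process_rows(data_set_22_t, row_op)       # second dimension
    return transpose(data_set_24)                           # restore orientation


if __name__ == "__main__":
    print(two_pass([[1, 2], [3, 4]], running_sum))  # [[1, 3], [4, 10]]
```

The two calls to `transpose` correspond to the instruction sequences whose cost motivates a more efficient transposition mechanism.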
According to the invention, a processor includes a register file which allows normal row-wise access to data stored in registers of the register file and also allows a transposed column-wise access to data stored in a column among registers of the register file.
According to one aspect of the invention, a row of data stored in a register of the register file forms n partitions of m bits of data per partition (i.e., a given register stores n * m bits). In transposed access mode, an n * m bit data operand is accessed in a given partition of each of n registers. Thus, one register stores a first partition of the data operand. An adjacent register stores the second partition of the data operand, and so forth for each of the n partitions of the operand.
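The two access modes can be modeled in software as follows. This is a minimal sketch, assuming that partition 0 occupies the least significant m bits of a register and that register r supplies partition r of a transposed operand; the class name, method names and bit ordering are assumptions for illustration and do not describe the hardware implementation.

```python
class TransposableRegisterFile:
    """Hypothetical model: n registers, each holding n partitions of m bits."""

    def __init__(self, n=8, m=8):
        self.n = n
        self.m = m
        self.lane_mask = (1 << m) - 1
        self.regs = [0] * n

    def write_row(self, r, word):
        """Normal access: store an n*m-bit word in register r."""
        self.regs[r] = word & ((1 << (self.n * self.m)) - 1)

    def read_row(self, r):
        """Normal access: read the n*m-bit word held in register r."""
        return self.regs[r]

    def read_transposed(self, p):
        """Transposed access: gather partition p from each of the n registers."""
        word = 0
        for r in range(self.n):
            lane = (self.regs[r] >> (p * self.m)) & self.lane_mask
            word |= lane << (r * self.m)  # register r supplies partition r
        return word


def pack(elements, m=8):
    """Pack elements into one word, element 0 in the least significant bits."""
    word = 0
    for i, element in enumerate(elements):
        word |= (element & ((1 << m) - 1)) << (i * m)
    return word


def unpack(word, n=8, m=8):
    """Split a word into n elements of m bits each."""
    return [(word >> (i * m)) & ((1 << m) - 1) for i in range(n)]


if __name__ == "__main__":
    rf = TransposableRegisterFile(n=4, m=8)
    for r in range(4):
        rf.write_row(r, pack([10 * r + c for c in range(4)]))
    print(unpack(rf.read_row(2), n=4))         # [20, 21, 22, 23]
    print(unpack(rf.read_transposed(1), n=4))  # [1, 11, 21, 31]
```

Reading every partition index in turn with `read_transposed` yields the columns of the stored block, so no separate transposition instructions are needed in this model.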
According to another aspect of the invention, a queue-based transposable register file is implemented. The transposable register file includes a head pointer and a tail pointer and has a virtual register. Data written into the virtual register is written into one of the registers of the transposable register file as selected by the head pointer. Data read from the virtual register is read from one of the registers of the transposable register file as selected by the tail pointer.
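The pointer-based selection of the virtual register can be sketched as follows. This minimal model assumes that each pointer advances by one register after an access and wraps around at the end of the register file; the advancement policy, the class name and the method names are assumptions for illustration. The sketch covers only the head and tail pointer selection, not the transposed access itself.

```python
class QueueRegisterFile:
    """Hypothetical model of a register file accessed through one virtual register."""

    def __init__(self, size=8):
        self.regs = [0] * size
        self.head = 0  # selects the register that receives the next write
        self.tail = 0  # selects the register that supplies the next read

    def write_virtual(self, value):
        """A write to the virtual register lands in the register at the head pointer."""
        self.regs[self.head] = value
        self.head = (self.head + 1) % len(self.regs)  # assumed: advance and wrap

    def read_virtual(self):
        """A read of the virtual register comes from the register at the tail pointer."""
        value = self.regs[self.tail]
        self.tail = (self.tail + 1) % len(self.regs)  # assumed: advance and wrap
        return value


if __name__ == "__main__":
    qrf = QueueRegisterFile(size=4)
    for value in (10, 20, 30):
        qrf.write_virtual(value)
    print(qrf.read_virtual())  # 10
    print(qrf.read_virtual())  # 20
```

Under these assumptions, software addresses a single register name while the pointers determine which physical register is actually accessed.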