1. Field of the Invention
The present invention relates to permuting subwords packed in registers in which the subwords can be re-arranged within a register and between registers for achieving parallelism in subsequent processing, such as two-dimensional multimedia processing.
2. Description of the Related Art
Efficient processing of multimedia information like images, video and graphics breaks both the sequential processing paradigm and the linear data processing paradigm inherent in the design of computers. Computers have been conventionally designed primarily to process linear sequences of data: memory is addressed as a linear sequence of bytes or words, and data is fetched into the programmable processor and processed sequentially. Efficient processing of pixel-oriented visual material is inherently parallel rather than sequential, and two-dimensional rather than linear (1-D).
Multimedia extensions have been added to general-purpose processors to accelerate the processing of different media types; see Ruby Lee, “Subword Parallelism with MAX-2”, IEEE Micro, Vol. 16 No. 4, August 1996, pp. 51–59; IA-64 Application Developer's Architecture Guide, Intel Corporation, Order Number: 245188-001, May 1999. http://developer.intel.com/design/ia64; and AltiVec Extension to PowerPC Instruction Set Architecture Specification. Motorola, Inc., May 1998. http://www.motorola.com/AltiVec. Subword parallelism has been deployed by multimedia instructions in microprocessor architectures and in media processors to accelerate the processing of lower-precision data, like 16-bit audio samples or 8-bit pixel components. SIMD (Single Instruction Multiple Data) parallel processing techniques within a single processor have been referred to as microSIMD architecture; see Ruby Lee, “Efficiency of microSIMD Architectures and Index-Mapped Data for Media Processing”, Proceedings of Media Processors 1999, IS&T/SPIE Symposium on Electronic Imaging: Science and Technology, January 1999, pp. 34–46. A subword-parallel (or microSIMD) instruction performs the same operation in parallel on multiple pairs of subwords packed into two registers, which are conventionally 32 to 128 bits wide in microprocessors and media processors. For example, a 64-bit word-oriented datapath can be partitioned into eight 8-bit subwords, or four 16-bit subwords, or two 32-bit subwords.
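By way of illustration only, the partitioning of a 64-bit word into subwords and a subword-parallel add can be modeled in software as follows. This is a minimal sketch and not an instruction of any particular architecture: the function names `split_subwords`, `pack_subwords` and `parallel_add`, the most-significant-first subword ordering, and the use of modular (wrap-around) addition are assumptions made for the example.

```python
WORD_BITS = 64  # assumed register width for this sketch


def split_subwords(word, subword_bits):
    """Split a 64-bit word into subwords, most significant subword first."""
    count = WORD_BITS // subword_bits
    mask = (1 << subword_bits) - 1
    return [(word >> (subword_bits * (count - 1 - i))) & mask
            for i in range(count)]


def pack_subwords(subwords, subword_bits):
    """Pack a list of subwords (most significant first) into one word."""
    mask = (1 << subword_bits) - 1
    word = 0
    for s in subwords:
        word = (word << subword_bits) | (s & mask)
    return word


def parallel_add(a, b, subword_bits):
    """Add corresponding subwords of a and b; each sum wraps within its
    own subword, so no carry crosses a subword boundary."""
    mask = (1 << subword_bits) - 1
    sums = [(x + y) & mask
            for x, y in zip(split_subwords(a, subword_bits),
                            split_subwords(b, subword_bits))]
    return pack_subwords(sums, subword_bits)
```

For instance, with four 16-bit subwords, adding 0x0001FFFF00030004 and 0x0001000100010001 yields 0x0002000000040005 in this model: the second subword wraps to zero without disturbing its neighbors.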
Conventional shift and rotate instructions have been used to move all the bits in a register by the same amount. Extract and deposit instructions, found in instruction-set architectures like PA-RISC, move one field using one or two instructions, as described in Ruby Lee, “Precision Architecture”, IEEE Computer, Vol. 22, No. 1, January 1989, pp. 78–91. Early subword permutation instructions like mix and permute in the PA-RISC MAX-2 multimedia instructions were a first attempt to find efficient and general-purpose subword permutation primitives, as described in Ruby Lee, “Subword Parallelism with MAX-2”, IEEE Micro, Vol. 16 No. 4, August 1996, pp. 51–59. The subwords in the source register are numbered, and a permute instruction specifies the new ordering desired in terms of this numbering. The mux instruction in IA-64, described in IA-64 Application Developer's Architecture Guide, Intel Corporation, Order Number: 245188-001, May 1999. http://developer.intel.com/design/ia64, and the vperm instruction in AltiVec, described in AltiVec Extension to PowerPC Instruction Set Architecture Specification. Motorola, Inc., May 1998. http://www.motorola.com/AltiVec, are similar. There is a limit to the efficiency of the permute instruction for many subwords, since the control bits quickly exceed the number of bits permuted. Permuting four subwords requires only 8 control bits, which can be encoded in the permute instruction itself. Beyond four elements and up to sixteen elements, any arbitrary permutation can still be performed with one instruction, by providing the control bits for the permutation in a second source register, rather than in the 32-bit instruction. Permuting 32 elements requires 160 bits, and permuting 64 elements requires 384 bits (n*log2(n) bits for n elements, since each of the n result positions needs a log2(n)-bit source index). Hence, permuting more than 16 elements cannot be achieved by a single instruction with two source registers, using this method of specifying permutations.
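The control-bit counts quoted above can be checked with a short calculation. The following is a minimal sketch, not part of any cited instruction set: the function names `control_bits` and `permute` are hypothetical, and the model assumes that each result position is given a full source index, which is the method of specifying permutations described above.

```python
def control_bits(n):
    """Control bits needed to name a source index for each of n result
    positions: n * ceil(log2(n))."""
    index_bits = (n - 1).bit_length()  # bits needed to encode 0 .. n-1
    return n * index_bits


def permute(subwords, order):
    """Software model of a permute: result position i receives the
    source subword whose number is order[i]."""
    return [subwords[i] for i in order]


if __name__ == "__main__":
    for n in (4, 16, 32, 64):
        print(n, "elements:", control_bits(n), "control bits")
```

Under these assumptions, 4 elements need 8 bits, 16 elements need 64 bits, 32 elements need 160 bits and 64 elements need 384 bits. Sixteen elements therefore exactly fill a 64-bit control register (an assumed width for this remark), and anything larger does not fit.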
The problem is further complicated by the fact that image, video or graphics processing requires mapping of two-dimensional objects onto subwords in multiple registers and then permuting these subwords between registers.
U.S. Pat. No. 5,673,321 describes a computer instruction (MIXxx) which selects subword items from two source registers in pre-defined ways, for example: MIXWL (Mix Word Left) concatenates the left half (32 bits) of register R1 with the left half of register R2. MIXWR (Mix Word Right) concatenates the right half of R1 with the right half of R2. MIXHL (Mix Half-word Left) concatenates, in turn, the first half-words of R1 and R2, followed by the third half-words of R1 and R2. MIXHR (Mix Half-word Right) concatenates, in turn, the second half-words of R1 and R2, followed by the fourth half-words of R1 and R2, and the like. The instruction also may contain other fields. For example, the MIXxx instructions described above may be used to transpose a 4×4 matrix of half-words contained in four registers R1, R2, R3, R4, each with 4 half-words. MIXBx selects alternate bytes from two source registers, R1 and R2, in two pre-defined ways: MIXBL alternates the 4 odd bytes of R1 with the 4 odd bytes of R2; MIXBR alternates the 4 even bytes of R1 with the 4 even bytes of R2. The MIXBL instruction may be used, for example, to unpack and pack bytes into and out of the more significant half of corresponding half-words. This instruction may be used to “unpack” a register with 8 bytes into 2 registers of 4 half-words each, with each byte being the more significant byte of each half-word. The MIXBL instruction may also be used to unpack and pack bytes into and out of the less significant half of corresponding half-words.
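The half-word and word mix operations described above, and their use in transposing a 4×4 matrix of half-words, can be modeled in software as follows. This is a sketch under stated assumptions: each register is represented as a list of four half-words with the first (leftmost) half-word at index 0, the lowercase function names are hypothetical stand-ins for the instructions, and the particular eight-operation sequence in `transpose_4x4` is one possible ordering chosen for illustration rather than a sequence taken from the cited patent.

```python
def mixhl(r1, r2):
    """First half-words of r1 and r2, then third half-words of r1 and r2."""
    return [r1[0], r2[0], r1[2], r2[2]]


def mixhr(r1, r2):
    """Second half-words of r1 and r2, then fourth half-words of r1 and r2."""
    return [r1[1], r2[1], r1[3], r2[3]]


def mixwl(r1, r2):
    """Left half (two half-words) of r1 followed by left half of r2."""
    return r1[:2] + r2[:2]


def mixwr(r1, r2):
    """Right half (two half-words) of r1 followed by right half of r2."""
    return r1[2:] + r2[2:]


def transpose_4x4(r1, r2, r3, r4):
    """Transpose a 4x4 matrix whose rows are held in r1..r4,
    using four half-word mixes followed by four word mixes."""
    t1 = mixhl(r1, r2)  # [a0, b0, a2, b2]
    t2 = mixhr(r1, r2)  # [a1, b1, a3, b3]
    t3 = mixhl(r3, r4)  # [c0, d0, c2, d2]
    t4 = mixhr(r3, r4)  # [c1, d1, c3, d3]
    return [mixwl(t1, t3),  # column 0: [a0, b0, c0, d0]
            mixwl(t2, t4),  # column 1: [a1, b1, c1, d1]
            mixwr(t1, t3),  # column 2: [a2, b2, c2, d2]
            mixwr(t2, t4)]  # column 3: [a3, b3, c3, d3]
```

In this model the transpose takes eight mix operations in two stages, and every operation reads exactly two source registers and writes one result register.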
It is desirable to provide efficient subword permutation instructions that can be used for parallel execution, for example, in 2-D multimedia processing.