This invention relates to microprocessor instruction sets, and more particularly to a multimedia instruction set for handling multiple data operands on a mediaprocessor.
A microprocessor executes programmed instructions to perform desired functions. Typically for a given microprocessor model there is a given instruction set. An instruction set is a set of machine instructions that a microprocessor recognizes and can execute. Each instruction in the set is identified by a digital instruction code, operands for specifying an address, special bits used for indexing or another purpose, and occasionally a data item itself.
To meet the demanding computing needs of digital video processing and other multimedia applications, various levels of parallelism in microprocessors have developed. Because imaging algorithms are easily adapted to parallel processing structures, parallel architectural features are becoming available at a reduced cost. Previously, applications requiring high computational performance have been implemented on multiprocessor systems. In such systems a task is broken up into pieces, and the multiple pieces are executed in parallel by multiple processors. Such multiprocessor systems have not gained widespread commercial acceptance because of their high costs. Also, the improved performance of inexpensive general purpose microprocessors and the more recent digital signal processors have provided a less expensive platform for more complex processing tasks. Many new microprocessors and digital signal processors are employing on-chip parallelism, mainly by a technique referred to as instruction level parallelism. Such processors as adapted for multimedia processing (e.g., video processing) are referred to herein as mediaprocessors.
Instruction-level parallelism is where multiple operations are initiated in a single clock cycle. Two approaches to instruction-level parallelism are: the very long instruction word (VLIW) architecture and the superscalar architecture. In a VLIW architecture processor there are many independent functional units. Each long instruction contains an operation code for each functional unit. All functional units receive their operation code at substantially the same time. The functional units execute their assigned tasks concurrently. Superscalar architectures use special on-chip hardware to look through the instruction stream and find independent operations that can be executed at the same time to maximize parallelism.
Instruction-level parallelism is further extended in some systems using subword parallelism, in which an execution unit is partitioned into multiple smaller units. For example, processes implemented by a 64-bit arithmetic logic unit (ALU) in essence split the ALU logically into four smaller 16-bit ALUs. Specifically, the data input to the ALU is a concatenation of four smaller subwords. The ALU output is a concatenation of the results on the four subwords. Such subword parallelism is incorporated into an architecture by providing what are referred to as "single instruction multiple data" (SIMD) instructions. Examples of such an implementation are: Sun Microsystems' visual instruction set, Intel's multimedia extension, Hewlett-Packard's multimedia acceleration extensions, Digital Equipment Corporation's multimedia extensions, and Silicon Graphics, Inc.'s MIPS digital media extension. Instructions among these extensions treat a data word (e.g., 32 bit or 64 bit) as a set of multiple subwords (8, 16 or 32 bits each). Partitioned operations may be executed on each subword, obtaining 2-times, 4-times or 8-times performance improvement with minimal added hardware. Even with these new architectures, however, carefully developed algorithms are needed to achieve a significant performance improvement.
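By way of illustration only, the subword partitioning described above may be sketched in software as follows. The sketch models a 64-bit word as four 16-bit subwords and adds them lane by lane. The function names, the placement of subword 0 in the low-order bits, and the wraparound behavior on overflow are assumptions made for the example and are not taken from any particular instruction set.

```python
MASK16 = 0xFFFF  # one 16-bit subword


def pack16(subwords):
    """Concatenate four 16-bit subwords into one 64-bit word.

    Assumption: subword 0 occupies the low-order 16 bits.
    """
    word = 0
    for i, s in enumerate(subwords):
        word |= (s & MASK16) << (16 * i)
    return word


def unpack16(word):
    """Split a 64-bit word back into its four 16-bit subwords."""
    return [(word >> (16 * i)) & MASK16 for i in range(4)]


def partitioned_add16(a, b):
    """Add the four 16-bit lanes of a and b independently.

    Carries never cross a lane boundary; each lane wraps modulo 2**16
    (an assumption; real hardware may offer saturation instead).
    """
    result = 0
    for i in range(4):
        lane_a = (a >> (16 * i)) & MASK16
        lane_b = (b >> (16 * i)) & MASK16
        result |= ((lane_a + lane_b) & MASK16) << (16 * i)
    return result
```

In this sketch an overflow in one lane, for example 0xFFFF + 1, wraps to 0 within that lane and leaves the neighboring lanes unchanged, which is the property that distinguishes a partitioned add from an ordinary 64-bit add.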
Current SIMD instructions are categorized into the following groups: (i) partitioned arithmetic/logic instructions; (ii) sigma (Σ) instructions; (iii) partitioned select instructions; and (iv) formatting instructions. Partitioned arithmetic/logic instructions include partitioned add, partitioned subtract, partitioned multiply, partitioned compare, partitioned shift, and similar type instructions. For example, in a partitioned addition instruction a data word is partitioned into subwords and each subword is used for respective addition operations. Sigma instructions include inner product, sum of absolute difference, sum of absolute value and similar instructions. These instructions are characterized by the "sum of" a set of operations. The sum of operation is referred to in mathematics with the Greek symbol sigma (Σ). Partitioned select instructions include partitioned min/max, partitioned conditional selection, and similar instructions. Formatting instructions include map, interleave, compress, expand, and similar instructions.
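The distinguishing feature of a sigma instruction, namely that it reduces all subwords of a word to a single sum, may be illustrated by the following sketch. The function names, the treatment of subwords as unsigned values, and the chosen subword widths are assumptions made for the example.

```python
def unpack(word, width, count):
    """Split a word into `count` unsigned subwords of `width` bits each,
    subword 0 taken from the low-order bits (an assumption)."""
    mask = (1 << width) - 1
    return [(word >> (width * i)) & mask for i in range(count)]


def inner_product(a, b, width=16, count=4):
    """Sigma-style inner product: multiply matching subwords, then sum
    the products into a single scalar result."""
    return sum(x * y for x, y in zip(unpack(a, width, count),
                                     unpack(b, width, count)))


def sum_abs_diff(a, b, width=8, count=8):
    """Sigma-style sum of absolute differences across all subwords."""
    return sum(abs(x - y) for x, y in zip(unpack(a, width, count),
                                          unpack(b, width, count)))
```

Unlike a partitioned add, each of these operations combines values from every partition of the word, so the result depends on all subwords at once.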
Partitioned arithmetic/logic instructions, sigma instructions, and partitioned select instructions speed up processing by performing multiple operations concurrently in one direction. Formatting instructions are used mainly for rearranging data to allow parallel-type processing of the data (e.g., in a pipeline). Most SIMD instructions have been developed on 32-bit or 64-bit architectures. Such bit size limits the maximum number of concurrent operations that can be performed. Widening the data path would seem to be one way of increasing the data-level concurrence. However, while the partitioned arithmetic/logic instructions and the partitioned select instructions can be readily extended to a wider architecture, the sigma instructions and some formatting instructions would require more complicated hardware and result in additional pipeline stages when extended for a machine with a wider architecture (than the conventional 32-bit and 64-bit architectures). This is because the sigma and formatting instructions involve operations across multiple data partitions within a word. The hardware complexity for such operations would increase more than linearly as the data path width increases. Accordingly, there is a need for more flexible SIMD instructions which are effective at handling multimedia data for processors having wider architectures.
Another shortcoming of current multimedia instruction offerings is that, typically, the arithmetic precision is not well handled. Specifically, in partitioned add/subtract/multiply instructions, the destination operand word is given the same number of bits as the source operand words. Consequently, the overflow must be handled by scaling down the results, which inevitably introduces some truncation error. This is particularly undesirable, because when these partitioned arithmetic instructions are cascaded, the truncation error accumulates, sometimes leading to an unacceptably large error. Accordingly, there is a need for more effective partitioned arithmetic multimedia instructions.
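The accumulation of truncation error in cascaded operations may be illustrated with a small numerical sketch. It averages a group of 8-bit values by pairwise addition, once with every intermediate sum scaled back into 8 bits and once with a wider intermediate result that is scaled only at the end. The function names and the choice of a one-bit right shift as the scaling step are assumptions made for the example; the sketch illustrates the problem rather than any particular instruction.

```python
def reduce_same_width(values):
    """Pairwise reduction where each sum is halved immediately so that
    it still fits in 8 bits; the low bit is truncated at every stage."""
    level = list(values)
    while len(level) > 1:
        level = [(level[i] + level[i + 1]) >> 1
                 for i in range(0, len(level), 2)]
    return level[0]


def reduce_wide(values):
    """Pairwise reduction with a wider intermediate result; the scaling
    is deferred to a single shift after the last stage."""
    level = list(values)
    stages = 0
    while len(level) > 1:
        level = [level[i] + level[i + 1]
                 for i in range(0, len(level), 2)]
        stages += 1
    return level[0] >> stages


# The number of values must be a power of two for this sketch.
samples = [1, 2, 3, 6, 1, 2, 3, 6]
narrow = reduce_same_width(samples)
wide = reduce_wide(samples)
```

For the sample values the exact average is 3. The wide reduction returns 3, whereas the same-width reduction returns 2, because a low-order bit is discarded at more than one stage of the cascade.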
According to the invention, a set of multimedia instructions is implemented which overcomes the shortcomings of the prior conventional SIMD instruction sets.
According to one aspect of the invention, conventional sigma instructions are supplemented with partitioned sigma instructions. Conventional sigma instructions include: inner product, sum of absolute differences, sum of absolute values, and sum of subwords. Additional instructions are provided, including a partitioned inner product instruction, a partitioned sum of absolute differences instruction, and a partitioned sum of absolute values instruction. Similar partitioned sigma instructions may be provided corresponding to other sigma instructions. One advantage of partitioning the sigma instructions is that multiple sigma instructions are executed concurrently to effectively use the capacity of the mediaprocessor.
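One possible reading of a partitioned sigma instruction is sketched below. The long operand is divided into equal partitions and a separate inner product is produced for each partition, so that several sigma operations complete together. The function name, the representation of operands as lists of subwords, and the `partition_size` parameter are hypothetical and are used only for illustration; the summary above does not fix these details.

```python
def partitioned_inner_product(a, b, partition_size):
    """Return one inner product per partition.

    a, b: equal-length lists of subwords (hypothetical representation
    of two long data words).
    partition_size: number of subwords per partition (hypothetical).
    """
    if len(a) != len(b) or len(a) % partition_size != 0:
        raise ValueError("operands must match and divide into partitions")
    return [
        sum(x * y for x, y in zip(a[i:i + partition_size],
                                  b[i:i + partition_size]))
        for i in range(0, len(a), partition_size)
    ]
```

With eight subwords and a partition size of four, the sketch yields two independent inner products. Setting the partition size to the full operand length reduces it to the conventional, unpartitioned sigma behavior.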
According to another aspect of the invention, special registers are included for aligning data on memory word boundaries to reduce packing overhead in providing long data words for instructions which implement data sequences which shift during subsequent iterations.
According to another aspect of the invention, precision is improved for partitioned arithmetic instructions. Specifically, 'extended' partitioned arithmetic instructions are provided. An advantage of these instructions is that accumulation of precision errors is avoided. In particular, accumulated precision errors are truncated.
According to another aspect of the invention, additional formatting instructions are provided. Such additional instructions are partitioned formatting instructions and include partitioned interleave, partitioned compress, and partitioned interleave and compress. An advantage of these instructions is that subwords are packed in an effective order for performing other partitioned operations.
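The summary above does not define the formatting operations in detail. As a rough illustration of the underlying non-partitioned operations only, the following sketch assumes that interleave alternates subwords from two sources and that compress narrows each subword by a right shift followed by truncation. The function names, parameters, and list representation are hypothetical.

```python
def interleave(a, b):
    """Alternate subwords from two sources: a0, b0, a1, b1, ...
    (assumed semantics)."""
    out = []
    for x, y in zip(a, b):
        out.extend((x, y))
    return out


def compress(subwords, shift, out_bits):
    """Narrow each subword: shift right by `shift` bits, then keep only
    the low `out_bits` bits (assumed semantics)."""
    mask = (1 << out_bits) - 1
    return [(s >> shift) & mask for s in subwords]
```

For example, compressing the 16-bit subwords 0x1234 and 0xABCD with a shift of 8 and an output width of 8 keeps the high-order bytes 0x12 and 0xAB, which can then be packed into a word with twice as many subwords.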
According to another aspect of the invention, mixed precision source operands are supported for the partitioned sigma instructions, extended partitioned arithmetic instructions and partitioned formatting instructions.