Microprocessors may include various execution units to perform operations on data. Such execution units may include arithmetic logic units (ALUs), floating-point units, integer units, and other specialized execution units. To improve the efficiency of multimedia applications, among other applications, a single instruction multiple data (SIMD) architecture may enable one instruction to operate on several data elements simultaneously, rather than on a single data element. With parallel hardware execution, multiple operations can be performed with a single instruction, improving performance.
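As an illustration only, the following minimal sketch models in software how a single packed operation may act on several data elements at once. The 128-bit operand width, the 16-bit lane width, the wrap-around behavior, and the names `pack`, `unpack`, and `packed_add` are assumptions chosen for the example; they are not taken from any particular instruction set.

```python
# Hypothetical software model of a SIMD packed add.
# Assumption: a 128-bit operand holds eight 16-bit lanes, lane 0 in the low bits.

def pack(values, lane_bits=16):
    """Pack a list of lane values into one integer operand."""
    mask = (1 << lane_bits) - 1
    word = 0
    for i, v in enumerate(values):
        word |= (v & mask) << (i * lane_bits)
    return word

def unpack(word, lane_bits=16, lanes=8):
    """Split one integer operand back into its lane values."""
    mask = (1 << lane_bits) - 1
    return [(word >> (i * lane_bits)) & mask for i in range(lanes)]

def packed_add(a, b, lane_bits=16, lanes=8):
    """Add two packed operands lane by lane; each sum wraps within its lane."""
    mask = (1 << lane_bits) - 1
    result = 0
    for i in range(lanes):
        shift = i * lane_bits
        x = (a >> shift) & mask
        y = (b >> shift) & mask
        # Masking the sum keeps a carry from spilling into the next lane.
        result |= ((x + y) & mask) << shift
    return result
```

In hardware the lanes would be processed in parallel rather than in a loop; the loop here only stands in for that parallelism. For example, `unpack(packed_add(pack([1, 2, 3, 4, 5, 6, 7, 8]), pack([10] * 8)))` yields `[11, 12, 13, 14, 15, 16, 17, 18]`.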
To enable various operations to take advantage of such architectures, so-called shuffle operations may be performed on packed data residing in a register or other location to rearrange the data elements prior to other operations, such as SIMD operations. Still other instructions cause data in one or more locations to be shifted by a given amount to provide a desired result. Some processors include multiple units to perform shuffle operations on larger data operands, e.g., 128-bit operands. Requiring the use of multiple units increases both the chip area consumed and the power consumed during operation. Furthermore, other operations such as shift operations are performed in different execution units, incurring additional cost in terms of area and power consumption.
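The two kinds of operation described above, rearranging packed elements and shifting a packed operand, can be sketched as follows. This is a simplified model under stated assumptions: the operand is a 16-element list of bytes with element 0 as the least significant byte, and the names `shuffle_bytes` and `shift_right_bytes` are hypothetical. It does not reproduce the exact semantics of any specific processor instruction.

```python
# Hypothetical software model of a byte shuffle and a byte shift
# on a 128-bit operand represented as a list of 16 byte values.

def shuffle_bytes(src, control):
    """Build a result whose element i is src[control[i]].

    Assumption: only the low 4 bits of each control value select a source byte.
    """
    return [src[c & 0x0F] for c in control]

def shift_right_bytes(src, count):
    """Shift the operand right by `count` byte positions, filling with zeros.

    Element 0 is the least significant byte, so a right shift moves
    element `count` down to position 0.
    """
    count = min(count, len(src))
    return src[count:] + [0] * count
```

For example, with `src = list(range(16))`, a control of `list(reversed(range(16)))` reverses the byte order, and `shift_right_bytes(src, 4)` returns bytes 4 through 15 followed by four zero bytes.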
Shuffle-based instructions (among other instructions) may be performed using sub-instruction operations, such as micro-operations (μops) in some instruction set architectures, to carry out the operations needed to obtain a desired result. Such μops may include shuffling, inserting, shifting, concatenating, packing, unpacking and the like. Furthermore, different flavors of such instructions may be used to support different data granularities. As a result, control and data path requirements may vary from one instruction to another. Accordingly, multiple execution units may be needed to perform these operations, and performing them may take multiple μops and machine cycles. Thus, power requirements are raised and undesirable latencies occur.