The instructions in the instruction sets used with single instruction, multiple data (SIMD) architectures apply the same operation to a plurality of operands. For example, first and second floating point registers are used to store source operands A0 to An and B0 to Bn, respectively. For a particular function op, each source operand A_s (where s ranges from 0 to n) in the first register and the identically positioned source operand B_s in the second register may be operated on by an execution unit of a microprocessor to produce a result R_s. The result R_s is stored in the corresponding location in a result register.
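The lane-by-lane behavior described above can be sketched in plain C. This is only an illustrative model, not an actual SIMD instruction set: the names `simd_reg`, `simd_apply`, and `add_op`, and the lane count of 4, are assumptions introduced here, and a real execution unit would process all lanes in parallel rather than in a loop.

```c
#include <assert.h>
#include <stddef.h>

#define LANES 4 /* assumed lane count; the text leaves n unspecified */

/* Hypothetical model of one SIMD register holding LANES floats. */
typedef struct {
    float lane[LANES];
} simd_reg;

/* Apply op to identically positioned operands: R_s = op(A_s, B_s). */
static simd_reg simd_apply(simd_reg a, simd_reg b,
                           float (*op)(float, float))
{
    simd_reg r;
    assert(op != NULL);
    for (size_t s = 0; s < LANES; ++s) {
        r.lane[s] = op(a.lane[s], b.lane[s]);
    }
    return r;
}

/* Example function op: addition, as in A0+B0, A1+B1, and so on. */
static float add_op(float x, float y)
{
    return x + y;
}
```

Calling `simd_apply(a, b, add_op)` fills each lane of the result with the sum of the two source lanes at the same position, which is the pair-wise pattern the rest of this section refers to.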
SIMD instructions have the potential to deliver significant performance improvements in a wide variety of important applications. However, the pair-wise operation (e.g., A0+B0∥A1+B1) of these SIMD instructions can make them difficult to use effectively if the data is incorrectly organized or misaligned. This tends to be more of a problem when retrofitting SIMD processing to existing applications, where the data may have been organized without taking its suitability for SIMD into consideration. However, even in new applications, the requirement to organize and align the data to suit the SIMD instructions can be a significant burden for the programmer (and/or compiler), particularly if operations like convoluted cross-product operations are required. Further, autovectorization, the process by which the compiler automatically uses SIMD instructions, can often be frustrated by data organization or alignment problems, thereby significantly curtailing the benefits of the SIMD support.
To combat these problems, two approaches have typically been employed. Firstly, an ever more complex set of instructions has been introduced in an effort to allow programmers to reorganize the data more cost-effectively before processing. Secondly, new SIMD instructions have been introduced that perform operations in an order different from the standard pair-wise ordering, in an effort to support other commonly occurring data organizations (e.g., an array of structures versus a structure of arrays).
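The difference between the two data organizations, and the reorganization step that the first approach tries to make cheaper, can be sketched in plain C. The type and function names (`point_aos`, `points_soa`, `aos_to_soa`, `sum_components`) and the element count of 4 are hypothetical and chosen only for illustration; on real hardware the copy loop would correspond to dedicated reorganization instructions rather than scalar loads and stores.

```c
#include <assert.h>
#include <stddef.h>

#define N 4 /* assumed number of points, matching an assumed 4-lane register */

/* Array of structures: the x, y, z of one point are adjacent in memory. */
typedef struct {
    float x, y, z;
} point_aos;

/* Structure of arrays: all x values are adjacent, then all y, then all z. */
typedef struct {
    float x[N];
    float y[N];
    float z[N];
} points_soa;

/* Reorganization step: regroup the data so that position s of each
 * array holds the same field of point s. */
static void aos_to_soa(const point_aos *in, points_soa *out)
{
    assert(in != NULL && out != NULL);
    for (size_t i = 0; i < N; ++i) {
        out->x[i] = in[i].x;
        out->y[i] = in[i].y;
        out->z[i] = in[i].z;
    }
}

/* With the structure-of-arrays layout, the pair-wise pattern applies
 * directly: sum[s] = x[s] + y[s] + z[s] for every position s. */
static void sum_components(const points_soa *p, float *sum)
{
    assert(p != NULL && sum != NULL);
    for (size_t s = 0; s < N; ++s) {
        sum[s] = p->x[s] + p->y[s] + p->z[s];
    }
}
```

In this sketch, `aos_to_soa` does no useful arithmetic; it exists only so that `sum_components` can work on identically positioned operands.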
The requirement to use data reorganization (swizzle) instructions will always introduce a performance overhead. While the sophistication of these swizzle instructions has improved over time, they can still cut performance by 50% in many situations. Further, this situation tends to be exacerbated on chip multithreading (CMT) processors, where there tend to be: i) slightly fewer execution resources; and ii) many hardware strands sharing these resources. In this situation, it is often not feasible to “hide” the impact of the swizzle instructions: even if the latency of the operations themselves can be hidden, the requirement to issue these additional instructions will often prevent other, more useful, processing from being undertaken. Adding new forms of SIMD instructions in an attempt to handle different data organizations is also limiting, since only a few additional organizations can realistically be supported. In addition, it is very wasteful of opcode resources, an increasingly valuable commodity on RISC processors with 32-bit opcodes. Further, in some situations the formatting or alignment cannot be easily determined statically.