An instruction set, or instruction set architecture (ISA), is the part of the computer architecture related to programming, and may include the native data types, instructions, register architecture, addressing modes, memory architecture, interrupt and exception handling, and external input and output (I/O). The term instruction generally refers herein to macro-instructions, that is, instructions that are provided for execution to the processor, or to an instruction converter that translates (e.g., using static binary translation or dynamic binary translation, including dynamic compilation), morphs, emulates, or otherwise converts an instruction into one or more other instructions to be processed by the processor. Macro-instructions are distinguished from micro-instructions or micro-operations (micro-ops), which are the result of a processor's decoder decoding macro-instructions.
The ISA is distinguished from the micro-architecture, which is the internal design of the processor implementing the instruction set. Processors with different micro-architectures can share a common instruction set. For example, Intel® Core™ processors and processors from Advanced Micro Devices, Inc. of Sunnyvale, Calif. implement nearly identical versions of the x86 instruction set (with some extensions that have been added with newer versions), but have different internal designs. Likewise, the same register architecture of the ISA may be implemented in different ways in different micro-architectures using well-known techniques, including dedicated physical registers, one or more dynamically allocated physical registers using a register renaming mechanism, etc.
Many modern ISAs support Single Instruction, Multiple Data (SIMD) operations. Instead of a scalar instruction operating on only one data element or pair of data elements, a vector instruction (also referred to as a packed data instruction or SIMD instruction) may operate on multiple data elements or multiple pairs of data elements simultaneously or in parallel. The processor may have parallel execution hardware responsive to the vector instruction to perform the multiple operations simultaneously or in parallel.
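The contrast between a scalar instruction and a vector instruction can be sketched in plain C. This is only an illustrative model: the type `vec4i`, the lane count `LANES`, and the function names are hypothetical and do not correspond to any particular ISA, and the lane loop stands in for what parallel execution hardware would perform in a single operation.

```c
#include <stdint.h>

#define LANES 4  /* hypothetical: four 32-bit data elements per vector */

/* Stand-in for a packed data register holding LANES elements. */
typedef struct { int32_t lane[LANES]; } vec4i;

/* Scalar model: one operation handles a single pair of data elements. */
static int32_t add_scalar(int32_t x, int32_t y) {
    return x + y;
}

/* Vector model: one operation handles LANES pairs of data elements.
 * The loop only emulates the lanes; real SIMD hardware may process
 * them simultaneously. */
static vec4i add_packed(vec4i x, vec4i y) {
    vec4i r;
    for (int l = 0; l < LANES; l++)
        r.lane[l] = x.lane[l] + y.lane[l];
    return r;
}
```

Under this model, adding two arrays of four elements takes four calls to `add_scalar` but a single call to `add_packed`.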
A SIMD operation operates on multiple data elements packed within one register or memory location in one operation. These data elements are referred to as packed data or vector data. Each of the vector data elements may represent a separate individual piece of data (e.g., a color of a pixel, etc.) that may be operated upon separately or independently of the others. SIMD architectures rely on the compiler to vectorize loops for performance. Loops that perform various forms of associative reduction operations (e.g., additions, multiplications, logical operations, etc.) are commonly found in general-purpose applications and system software, as well as in floating-point-intensive and multimedia applications. The reduction operations may be executed conditionally or unconditionally, over a scalar or an array, with a unit-stride or a non-unit-stride access pattern. Array reduction loops with an access stride distance that is less than the vector length cannot be vectorized by current compilers due to the presence of a lexically backward loop-carried flow dependency.
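The dependency problem can be illustrated with a small C sketch. It assumes one plausible form of array reduction loop, `a[i] += a[i - stride]`, which is not taken from the text above, and a hypothetical vector length `VL` of four elements. The naive version emulates a vector instruction by loading all `VL` source elements before storing any result, which is what breaks the dependency when the stride is smaller than `VL`.

```c
#include <stddef.h>
#include <string.h>

#define VL 4  /* hypothetical vector length, in elements */
#define N  8  /* array size used by the comparison helper */

/* Scalar reference: iteration i reads the value written by iteration
 * i - stride, a loop-carried flow dependency of distance stride. */
static void reduce_scalar(int *a, size_t n, size_t stride) {
    for (size_t i = stride; i < n; i++)
        a[i] += a[i - stride];
}

/* Naive vectorization: each chunk loads VL inputs up front, then adds
 * and stores VL results, as a single vector operation would. */
static void reduce_naive_vector(int *a, size_t n, size_t stride) {
    size_t i = stride;
    for (; i + VL <= n; i += VL) {
        int src[VL];
        for (size_t l = 0; l < VL; l++)   /* emulated vector load */
            src[l] = a[i - stride + l];
        for (size_t l = 0; l < VL; l++)   /* emulated vector add and store */
            a[i + l] += src[l];
    }
    for (; i < n; i++)                    /* scalar remainder */
        a[i] += a[i - stride];
}

/* Returns 1 if the naive vector version reproduces the scalar result
 * on an array of N ones, and 0 otherwise. */
static int naive_matches_scalar(size_t stride) {
    int ref[N], vec[N];
    for (size_t i = 0; i < N; i++)
        ref[i] = vec[i] = 1;
    reduce_scalar(ref, N, stride);
    reduce_naive_vector(vec, N, stride);
    return memcmp(ref, vec, sizeof ref) == 0;
}
```

In this sketch, `naive_matches_scalar(1)` and `naive_matches_scalar(2)` return 0 because some lanes read elements that an earlier lane of the same chunk should already have updated, whereas `naive_matches_scalar(4)` returns 1 because no chunk reads an element it also writes.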
Existing instructions do not encapsulate associative array reduction operations with a non-unit stride, and do not encapsulate associative array reduction operations with a unit stride that are executed conditionally. These limitations of the existing instructions prevent vectorization of certain types of reduction loops and, consequently, can result in a loss of performance.
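As a rough illustration of the two loop shapes named above, the following scalar C sketch shows a conditionally executed reduction with unit stride and an unconditional reduction with a non-unit stride. The loop bodies, the predicate array `pred`, and the function names are assumptions chosen for illustration; the text does not specify the exact form of the loops it has in mind.

```c
#include <stddef.h>

/* Conditional array reduction with unit stride: element i accumulates
 * its predecessor only where pred[i] is nonzero (hypothetical form). */
static void reduce_cond_unit_stride(int *a, const unsigned char *pred,
                                    size_t n) {
    for (size_t i = 1; i < n; i++) {
        if (pred[i])
            a[i] += a[i - 1];
    }
}

/* Unconditional array reduction with a non-unit stride: element i
 * accumulates the element stride positions earlier (hypothetical form). */
static void reduce_nonunit_stride(int *a, size_t n, size_t stride) {
    for (size_t i = stride; i < n; i++)
        a[i] += a[i - stride];
}
```

Both loops are written one element at a time; expressing either as a single vector operation is what the existing instructions described above do not provide.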