An instruction set, or instruction set architecture (ISA), is the part of the computer architecture related to programming, and may include the native data types, instructions, register architecture, addressing modes, memory architecture, interrupt and exception handling, and external input and output (I/O). The term instruction generally refers herein to macroinstructions, that is, instructions that are provided for execution to the processor (or to an instruction converter that translates (e.g., using static binary translation or dynamic binary translation including dynamic compilation), morphs, emulates, or otherwise converts an instruction to one or more other instructions to be processed by the processor), as opposed to micro-instructions or micro-operations (micro-ops), which are the result of a processor's decoder decoding macroinstructions.
The ISA is distinguished from the micro-architecture, which is the internal design of the processor implementing the instruction set. Processors with different micro-architectures can share a common instruction set. For example, Intel® Core™ processors and processors from Advanced Micro Devices, Inc. of Sunnyvale, Calif. implement nearly identical versions of the x86 instruction set (with some extensions that have been added with newer versions), but have different internal designs. As a further example, the same register architecture of the ISA may be implemented in different ways in different micro-architectures using well-known techniques, including dedicated physical registers, one or more dynamically allocated physical registers using a register renaming mechanism, etc.
Many modern ISAs support Single Instruction, Multiple Data (SIMD) operations. Instead of a scalar instruction operating on only one data element or pair of data elements, a vector instruction (also referred to as packed data instruction or SIMD instruction) may operate on multiple data elements or multiple pairs of data elements simultaneously or in parallel. The processor may have parallel execution hardware responsive to the vector instruction to perform the multiple operations simultaneously or in parallel.
A SIMD operation operates on multiple data elements packed within one register or memory location in one operation. These data elements are referred to as packed data or vector data. Each of the vector elements may represent a separate individual piece of data (e.g., a color of a pixel, etc.) that may be operated upon separately or independently of the others.
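As an illustration only, the packed-data idea can be sketched in portable C. The type vec4f and the function vec4f_add are hypothetical names introduced here, not part of any ISA or library; the struct stands in for a 128-bit vector register holding four single-precision elements, and the loop stands in for what vector hardware would do in a single operation.

```c
#include <stddef.h>

/* Hypothetical stand-in for a 128-bit vector register that packs
 * four single-precision data elements. */
typedef struct {
    float lane[4];
} vec4f;

/* Packed (element-wise) addition: each lane is added independently
 * of the others. Real SIMD hardware would perform the four additions
 * in one instruction; this scalar loop only models the result. */
static vec4f vec4f_add(vec4f x, vec4f y)
{
    vec4f r;
    for (size_t k = 0; k < 4; ++k) {
        r.lane[k] = x.lane[k] + y.lane[k];
    }
    return r;
}
```

Each lane of the result depends only on the corresponding lanes of the two inputs, which is what allows the hardware to process the elements in parallel.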
In some scenarios, a piece of source code may specify a particular order for carrying out a reduction operation on an array of data elements. An example of a reduction operation is addition, which adds all of the data elements in the array to produce a single sum, such as the operation specified in the following serial source code:
 float *a;
 float sum = 0.0f;
 for (int i = 0; i < 100 * 1024; ++i) {
     sum += a[i];
 }
The above source code performs a reduction operation on an array by summing the array elements in increasing index order. For floating-point data elements, changing the order in which the data elements are added can change the final sum, because each floating-point addition rounds its result; the change may be slight. In scientific computation that requires high-precision arithmetic, even a slight change may be unacceptable. Therefore, there is a need to maintain the order in which the data elements are operated on, to preserve the precise rounding behavior specified by the source code. However, serial computation such as the above is time-consuming. If the floating-point computations could be reordered, the summation could be accomplished by accumulating four partial sums, which would then be added together outside of the loop. In this case, the loop body would load four single-precision values at a time and would contain:
movups (%[a], %[i], 4), %xmm0 // load 16B
addps %xmm0, %[sum]
The above assembly code uses packed data addition (also referred to as vector addition), ‘addps’, which accumulates the content of a vector register (xmm0) into a sum. The assembly code is more efficient than the serial source code because it uses vector operations; however, it does not preserve the order of the reduction operation specified in the serial source code and may therefore generate a different result.
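The effect of reordering can be made concrete with a small sketch in portable C. It is not the vectorized code itself: the names sum_serial, sum_four_lanes, and demo are hypothetical, the four-element lane array merely models the four partial sums that ‘addps’ would keep in a vector register, and the sketch assumes IEEE 754 single-precision arithmetic with round-to-nearest, no wider intermediate precision, and no fast-math style reassociation by the compiler.

```c
#include <stddef.h>

/* Serial reduction: adds the elements strictly in increasing index
 * order, as in the serial source code. */
static float sum_serial(const float *a, size_t n)
{
    float sum = 0.0f;
    for (size_t i = 0; i < n; ++i) {
        sum += a[i];
    }
    return sum;
}

/* Reordered reduction: keeps four running partial sums (one per
 * modeled vector lane) and combines them after the loop.
 * Assumes n is a multiple of 4. */
static float sum_four_lanes(const float *a, size_t n)
{
    float lane[4] = {0.0f, 0.0f, 0.0f, 0.0f};
    for (size_t i = 0; i < n; i += 4) {
        for (size_t k = 0; k < 4; ++k) {
            lane[k] += a[i + k];
        }
    }
    return ((lane[0] + lane[1]) + lane[2]) + lane[3];
}

/* Hypothetical data chosen so that the two orders round differently:
 * near 1e8 the spacing between adjacent floats is 8, so adding 1.0f
 * to 1e8f leaves the value unchanged. */
static const float demo[8] = {
    1e8f, 1.0f, 1.0f, 1.0f, -1e8f, 1.0f, 1.0f, 1.0f
};
```

Under the stated assumptions, sum_serial(demo, 8) returns 3.0f, because the three small terms added right after 1e8f are lost to rounding, whereas sum_four_lanes(demo, 8) returns 6.0f, because the two large terms cancel within a single lane before any small term is added to them. Both results follow from the same input data; only the order of the additions differs.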