An instruction set, or instruction set architecture (ISA), is the part of the computer architecture related to programming, and may include the native data types, instructions, register architecture, addressing modes, memory architecture, interrupt and exception handling, and external input and output (I/O). The term instruction generally refers herein to macro-instructions, that is, instructions that are provided for execution to the processor (or to an instruction converter that translates (e.g., using static binary translation or dynamic binary translation including dynamic compilation), morphs, emulates, or otherwise converts an instruction into one or more other instructions to be processed by the processor). Macro-instructions are distinguished from micro-instructions or micro-operations (micro-ops), which are the result of a processor's decoder decoding macro-instructions.
The ISA is distinguished from the micro-architecture, which is the internal design of the processor implementing the instruction set. Processors with different micro-architectures can share a common instruction set. For example, Intel® Core™ processors and processors from Advanced Micro Devices, Inc. of Sunnyvale, Calif. implement nearly identical versions of the x86 instruction set (with some extensions that have been added with newer versions), but have different internal designs. As a further example, the same register architecture of the ISA may be implemented in different ways in different micro-architectures using well-known techniques, including dedicated physical registers, one or more dynamically allocated physical registers using a register renaming mechanism, etc.
Many modern ISAs support Single Instruction, Multiple Data (SIMD) operations. Instead of a scalar instruction operating on only one or two data elements, a vector instruction (also referred to as packed data instruction or SIMD instruction) may operate on multiple data elements or multiple pairs of data elements simultaneously or in parallel. The processor may have parallel execution hardware responsive to the vector instruction to perform the multiple operations simultaneously or in parallel. A SIMD operation operates on multiple data elements packed within one vector register or memory location in one operation. These data elements are referred to as packed data or vector data. Each of the vector elements may represent a separate individual piece of data (e.g., a color of a pixel, etc.) that may be operated upon separately or independently of the others.
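The contrast between a scalar instruction and a packed data instruction can be sketched in plain C. This is only an illustrative model, not any particular ISA: the type vec_t, the width LANES, and the functions add_scalar and add_packed are hypothetical names introduced here, and the inner loop in add_packed merely stands in for lanes that parallel execution hardware would process at the same time.

```c
#include <stddef.h>
#include <stdint.h>

#define LANES 8  /* assumed vector width: eight 32-bit data elements */

/* Hypothetical packed-data type: LANES independent 32-bit elements. */
typedef struct { int32_t lane[LANES]; } vec_t;

/* Scalar form: each addition operates on one pair of data elements. */
void add_scalar(const int32_t *a, const int32_t *b, int32_t *out, size_t n) {
    for (size_t i = 0; i < n; i++)
        out[i] = a[i] + b[i];
}

/* Models a single vector add: one operation combines all LANES pairs.
 * The loop stands in for lanes that SIMD hardware would handle in parallel. */
vec_t add_packed(vec_t a, vec_t b) {
    vec_t r;
    for (int l = 0; l < LANES; l++)
        r.lane[l] = a.lane[l] + b.lane[l];
    return r;
}
```

In this model, an array of N elements would need N calls to the scalar addition but only about N/LANES calls to add_packed, which is the source of the throughput benefit described above.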
In some scenarios, a SIMD operation may operate on independent vector data elements in a recursive manner, where the number of iterations is different for different data elements. Thus, computation for some data elements may be finished while some other data elements still need more iterations. One example of the recursive computation is a WHILE loop operation. In this example, a data array X[i] (i=0, ..., N-1) of N elements is subject to a recursive computation while condition(X[i]) is true (satisfied). The computation for X[i] terminates when condition(X[i]) becomes false. An example of the condition may be X[i]>0.
for (i=0; i<N; i++){
    while (condition(X[i])){
        X[i]=computation(X[i]);
    }
}
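For concreteness, the loop nest can be written as a compilable scalar sketch. The condition X[i]>0 is the example given in the text; the body of computation (subtracting 3) is purely an assumption chosen so that the loop terminates, and the iters array and the function name while_loop_scalar are hypothetical additions used only to expose how many iterations each element needs.

```c
#include <stddef.h>

/* Example condition from the text: keep iterating while the element is positive. */
static int condition(int x) { return x > 0; }

/* Hypothetical computation, assumed only so that the loop terminates. */
static int computation(int x) { return x - 3; }

/* Runs the loop nest over X[0..N-1]; iters[i] receives the
 * number of WHILE loop iterations performed for X[i]. */
void while_loop_scalar(int *X, unsigned *iters, size_t N) {
    for (size_t i = 0; i < N; i++) {
        iters[i] = 0;
        while (condition(X[i])) {
            X[i] = computation(X[i]);
            iters[i]++;
        }
    }
}
```

With the assumed computation, the input {7, 1, 0, 12} takes 3, 1, 0, and 4 iterations respectively, which illustrates how the iteration count can differ from one data element to the next.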
The above computation cannot be easily vectorized if the number of WHILE loop iterations is different for different data elements of X[i]. One possible approach is for a processor to keep performing computation over all elements of a vector, including those elements that no longer satisfy the condition, and then throw away the results derived from those elements. However, this approach has low efficiency because the processor not only performs unnecessary computation over those elements, but also is unable to utilize the vector register slots occupied by those elements.
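The inefficiency of this approach can be made visible with a small lock-step model in C. This is a sketch under stated assumptions rather than an implementation of any real instruction: LANES, lane_stats, and while_loop_lockstep are hypothetical names, the condition is the X[i]>0 example from the text, and the computation (subtracting 3) is assumed only for illustration. Every lane is computed on every pass, and results for lanes whose condition is already false are discarded and counted as wasted.

```c
#include <stdbool.h>
#include <stddef.h>

#define LANES 4  /* assumed number of elements per vector register */

static bool condition(int x) { return x > 0; }   /* example from the text */
static int computation(int x) { return x - 3; }  /* hypothetical */

/* Counts lane slots that produced a kept result versus a discarded one. */
typedef struct { unsigned useful; unsigned wasted; } lane_stats;

/* Processes one vector of LANES elements in lock step until
 * no lane satisfies the condition any longer. */
lane_stats while_loop_lockstep(int X[LANES]) {
    lane_stats s = {0, 0};
    for (;;) {
        bool mask[LANES];
        bool any = false;
        for (int l = 0; l < LANES; l++) {
            mask[l] = condition(X[l]);
            any = any || mask[l];
        }
        if (!any)
            break;                      /* every lane has finished */
        for (int l = 0; l < LANES; l++) {
            int r = computation(X[l]);  /* computed for every lane */
            if (mask[l]) {
                X[l] = r;               /* lane still active: keep result */
                s.useful++;
            } else {
                s.wasted++;             /* lane finished: result thrown away */
            }
        }
    }
    return s;
}
```

For the input {7, 1, 0, 12}, the vector runs for four passes because the slowest lane needs four iterations. Of the 16 lane slots consumed, 8 hold useful work and 8 are discarded, so half of the vector capacity is lost in this example.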