Parallel processing is generally faster than scalar execution, which handles one data point at a time. Single Instruction, Multiple Data (SIMD) computers have multiple processing elements that perform the same operation on multiple data points, and they achieve performance gains by using their parallel execution cores simultaneously.
SIMD processors can take advantage of parallelism both in performing mathematical operations and in moving data. A SIMD processor can load or store multiple data items simultaneously, resulting in a performance gain over a scalar processor, which loads or stores one datum at a time. When a computer program executes on a processor with parallel resources, SIMD instructions generally offer better performance than scalar instructions.
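The difference between scalar and SIMD loads, arithmetic, and stores can be illustrated with a minimal sketch. The function names `add_scalar` and `add_simd` are hypothetical and chosen only for illustration. The sketch assumes an x86 compiler that defines `__SSE__` and provides the SSE intrinsics in `<xmmintrin.h>`; on other targets it falls back to the scalar loop.

```c
#include <stddef.h>
#if defined(__SSE__)
#include <xmmintrin.h>  /* SSE intrinsics: _mm_loadu_ps, _mm_add_ps, _mm_storeu_ps */
#endif

/* Scalar version: one load pair, one add, and one store per element. */
static void add_scalar(const float *a, const float *b, float *out, size_t n) {
    for (size_t i = 0; i < n; ++i) {
        out[i] = a[i] + b[i];
    }
}

/* SIMD version (hypothetical helper): when SSE is available, each iteration
 * loads, adds, and stores four floats with single instructions. */
static void add_simd(const float *a, const float *b, float *out, size_t n) {
    size_t i = 0;
#if defined(__SSE__)
    for (; i + 4 <= n; i += 4) {
        __m128 va = _mm_loadu_ps(a + i);            /* load 4 floats (unaligned) */
        __m128 vb = _mm_loadu_ps(b + i);            /* load 4 floats (unaligned) */
        _mm_storeu_ps(out + i, _mm_add_ps(va, vb)); /* add and store 4 floats */
    }
#endif
    /* Scalar tail for leftover elements, or the whole array without SSE. */
    for (; i < n; ++i) {
        out[i] = a[i] + b[i];
    }
}
```

Both functions produce the same results; the SIMD version simply issues roughly one quarter as many load, add, and store instructions for the vectorized portion of the array. Note that the intrinsics are tied to one instruction set generation, which is why the sketch needs a preprocessor guard and a scalar fallback.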
Programming with a SIMD Instruction Set Architecture (ISA), however, can be challenging. SIMD ISAs, for example, are generally processor-specific. Programs that use SIMD instructions may need to be rewritten and customized to suit a new processor generation. The work required to adapt scalar instructions to a new instruction set architecture, including rewriting the code, documenting the code, enabling compilers to emit the code, training users to use the code, and debugging and collecting traces of code execution, may need to be partly or wholly repeated for each new generation of instruction set architecture (e.g., MMX, SSE, SSE2, SSE3, SSE4, AVX, AVX2, AVX 3.1, and AVX 3.2). What is needed, therefore, is a way to allow programmers to take advantage of SIMD Instruction Set Architectures while avoiding the challenges inherent in conventional solutions.
Furthermore, conventional solutions are limited because they optimize the code statically, ahead of time, rather than dynamically, during execution. Compilers attempt to optimize execution of certain code sequences, but they operate in a static environment, without knowledge of the state of the machine or of the registers. Even hand-coded SIMD code cannot be optimized according to the run-time state of the machine and of the registers. What is needed, therefore, is a way to optimize instructions at run-time, with knowledge of the state of the registers and their contents.