A vector instruction is a single instruction that operates on a group of values. For example, in the x86 architecture, the Streaming SIMD Extensions (SSE) instruction ADDPS $xmm0, $xmm1 (add packed single-precision floating-point values) operates on two xmm registers that each hold four single-precision floating-point values. Corresponding values from the two registers are added together, and the results are stored in the first register. This behavior is equivalent to the pseudo-code sequence:
for (i = 0; i < 4; i++) $xmm0[i] = $xmm0[i] + $xmm1[i];
The group of values can come from registers, memory, or a combination of both. Registers that hold groups of values, generally intended for use by vector instructions, are referred to as vector registers. The number of values in a group is called the vector length. In some examples, the vector length is also used to describe the number of operations performed by the vector instruction. Generally, the number of values in a vector register and the number of operations in a corresponding vector instruction calling for the vector register are the same, but they can be different in certain situations.
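The terms above can be made concrete with a small software model. The following is a minimal sketch in portable C, not x86 intrinsics: the type vreg4f, the constant VLEN, and the function vadd4f are hypothetical names introduced only for illustration. It models a vector register with a vector length of four and an add whose result overwrites the first operand, as in the ADDPS example.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model: VLEN is the vector length, i.e. the number of
   values held in one vector register. */
#define VLEN 4

/* A software stand-in for a 128-bit vector register of four floats. */
typedef struct {
    float lane[VLEN];
} vreg4f;

/* Element-wise add with ADDPS-like semantics: dst = dst + src.
   One call stands for one vector instruction performing VLEN operations. */
static void vadd4f(vreg4f *dst, const vreg4f *src)
{
    assert(dst != NULL && src != NULL);
    for (size_t i = 0; i < VLEN; i++) {
        dst->lane[i] = dst->lane[i] + src->lane[i];
    }
}
```

In this model the number of values in the register and the number of operations performed by the instruction are both VLEN, which is the common case described above.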
An instruction set architecture (ISA) including vector instructions is known as a vector ISA or vector architecture. A processor that implements a vector ISA is known as a vector processor.
A vector ISA where all vector instructions read their vector inputs from memory and write to memory without using any vector registers is known as a memory-to-memory vector or memory-vector architecture.
A vector ISA where all vector instructions, other than loads or stores, use only vector registers without accessing memory, is known as a register-vector architecture.
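The difference between the two styles can be sketched in portable C. This is an illustrative model only: mem_vadd, vload, vstore, vadd, and reg_vadd4 are hypothetical names, and each function stands in for what would be a single vector instruction on a real machine.

```c
#include <assert.h>
#include <stddef.h>

#define VLEN 4  /* hypothetical vector register length */

typedef struct { float lane[VLEN]; } vreg;

/* Memory-to-memory style: one operation reads both inputs from memory
   and writes the result to memory. No vector register is involved. */
static void mem_vadd(float *dst, const float *a, const float *b, size_t n)
{
    assert(dst != NULL && a != NULL && b != NULL);
    for (size_t i = 0; i < n; i++) {
        dst[i] = a[i] + b[i];
    }
}

/* Register-vector style: only loads and stores access memory. */
static vreg vload(const float *p)
{
    vreg v;
    for (size_t i = 0; i < VLEN; i++) v.lane[i] = p[i];
    return v;
}

static void vstore(float *p, vreg v)
{
    for (size_t i = 0; i < VLEN; i++) p[i] = v.lane[i];
}

/* The arithmetic instruction sees registers only. */
static vreg vadd(vreg x, vreg y)
{
    vreg z;
    for (size_t i = 0; i < VLEN; i++) z.lane[i] = x.lane[i] + y.lane[i];
    return z;
}

/* The same four-element add expressed as load, load, add, store. */
static void reg_vadd4(float *dst, const float *a, const float *b)
{
    vreg x = vload(a);
    vreg y = vload(b);
    vstore(dst, vadd(x, y));
}
```

Both routines compute the same result for four elements. The register-vector version needs four modeled instructions where the memory-to-memory version needs one, but its intermediate values can stay in registers across several arithmetic instructions.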
Vector instructions (such as the ADDPS above) can implicitly specify a fixed number of operations (four in the case of the ADDPS instruction). These are called fixed-length vector instructions. Another term for fixed-length register-vector instructions is SIMD (Single Instruction Multiple Data) instructions.
Previous-generation vector processors were implemented on multiple boards using customized techniques to improve performance. The majority of them were targeted at high-performance computing applications, such as weather prediction, which often require supercomputers. However, technology development enabled single-chip microprocessors to outperform these multi-board implementations, resulting in these vector processors being phased out. Instead, supercomputers became multi-processors that combined several of these high-performance microprocessors.
A common characteristic of these processors was that they were generally not compatible with earlier models from the same company, because the instruction set varied from model to model. This practice was motivated by the fact that they were targeted at problem domains where it was critical to extract as much performance as possible, and users were willing to rewrite their applications to do so. However, this practice could expose implementation details of the machine in the instruction set, so that instruction sets changed as the implementation details changed from model to model. For example, the maximum vector length that could be specified was determined by the maximum number of elements that the vector register could hold in each implementation.
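To illustrate how such a limit leaks into application code, the following is a minimal sketch in portable C of a loop that processes a long array in chunks no larger than the machine's maximum vector length, a technique commonly called strip-mining. The constant MAX_VL (set here to 64 purely as an assumption) and the functions vector_add and add_arrays are hypothetical. On a model with a different register size, MAX_VL would change, and code written around the old value would need to be revisited.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-model limit: the most elements one vector
   register can hold on this particular implementation. */
#define MAX_VL 64

/* Stand-in for one vector add instruction with an explicit
   vector length vl, which must not exceed MAX_VL. */
static void vector_add(float *dst, const float *a, const float *b, size_t vl)
{
    assert(vl <= MAX_VL);
    for (size_t i = 0; i < vl; i++) {
        dst[i] = a[i] + b[i];
    }
}

/* Add n elements in chunks of at most MAX_VL.
   Returns the number of modeled vector instructions issued. */
static size_t add_arrays(float *dst, const float *a, const float *b, size_t n)
{
    size_t issued = 0;
    size_t done = 0;
    while (done < n) {
        size_t vl = n - done;
        if (vl > MAX_VL) {
            vl = MAX_VL;  /* clamp to what the register can hold */
        }
        vector_add(dst + done, a + done, b + done, vl);
        done += vl;
        issued++;
    }
    return issued;
}
```

With MAX_VL at 64, an array of 150 elements is handled as chunks of 64, 64, and 22 elements.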
A second stream of vector processors emerged as transistor density kept increasing. By the late 1990s, general-purpose microprocessors had reached a point of diminishing returns from increasing the number of scalar functional units, even though there was still chip area that could be used to support more of them. At the same time, there was a desire to support video encoding and decoding directly on these microprocessors. The confluence of these two trends led to the introduction of various fixed-length vector extensions to existing general-purpose architectures: MMX for the Intel x86, AltiVec/VMX for the IBM PowerPC, and MVI for the DEC Alpha, for instance.
These SIMD-style architectures used registers with fixed byte lengths (8 bytes in the case of MMX, 16 bytes for AltiVec). The registers were typically designed to hold multiple smaller elements that could be operated on simultaneously. Thus, an MMX register could hold two 4-byte integers, four 2-byte integers, or eight 1-byte integers. The instructions PADDD, PADDW, and PADDB add the contents of two registers together, treating them as holding two 4-byte, four 2-byte, or eight 1-byte values, respectively.
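The way one 8-byte register is reinterpreted by each instruction can be sketched in portable C. The functions below are a software emulation, not the MMX intrinsics: packed_add, paddb, paddw, and paddd are hypothetical helper names, and a uint64_t stands in for the register. The sketch assumes the wraparound behavior of the plain PADDB/PADDW/PADDD forms, in which a carry out of one element is discarded rather than propagated into its neighbor.

```c
#include <assert.h>
#include <stdint.h>

/* Add two 64-bit "registers" element by element.
   lane_bits selects the element size: 8, 16, or 32 bits.
   Each element wraps around on overflow; carries never cross
   element boundaries. */
static uint64_t packed_add(uint64_t x, uint64_t y, unsigned lane_bits)
{
    assert(lane_bits == 8 || lane_bits == 16 || lane_bits == 32);
    const uint64_t mask = (UINT64_C(1) << lane_bits) - 1;
    uint64_t result = 0;
    for (unsigned shift = 0; shift < 64; shift += lane_bits) {
        uint64_t lx = (x >> shift) & mask;       /* extract one element */
        uint64_t ly = (y >> shift) & mask;
        result |= ((lx + ly) & mask) << shift;   /* drop the carry */
    }
    return result;
}

/* Same register contents, three different interpretations. */
static uint64_t paddb(uint64_t x, uint64_t y) { return packed_add(x, y, 8);  }
static uint64_t paddw(uint64_t x, uint64_t y) { return packed_add(x, y, 16); }
static uint64_t paddd(uint64_t x, uint64_t y) { return packed_add(x, y, 32); }
```

For example, adding 1 to a register holding 0x000000000000FFFF gives 0x000000000000FF00 when treated as eight 1-byte values, 0 when treated as four 2-byte values, and 0x0000000000010000 when treated as two 4-byte values. The bits are identical in each case; only the element boundaries differ.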
As technology advanced, it became possible for the vector registers to hold additional values. The MMX extension for the x86 architecture was followed by the 16-byte SSE, the 32-byte AVX2, and the 64-byte AVX3 (more commonly known as AVX-512). At each point, additional instructions were introduced to perform substantially the same operations on the wider registers.
In the case of implementations of general-purpose architectures, for business reasons, newer models are able to run code written for older models. Thus, a newer implementation of the x86 architecture can support multiple different vector register widths, along with instructions that operate on each of these vector register widths.