In specialized markets, a vector processor operating with a vector instruction set provides high-performance results. When a vector instruction set is implemented in a computer system, software may be written to take advantage of it. For compatibility and standardization reasons, users expect this software to operate on all products distributed by the creator of the vector instruction set. The vector instruction set must therefore be implemented on existing architecture platforms. In some cases, it may have to be implemented on an essentially scalar processor.
Typically, vectors are formed of many elements, and memory operations on these vectors are similarly divided into multiple elements. In addition, a ‘vector stride’ (VS) of the memory operation (the distance in memory between the start of one element and the start of the next) is tracked. The base address specifies the location of the first element; the second element is at base address + VS, the third element is at base address + 2*VS, and so on.
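The addressing rule above can be illustrated with a minimal sketch. The function name `element_addresses` and the sample values are hypothetical, chosen only for illustration; addresses and strides are assumed to be in bytes.

```python
def element_addresses(base, stride, count):
    """Return the address of each element: base, base + VS, base + 2*VS, ..."""
    return [base + i * stride for i in range(count)]


# Example: four elements starting at 0x1000 with a stride of 8 bytes.
addresses = element_addresses(0x1000, 8, 4)
print([hex(a) for a in addresses])
```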
In some cases, performing a general memory operation on a long vector can be time consuming, taking many clock cycles. This is because the elements being accessed are sequenced out according to their VS in a long and cumbersome general-purpose flow. Each element must be individually accessed in a load or store operation, while the bits between the elements are left untouched by that operation. For a 512-bit vector, for example, a general memory operation of this kind can be quite inefficient.
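As a rough software model of this general flow (not a description of any particular hardware), the sketch below performs one access per element and counts the accesses. The name `general_strided_load`, the byte-addressed `memory` buffer, and the 32-bit element size are assumptions made for illustration.

```python
def general_strided_load(memory, base, stride, elem_size, count):
    """Load `count` elements one at a time; return (elements, access_count)."""
    elements = []
    accesses = 0
    for i in range(count):
        addr = base + i * stride
        # Only the element's own bytes are read; bytes between elements are skipped.
        elements.append(bytes(memory[addr:addr + elem_size]))
        accesses += 1
    return elements, accesses


# A 512-bit vector of 32-bit (4-byte) elements has 16 elements,
# so the general flow issues 16 separate accesses.
memory = bytearray(range(256))
elements, accesses = general_strided_load(memory, 0, 8, 4, 16)
print(accesses)
```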
One condition that can enable more efficient memory operations arises when the VS of the memory operation indicates “unit-stride” elements. This means that the VS matches the element size of the vector being operated on, so that all elements in the vector are contiguous in memory (the layout of the source or destination vector register matches the layout in memory). In addition, when VS=0, all memory instructions are treated as unit-stride. When operating with a 512-bit vector length (VL), it can be inefficient to individually access elements that are unit-strided.
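The benefit of detecting unit stride can be sketched as follows. This is an illustrative model only: the names `is_unit_stride` and `load_vector` are hypothetical, and treating VS=0 as unit-stride simply follows the statement in the text above.

```python
def is_unit_stride(stride, elem_size):
    # Unit-stride: VS equals the element size. VS == 0 is also accepted here,
    # following the text's statement (an assumed reading of that convention).
    return stride == elem_size or stride == 0


def load_vector(memory, base, stride, elem_size, count):
    """Return (elements, number_of_memory_accesses)."""
    if is_unit_stride(stride, elem_size):
        # Fast path: one contiguous access covers the whole vector.
        block = bytes(memory[base:base + elem_size * count])
        elements = [block[i * elem_size:(i + 1) * elem_size] for i in range(count)]
        return elements, 1
    # General path: one access per element.
    elements = [
        bytes(memory[base + i * stride:base + i * stride + elem_size])
        for i in range(count)
    ]
    return elements, count


memory = bytearray(range(256))
_, fast_accesses = load_vector(memory, 0, 4, 4, 16)  # unit-stride, 512 bits total
_, slow_accesses = load_vector(memory, 0, 8, 4, 16)  # strided
print(fast_accesses, slow_accesses)
```

In this model the unit-stride case touches memory once for the whole 512-bit vector, while the strided case still needs one access per element.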
Another condition that affects the efficiency of memory operations is the ‘vector mask’ (VM) of the instruction. The VM of an element indicates whether a memory operation should be applied to that element. For example, if the VM for an element is the Boolean value true, the memory operation is applied to that element; otherwise, the element retains its old value. If the VM is true for all elements in the vector, then all of the elements are operated on. This is known as “unmasked” code. Generally, most performance-critical code is unmasked. When operating with a 512-bit VL, it can be inefficient to assess the VM for each element in the vector.
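The mask semantics described above can be modeled with a short sketch. The names `masked_load` and `is_unmasked` are hypothetical, and the mask is assumed to be a list of Booleans with one entry per element.

```python
def masked_load(dest, loaded, mask):
    """Apply the load only where the mask is true; other elements keep their old value."""
    return [new if m else old for old, new, m in zip(dest, loaded, mask)]


def is_unmasked(mask):
    """True when every element's mask bit is set, i.e. the operation is unmasked."""
    return all(mask)


old_values = [1, 2, 3, 4]
new_values = [10, 20, 30, 40]
print(masked_load(old_values, new_values, [True, False, True, False]))
print(is_unmasked([True, True, True, True]))
```

A single check such as `is_unmasked` lets an implementation decide once per instruction whether the mask can be ignored, rather than testing it element by element.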