The present disclosure generally relates to data processing, and more specifically, to a variable-length vector store instruction.
Vectors are collections of scalar values, also referred to as vector elements. A processor can perform operations on vector data values simultaneously, thus providing more efficient operations. During vector operations, it may be necessary to process portions of a vector. For example, in a simple DO loop, two vectors of numbers may be added and stored in a third vector:
    for (i=0; i<7; i++) z[i]=x[i]+y[i]
Vectorization is most efficient when the number of elements in the operation (denoted i here, after the loop bound above) is a multiple of the vector length. The vector length is determined by the processing hardware. For example, in one implementation a vector holds four single precision elements, so single precision computations have a vector length of four. Vectorization therefore works best if i, above, is a multiple of four.
Problems can occur when i is not a multiple of the vector length (four). One or more scalar elements are left over, which are sometimes referred to as “residue.” In the example above, i=7. The first four elements are loaded with a vector load operation. However, a second vector load operation would load the three leftover elements (residue) and then attempt to load a fourth element from a memory location that the operation may not be allowed to access.
If i=7 is known at compile time, the compiler can emit one vector operation followed by three scalar operations for the three residue elements. However, this leads to “code bloat.” The number of operations performed is much greater than the two vector operations that would suffice if, for example, i had been equal to 8. Moreover, i may not be known at compile time. If i is computed at runtime, controls are needed to test whether i is a multiple of the vector length or whether residue elements are left over, and then to determine how many residue elements there are. This type of operation also leads to code bloat.
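The runtime residue handling described above can be illustrated with a minimal sketch in portable C. The sketch assumes a vector length of four; the name add_with_residue and the macro VLEN are hypothetical, and the inner four-lane loop only stands in for a single hardware vector operation.

```c
#include <assert.h>
#include <stddef.h>

#define VLEN 4  /* assumed vector length: four single precision elements */

/* Adds x and y into z for n elements. Returns the number of residue
 * elements that had to be processed one at a time. */
size_t add_with_residue(const float *x, const float *y, float *z, size_t n)
{
    assert(x != NULL && y != NULL && z != NULL);

    size_t full = n - (n % VLEN);  /* elements covered by whole vectors */
    size_t i;

    /* Each pass of the outer loop stands in for one vector add on VLEN lanes. */
    for (i = 0; i < full; i += VLEN) {
        for (size_t lane = 0; lane < VLEN; lane++)
            z[i + lane] = x[i + lane] + y[i + lane];
    }

    /* Scalar cleanup for the residue, so no access goes past element n-1. */
    for (; i < n; i++)
        z[i] = x[i] + y[i];

    return n - full;
}
```

For n=7, the sketch performs one four-lane pass and three scalar additions. The extra test and the second loop are the kind of added code the passage refers to as code bloat.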
In addition, loading four elements when only three are needed may result in the load spanning a page boundary, with the fourth element on a page to which the program does not have read access. In that case, a page fault may occur.
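The page boundary condition can be made concrete with a short address calculation. The following is an illustrative sketch only; the function name access_crosses_page is hypothetical, and the page size is passed in as an assumed power of two (4096 bytes in the usage note) rather than taken from any particular system.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Returns true if an access of nbytes starting at addr touches more than
 * one page. page_size is assumed to be a nonzero power of two. */
bool access_crosses_page(uintptr_t addr, size_t nbytes, size_t page_size)
{
    assert(nbytes > 0);
    assert(page_size != 0 && (page_size & (page_size - 1)) == 0);

    uintptr_t page_mask  = ~(uintptr_t)(page_size - 1);
    uintptr_t first_page = addr & page_mask;                 /* page of first byte */
    uintptr_t last_page  = (addr + nbytes - 1) & page_mask;  /* page of last byte */

    return first_page != last_page;
}
```

With 4096-byte pages, three single precision elements (12 bytes) that end exactly at a page boundary stay within one page, whereas a full four-element load (16 bytes) from the same address reaches into the next page.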
Other solutions include a load vector under mask. This solution provides a vector load but also provides the ability to specify a mask so that only certain elements are accessed out of storage. However, this can lead to complications as well (for example, when exceptions occur). A bit mask can lead to scenarios where only portions of a byte might be written, leading to extreme complexity for hardware designers because updating portions of a byte is complex and must be performed atomically to avoid race conditions. Another disadvantage of a mask is the costly computation of the mask itself.
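A masked load can be sketched in software as follows. This is an emulation under stated assumptions, not any particular instruction: it assumes a vector length of four and one mask bit per element (the simplest case, which avoids the partial-byte scenario noted above), and the names residue_mask and masked_load are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define VLEN 4  /* assumed vector length */

/* Builds a lane mask with the low `remaining` bits set, one bit per element. */
uint8_t residue_mask(size_t remaining)
{
    assert(remaining <= VLEN);
    return (uint8_t)((1u << remaining) - 1u);
}

/* Emulates a load under mask: lanes whose mask bit is clear are never read
 * from memory and are set to zero in the destination. */
void masked_load(float dst[VLEN], const float *src, uint8_t mask)
{
    for (size_t lane = 0; lane < VLEN; lane++)
        dst[lane] = ((mask >> lane) & 1u) ? src[lane] : 0.0f;
}
```

Even in this simplified form, the mask must be computed from the residue count before every partial access, which illustrates the extra computation attributed to mask-based approaches.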
Variable length multi-register load instructions are another existing solution. A load length is specified. A load instruction then loads the correct number of elements of data. However, these instructions are difficult to implement in hardware. Also, the length is specified in a difficult-to-access special purpose register with high latency.
In another aspect of vector operations, programmers need both “big endian” and “little endian” execution processing to be addressed. The term “endian” refers to how bytes of a multi-byte element are ordered within memory as data is moved between registers and memory. A big endian processor stores the most significant byte of a multi-byte element in the lowest address of a consecutive range, and stores the least significant byte in the highest address. In contrast, a little endian processor stores the least significant byte in the lowest address.
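The two byte orderings can be illustrated for a four-byte element with a minimal sketch. The helper names store_be32 and store_le32 are hypothetical; index 0 of the output array represents the lowest address of the consecutive range.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Big endian: most significant byte at the lowest address (out[0]). */
void store_be32(uint8_t out[4], uint32_t value)
{
    assert(out != NULL);
    out[0] = (uint8_t)(value >> 24);
    out[1] = (uint8_t)(value >> 16);
    out[2] = (uint8_t)(value >> 8);
    out[3] = (uint8_t)value;
}

/* Little endian: least significant byte at the lowest address (out[0]). */
void store_le32(uint8_t out[4], uint32_t value)
{
    assert(out != NULL);
    out[0] = (uint8_t)value;
    out[1] = (uint8_t)(value >> 8);
    out[2] = (uint8_t)(value >> 16);
    out[3] = (uint8_t)(value >> 24);
}
```

For the value 0x11223344, the big endian layout from lowest to highest address is 11 22 33 44, and the little endian layout is 44 33 22 11.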
Therefore, a need exists for the ability to process operations that specify a length that is not a multiple of the implemented vector length and to support both big and little endian processing.