This relates generally to graphics processing and particularly to optimizing structures used in graphics processing.
An example of a structure used in graphics processing is a vector load/store for an OpenCL (Open Computing Language) graphics processing unit backend. See OpenCL 2.1 revision 231 Nov. 11, 2015, available in the Khronos Registry. Vector load and store structures are used in OpenCL to allow reading and writing vector types from a pointer to memory. In addition, these load/store operations are an important source of input/output-intensive workloads.
For many graphics processing unit backend implementations, an implicitly vectorizing method is used to parallelize a kernel. As a result, the vector type that the user uses in the OpenCL kernel code is translated into a vector in which the first level is the implicit single instruction multiple data (SIMD) lane, and each lane contains one vector defined in the OpenCL kernel code.
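The two-level arrangement described above can be pictured with a minimal sketch. It assumes a SIMD width of 16 and an int4 user vector. The names `SIMD_WIDTH`, `user_vector`, and `simd_value` are hypothetical and chosen only for illustration; they do not come from any OpenCL implementation.

```python
# Illustrative model only: a two-level value in which the outer index is the
# implicit SIMD lane and the inner index is the user-level vector component.
SIMD_WIDTH = 16      # assumed number of implicit SIMD lanes
VECTOR_LENGTH = 4    # assumed user-level vector type: int4


def user_vector(work_item_id):
    """Hypothetical int4 value that one work-item holds in the kernel code."""
    return [work_item_id * VECTOR_LENGTH + c for c in range(VECTOR_LENGTH)]


# First level: SIMD lane. Second level: the vector written in the kernel code.
simd_value = [user_vector(lane) for lane in range(SIMD_WIDTH)]

# Component 2 of the int4 held by lane 5.
component = simd_value[5][2]
```

In this picture, one work-item maps to one lane, so the user never writes the outer level explicitly; the backend introduces it when it vectorizes the kernel.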
Typically, the graphics processing unit backend cannot handle the full length of the user-level vector. For example, even though a cache line size of 64 bytes, or 16 double words (int16), is very common, some graphics processing units can only handle 4 double words (int4) at a time. Thus, translating an int16 input involves four int4 store instructions. If the input is processed at SIMD 16, then each int4 store instruction stores 16 int4 vectors at a time, one per lane. Each int4 vector occupies a contiguous address range, but the four int4 vectors may be scattered to random addresses. This scattering of the vectors to random addresses results in inefficient retrieval.
Since the original destination is an int16 that, in some embodiments, fills a complete cache line, the whole cache line is now split into four pieces written by four different writes, which is extremely cache unfriendly.
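The splitting described above can be checked with a short address calculation. This is a sketch under stated assumptions, not a model of any particular hardware: it assumes 4-byte double words, a 64-byte cache line, SIMD 16, and, for the example run only, that lane i stores its int16 at a cache-line-aligned address i * 64. The function names are hypothetical.

```python
CACHE_LINE_BYTES = 64   # one int16 (16 double words) fills one line
DWORD_BYTES = 4
CHUNK_DWORDS = 4        # assumed hardware limit: one int4 per lane per store
SIMD_WIDTH = 16


def split_int16_store(lane_bases):
    """Return, for each int4 store instruction, its (lane, address) writes."""
    chunk_bytes = CHUNK_DWORDS * DWORD_BYTES
    chunks_per_vector = CACHE_LINE_BYTES // chunk_bytes
    return [
        [(lane, base + k * chunk_bytes) for lane, base in enumerate(lane_bases)]
        for k in range(chunks_per_vector)
    ]


def writes_per_cache_line(instructions):
    """Count how many separate store instructions touch each cache line."""
    counts = {}
    for writes in instructions:
        touched = {address // CACHE_LINE_BYTES for _, address in writes}
        for line in touched:
            counts[line] = counts.get(line, 0) + 1
    return counts


# Assumption for this run: lane i writes its int16 at byte address i * 64.
lane_bases = [lane * CACHE_LINE_BYTES for lane in range(SIMD_WIDTH)]
instructions = split_int16_store(lane_bases)
counts = writes_per_cache_line(instructions)
```

Under these assumptions, `instructions` holds four store instructions, each writing 16 bytes into 16 different cache lines, and `counts` shows every one of the 16 cache lines being visited by four separate instructions rather than being written once as a whole.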