Modern processors often include instructions for operations that are computationally intensive but offer a high level of data parallelism, which an efficient implementation can exploit using various data storage devices such as, for example, single instruction multiple data (SIMD) vector registers.
Vectorizing an application or software code may include making the application compile, install, and/or run on specific systems or instruction-set architectures such as, for example, a wide or large width vector architecture. For some applications, memory access may be complex, inconsistent, or noncontiguous, for example as vector widths increase (e.g., for operations such as three dimensional (3D) image rendering). Data used by vectorized processes may be stored in noncontiguous or non-adjacent memory locations. A number of architectures may require extra instructions to order data in the registers before any arithmetic operations are performed, which reduces instruction throughput and significantly increases the number of clock cycles required.
Mechanisms for improving memory access and for ordering data moved to and from wider vectors may include gather and scatter operations, which provide local, contiguous access to data that resides in non-local and/or noncontiguous memory locations. Gather operations may collect data from a set of noncontiguous or random memory locations in a storage device and combine the disparate data into a packed structure. Scatter operations may disperse elements in a packed structure to a set of noncontiguous or random memory locations. Some of these memory locations may not be cached, or may have been paged out of physical memory.
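By way of illustration only, the semantics of gather and scatter operations may be sketched as the scalar reference loops below. The function names gather_f32 and scatter_f32, the use of 32-bit floating point elements, and the index array idx are assumptions made for this sketch and do not correspond to any particular instruction set; a vector architecture would perform the same data movement on a whole register rather than one element per loop iteration.

```c
#include <assert.h>
#include <stddef.h>

/* Gather: read n elements from the noncontiguous locations base[idx[i]]
 * and pack them contiguously into dst[0..n-1].
 * (Hypothetical helper for illustration, not a real intrinsic.) */
void gather_f32(float *dst, const float *base, const size_t *idx, size_t n)
{
    assert(dst != NULL && base != NULL && idx != NULL);
    for (size_t i = 0; i < n; ++i)
        dst[i] = base[idx[i]];   /* one memory access per element */
}

/* Scatter: write the packed elements src[0..n-1] out to the
 * noncontiguous locations base[idx[i]].
 * (Hypothetical helper for illustration, not a real intrinsic.) */
void scatter_f32(float *base, const float *src, const size_t *idx, size_t n)
{
    assert(base != NULL && src != NULL && idx != NULL);
    for (size_t i = 0; i < n; ++i)
        base[idx[i]] = src[i];   /* one memory access per element */
}
```

For example, gathering with indices 7, 0, 3, 5 packs the elements at those four positions into consecutive slots of the destination, and scattering with the same indices returns them to their original positions. Each element requires its own memory access, any one of which may miss the cache or touch a page that is not resident.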
If a gather operation is interrupted by a page fault or for some other reason, some architectures may not save the state of the machine, requiring the entire gather operation to be repeated rather than restarted at the point where it was interrupted. Because any gather operation may require multiple memory accesses, many clock cycles may be needed for completion, during which any subsequent dependent arithmetic operations must wait. Such delays represent a bottleneck that may limit the performance advantages otherwise expected from, for example, a wide or large width vector architecture.
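The cost of repeating an interrupted gather, as opposed to restarting it where it was interrupted, may be illustrated with the simplified software model below. The function gather_resume, the per-element done flags, and the stop_at parameter that simulates an interruption are all assumptions introduced for this sketch; they model the bookkeeping only and do not describe how any particular architecture saves or discards its state.

```c
#include <assert.h>
#include <stddef.h>

/* Loads every element whose done[i] flag is still clear. The call stops
 * early when it reaches a pending element at index stop_at, which
 * simulates an interruption such as a page fault (pass n for no
 * interruption). Returns the number of element loads performed.
 * (Hypothetical model for illustration only.) */
size_t gather_resume(float *dst, const float *base, const size_t *idx,
                     unsigned char *done, size_t n, size_t stop_at)
{
    size_t loads = 0;
    assert(dst != NULL && base != NULL && idx != NULL && done != NULL);
    for (size_t i = 0; i < n; ++i) {
        if (done[i])
            continue;            /* already gathered before the interruption */
        if (i == stop_at)
            break;               /* simulated interruption */
        dst[i] = base[idx[i]];
        done[i] = 1;             /* record progress for a later restart */
        ++loads;
    }
    return loads;
}
```

In this model, a four-element gather interrupted at its fourth element performs three loads before stopping. If the done flags are preserved, the second call performs only the one remaining load, for four loads in total. If the flags are cleared to model lost state, the second call repeats all four loads, for seven in total.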
To date, potential solutions to such performance limiting issues and bottlenecks have not been adequately explored.