Modern processors often include instructions for operations that are computationally intensive but offer a high level of data parallelism, which can be exploited through an efficient implementation using various data storage devices such as, for example, single instruction multiple data (SIMD) vector registers.
For some applications, for example, operations such as three dimensional (3D) image rendering, memory access may be complex, inconsistent, or noncontiguous. The memory being used by vectorized processes may not always be contiguous or in adjacent memory locations. A number of architectures may require extra instructions to order data in the registers before performing any arithmetic operations, which reduces instruction throughput and significantly increases the number of clock cycles required.
Mechanisms for improving memory access and ordering data to and from wider vectors may include implementing gathering and scattering operations for generating local contiguous memory access for data from other non-local and/or noncontiguous memory locations. Gather operations may collect data from a set of noncontiguous or random memory locations in a storage device and combine the disparate data into a packed structure. Scatter operations may disperse elements in a packed structure to a set of noncontiguous or random memory locations.
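The semantics of the gather and scatter operations described above may be illustrated, for example, by the following scalar sketch in C. It is not taken from any particular architecture; the function names `gather_i32` and `scatter_i32`, the 32-bit element type, and the use of an index array relative to a base pointer are assumptions made only for illustration. A hardware implementation may perform the individual accesses in parallel rather than in a loop.

```c
#include <stddef.h>

/* Gather: collect elements from noncontiguous locations base[idx[i]]
 * into the packed (contiguous) array dst. Illustrative names only. */
void gather_i32(int *dst, const int *base, const size_t *idx, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = base[idx[i]];
}

/* Scatter: disperse the packed array src to noncontiguous locations
 * base[idx[i]]. If two indices coincide, the later element overwrites
 * the earlier one in this sequential sketch. */
void scatter_i32(int *base, const int *src, const size_t *idx, size_t n)
{
    for (size_t i = 0; i < n; i++)
        base[idx[i]] = src[i];
}
```

Each element requires its own memory access, which is why a single gather or scatter may take many clock cycles to complete.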
Additionally, some of these memory locations may not be cached, or may have been paged out of physical memory. If a gather operation is interrupted for a page fault or some other reason, then with some architectures the state of the machine may not be saved, requiring a repeat of the entire gather operation rather than a restart from the point where the gather operation was interrupted. Since multiple memory accesses may be required for any gather operation, many clock cycles may be required for completion, during which any subsequent dependent arithmetic operations must necessarily wait. Such delays represent a bottleneck, which may limit the performance advantages otherwise expected, for example, from a wide or large width vector architecture.
Alternative mechanisms for improving memory access and ordering data to and from wider vectors may include causing parallel loads or stores of separated words to or from a data vector using different memory chips in a computer. Again, some of these memory locations may have been paged out of physical memory, and so the issues remain for restarting operations that are interrupted for a page fault or some other reason, but in this case the loads or stores may be executing in parallel. Hence, resolving such faults in a correct order may be difficult or may require serialization, and all of the loads or stores may need to be completed prior to the resolving of such faults.
Some mechanisms may include implementing gathering and scattering using completion masks to track the completion of the individual loads and stores, respectively. However, the physical register storage for vector registers and completion masks may be closer to execution units with wide data paths for performing SIMD type arithmetic than to, for example, address generation logic for accessing memory. In such cases, generating addresses for accessing non-local and/or noncontiguous memory locations from individual data elements in the vector registers, and tracking the individual completion masks, could also reduce the benefits expected from performing a wide SIMD type gather or scatter operation.
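One way a completion mask may permit a gather to be restarted, rather than repeated, is sketched below in C. This is a simplified software model under stated assumptions, not a description of any specific processor: the function name `masked_gather_i32`, the convention that a set mask bit means "element still pending", the limit of eight elements, and the `resident` array standing in for whether a location is currently accessible without a fault are all hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Gather up to 8 elements, tracking progress in *mask.
 * Assumed convention: bit i set means element i is still pending.
 * resident[k] models whether base[k] can be accessed without a fault
 * (a hypothetical stand-in for page presence).
 * Returns true when every requested element has been loaded; returns
 * false at the first "fault", leaving *mask with only the pending bits
 * set so that a later call resumes where this one stopped. */
bool masked_gather_i32(int *dst, const int *base, const bool *resident,
                       const size_t *idx, uint8_t *mask, size_t n)
{
    for (size_t i = 0; i < n && i < 8; i++) {
        uint8_t bit = (uint8_t)(1u << i);
        if (!(*mask & bit))
            continue;               /* already completed earlier */
        if (!resident[idx[i]])
            return false;           /* simulated fault: stop, keep state */
        dst[i] = base[idx[i]];
        *mask &= (uint8_t)~bit;     /* mark element i as completed */
    }
    return true;
}
```

In this model, a caller that receives `false` may resolve the fault (here, by marking the location resident) and invoke the function again with the same mask; elements whose bits were already cleared are skipped instead of being reloaded. The per-element address generation and mask update in the loop body correspond to the bookkeeping that, as noted above, may erode the benefit of a wide gather.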
To date, potential solutions to such performance limiting issues and bottlenecks have not been adequately explored.