Modern processors often include instructions to provide operations that are computationally intensive, but offer a high level of data parallelism that can be exploited through an efficient implementation using various data storage devices, such as, for example, single instruction multiple data (SIMD) vector registers.
Vectorizing an application or software code may include making the application compile, install, and/or run on specific systems or instruction-set architectures, such as, for example, a wide or large width vector architecture. For some applications, memory access may be complex, inconsistent, or noncontiguous, for example, as vector widths increase (e.g., for operations such as three dimensional (3D) image rendering). Memory used for vectorized processes may be stored in noncontiguous or non-adjacent memory locations. A number of architectures may require extra instructions to order data in the registers before performing any arithmetic operations, which reduces instruction throughput and significantly increases the number of clock cycles required.
Mechanisms for improving memory access and ordering data to and from wider vectors may include implementing gathering and scattering operations for generating local contiguous memory access for data from other non-local and/or noncontiguous memory locations. Gather operations may collect data from a set of noncontiguous or random memory locations in a storage device and combine the disparate data into a packed structure. Scatter operations may disperse elements in a packed structure to a set of noncontiguous or random memory locations. Other mechanisms may include loading and storing with a regular stride to collect data from a set of noncontiguous memory locations in a storage device and combine the data into a packed structure, or to disperse elements in a packed structure to a set of noncontiguous memory locations in a storage device. Still other mechanisms may include loading and storing to collect data from a set of contiguous memory locations in a storage device and distribute the data sparsely into a vector structure, or to consolidate elements in a sparse vector structure to a set of contiguous memory locations in a storage device. Vectorizing an application or software code using such mechanisms may include conditionally loading and storing memory locations using predication masks. Some of these memory locations may not be cached, or may have been paged out of physical memory.
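The gather and scatter semantics described above can be illustrated with a minimal scalar sketch. The function names and signatures below are hypothetical, chosen only for explanation; hardware implementations perform these element accesses with vector instructions rather than a loop.

```python
def gather(memory, indices):
    """Collect data from noncontiguous locations into a packed structure."""
    return [memory[i] for i in indices]

def scatter(memory, indices, values):
    """Disperse elements of a packed structure to noncontiguous locations."""
    for i, v in zip(indices, values):
        memory[i] = v

# Gather elements 6, 1, 4 from noncontiguous positions into a packed list,
# then scatter two packed values back out to positions 0 and 2.
memory = [0, 10, 20, 30, 40, 50, 60, 70]
packed = gather(memory, [6, 1, 4])   # -> [60, 10, 40]
scatter(memory, [0, 2], [99, 88])    # memory[0] becomes 99, memory[2] becomes 88
```

A strided load is the special case where the index list is an arithmetic sequence (e.g., every fourth element), whereas a general gather permits arbitrary, random indices.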
If these operations are interrupted in the middle of loading and storing memory locations, for example by a page fault, then with some architectures the state of the machine may not be saved, requiring a repeat of the entire operation rather than a restart from the point of interruption. Since multiple memory accesses may be required on any repeated operation, many clock cycles may be required for completion, for which any subsequent dependent arithmetic operations must necessarily wait. In some alternative architectures, the state of the machine may instead be saved after each successful load or store of a memory location, but any faults that occur may be deferred until retirement of the vector instruction. In the case of conditionally loading and storing memory locations using predication masks, some of the faults or exceptions that occur may correspond to hypothetical memory accesses for which a predication mask may have been used to suppress the actual memory accesses. Yet processing of any deferred faults or exceptions at retirement of the vector instruction may incur delays even when a predication mask had been used to suppress the faulting memory accesses. Such delays represent a bottleneck, which may limit the performance advantages otherwise expected from, for example, a wide or large width vector architecture.
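The intended behavior of a predication mask, under which an inactive lane never performs its memory access and therefore cannot fault, can be sketched as follows. This is an illustrative scalar model with hypothetical names, not a description of any particular architecture's implementation; here an access to a missing key stands in for an access that would fault.

```python
def masked_gather(memory, indices, mask, merge):
    """Gather only the active lanes of a vector.

    For each lane: if its mask bit is set, load from memory; otherwise
    keep the corresponding merge value and perform no memory access at
    all, so an invalid address under a cleared mask bit cannot fault.
    """
    out = []
    for i, m, old in zip(indices, mask, merge):
        if m:
            out.append(memory[i])  # active lane: real memory access
        else:
            out.append(old)        # inactive lane: access suppressed
    return out

# Sparse dict models memory; index 9 is unmapped and would fault if touched.
memory = {0: 5, 2: 7}
result = masked_gather(memory, [0, 9, 2], [1, 0, 1], [-1, -1, -1])
# -> [5, -1, 7]; index 9 is never dereferenced because its mask bit is 0
```

An architecture that instead records faults for all lanes and defers their processing to instruction retirement may still pay a delay for lane 1 above, even though its access was suppressed, which is the bottleneck the preceding paragraph describes.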
To date, potential solutions to such performance limiting issues and bottlenecks have not been adequately explored.