Modern processors often include instructions that provide operations which are computationally intensive but offer a high level of data parallelism. That parallelism can be exploited through an efficient implementation using various data storage devices such as, for example, single instruction multiple data (SIMD) vector registers.
One potential performance issue for processors using SIMD vector registers is that data stored in physical memory may be laid out in a way that requires it to be rearranged in the vector registers before the desired memory and/or SIMD arithmetic operations can be applied. Examples include data at misaligned addresses, data spanning the end of one cache line and the beginning of the next, data in separate entries of a table, and data that crosses block boundaries in an image.
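For illustration only, the first two layout conditions named above (a misaligned address, and data split across two cache lines) can be expressed as simple address checks. The following is a minimal sketch and not part of any described implementation: the 64-byte cache line size is an assumption, and the helper names is_misaligned and straddles_cache_line are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumption: 64-byte cache lines. Actual line size depends on the processor. */
#define CACHE_LINE_BYTES 64u

/* Hypothetical helper: returns 1 if addr is not a multiple of width.
 * width must be a nonzero power of two (e.g. 16 for a 128-bit vector). */
static int is_misaligned(uintptr_t addr, size_t width)
{
    assert(width != 0 && (width & (width - 1)) == 0);
    return (addr & (width - 1)) != 0;
}

/* Hypothetical helper: returns 1 if the byte range [addr, addr + width)
 * begins in one cache line and ends in the next. */
static int straddles_cache_line(uintptr_t addr, size_t width)
{
    assert(width != 0 && width <= CACHE_LINE_BYTES);
    uintptr_t first_line = addr / CACHE_LINE_BYTES;
    uintptr_t last_line = (addr + width - 1) / CACHE_LINE_BYTES;
    return first_line != last_line;
}
```

Under these assumptions, a 16-byte access at address 0x1004 is misaligned but stays within one cache line, whereas a 16-byte access at address 0x1038 occupies the last 8 bytes of one line and the first 8 bytes of the next.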
Some processors in the past have implemented instructions to handle certain special cases of these potential performance issues, such as handling misaligned addresses or performing special rearrangements for a particular transformation. However, implementations that handle only certain special cases may be difficult to adapt more generally, and/or may require either more specialized circuitry or preprocessing of data to make such an adaptation. Such implementations may limit the performance advantages otherwise expected, for example, from a wide vector architecture.
To date, potential solutions to such performance-limiting issues and bottlenecks have not been adequately explored.