1. Field of the Invention
The present invention relates to techniques for improving computer system performance. More specifically, the present invention relates to instructions for efficiently accessing a vector located at an arbitrarily aligned memory address.
2. Related Art
In Single-Instruction-Multiple-Data (SIMD) vector processors, accessing a vector in memory that is not naturally aligned (i.e., which resides at an address that is not an integer multiple of the vector length in bytes) is an inefficient multi-step process, which is complicated by the need to handle edge cases without producing spurious virtual-memory faults. For example, see FIG. 1A, which illustrates both an aligned vector 102 and an unaligned vector 104.
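By way of illustration only, the natural-alignment test described above can be sketched in C as follows. The function names and the 16-byte vector length used in the usage note are assumptions made for this sketch and are not taken from the figures.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Number of bytes by which addr lies past the previous vector boundary.
 * vec_bytes is the vector length in bytes (hypothetical parameter). */
size_t misalignment(const void *addr, size_t vec_bytes)
{
    assert(vec_bytes != 0);  /* guard against division by zero */
    /* Converting a pointer to uintptr_t is implementation-defined, but it
     * yields the flat address on mainstream platforms. */
    return (size_t)((uintptr_t)addr % vec_bytes);
}

/* Returns 1 if addr is an integer multiple of vec_bytes, i.e. the vector
 * starting at addr is naturally aligned; returns 0 otherwise. */
int is_naturally_aligned(const void *addr, size_t vec_bytes)
{
    return misalignment(addr, vec_bytes) == 0;
}
```

With an assumed vector length of 16 bytes, an address of 32 is naturally aligned, whereas an address of 35 is misaligned by 3 bytes.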
Referring to FIG. 1B, conventional practice for manipulating unaligned vectors is to process the beginning portion 108 and the ending portion 110 of the unaligned vectors separately, thereby allowing the large inner portion of the vectors to be processed as naturally aligned vectors 112, which is much faster. This process is complicated and is not possible in all systems and applications.
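By way of illustration only, this conventional practice can be sketched in C as follows. The sketch assumes a 16-byte vector length and a simple operation that adds a constant to every byte. The names add_to_bytes and add_aligned_block are hypothetical, and the inner loop of add_aligned_block merely stands in for a single aligned vector instruction.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define VEC_BYTES 16u  /* assumed vector length in bytes */

/* Stand-in for one aligned vector operation on VEC_BYTES bytes. */
static void add_aligned_block(uint8_t *p, uint8_t k)
{
    for (size_t i = 0; i < VEC_BYTES; i++)
        p[i] = (uint8_t)(p[i] + k);
}

/* Adds k to each of the n bytes starting at data, which may be unaligned. */
void add_to_bytes(uint8_t *data, size_t n, uint8_t k)
{
    size_t i = 0;
    assert(data != NULL || n == 0);

    /* Beginning portion: scalar steps until an aligned address is reached. */
    while (i < n && ((uintptr_t)(data + i) % VEC_BYTES) != 0) {
        data[i] = (uint8_t)(data[i] + k);
        i++;
    }

    /* Inner portion: whole, naturally aligned vectors. */
    while (n - i >= VEC_BYTES) {
        add_aligned_block(data + i, k);
        i += VEC_BYTES;
    }

    /* Ending portion: remaining bytes that do not fill a vector. */
    while (i < n) {
        data[i] = (uint8_t)(data[i] + k);
        i++;
    }
}
```

Note that three separate loops are required even for this trivial operation, which illustrates the complexity of the conventional practice.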
Many processors that support vector data types provide memory-access instructions that automatically handle misalignment by loading vector data from unaligned addresses into vector registers or storing data from vector registers to unaligned addresses. For example, FIG. 1B illustrates how an unaligned vector that spans two registers is aligned to fit into a single register. This approach places the burden of aligning data on the processor hardware, which must perform multiple aligned memory accesses and must assemble elements from each access into a coherent vector. Note that this technique requires additional hardware and is consequently more expensive. Furthermore, this technique is inefficient for streaming because it discards good data during the streaming process.
Explicitly handling alignment in software (rather than in hardware) is even less efficient because it involves executing multiple load-store and bit-manipulation instructions for each vector of data that is processed.
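By way of illustration only, one possible software realignment sequence can be sketched in C as follows. The sketch assumes a 16-byte vector length and models a vector register as a byte array. The names vec_t, load_aligned, and load_unaligned_sw are hypothetical. The sketch further assumes that the aligned blocks containing the vector are readable, because the first aligned load also reads bytes that precede the requested address.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define VEC_BYTES 16u  /* assumed vector length in bytes */

/* Models a vector register as VEC_BYTES bytes. */
typedef struct { uint8_t b[VEC_BYTES]; } vec_t;

/* Stand-in for an aligned vector load; p must be a multiple of VEC_BYTES. */
static vec_t load_aligned(const uint8_t *p)
{
    vec_t v;
    memcpy(v.b, p, VEC_BYTES);
    return v;
}

/* Loads VEC_BYTES bytes from an arbitrarily aligned address p using only
 * aligned loads followed by a byte-wise shift-and-merge step. */
vec_t load_unaligned_sw(const uint8_t *p)
{
    assert(p != NULL);
    size_t shift = (size_t)((uintptr_t)p % VEC_BYTES);
    const uint8_t *base = p - shift;   /* previous aligned boundary */

    vec_t lo = load_aligned(base);
    if (shift == 0)
        return lo;  /* already aligned: do not touch the next block */

    vec_t hi = load_aligned(base + VEC_BYTES);

    /* Merge the upper part of lo with the lower part of hi. */
    vec_t out;
    for (size_t i = 0; i < VEC_BYTES; i++) {
        size_t j = i + shift;
        out.b[i] = (j < VEC_BYTES) ? lo.b[j] : hi.b[j - VEC_BYTES];
    }
    return out;
}
```

In this sketch, each unaligned vector costs two aligned loads plus a merge step. The second load is skipped when the address is already aligned, so that the block following an aligned vector is never accessed unnecessarily.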
It is also common for some types of code, such as mathematical kernels, to be implemented in several variants, each handling a different alignment case as efficiently as possible. Developing multiple variants is time-consuming and error-prone, increases the debugging effort, and lengthens the development process. Furthermore, the variants handling unaligned data are less efficient than the aligned variants. This difference in efficiency can cause performance variations that depend on the alignment of the data.
Hence, what is needed is a technique for efficiently accessing unaligned vectors without the above-described problems.