1. Field of the Invention
The present invention relates to techniques for improving computer system performance. More specifically, the present invention relates to instructions for efficiently accessing a partial vector located at an arbitrarily aligned memory address.
2. Related Art
In Single-Instruction-Multiple-Data (SIMD) vector processors, accessing a vector in memory that is not naturally aligned (i.e., which resides at an address that is not an integer multiple of the vector length in bytes) is an inefficient multi-step process, which is complicated by the need to handle edge cases without producing spurious virtual-memory faults. For example, see FIG. 1A, which presents a block diagram illustrating an existing memory 100 that includes both an aligned vector 102 and an unaligned vector 104.
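The definition of natural alignment given above can be illustrated with a minimal sketch in C. The 16-byte vector length and the helper names `is_naturally_aligned` and `blocks_touched` are assumptions chosen for illustration only; the text does not fix a vector length.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed vector length in bytes (illustrative; not specified by the text). */
#define VECTOR_BYTES 16u

/* A vector is naturally aligned when its address is an integer
 * multiple of the vector length in bytes. */
bool is_naturally_aligned(uintptr_t addr)
{
    return (addr % VECTOR_BYTES) == 0;
}

/* Number of aligned, vector-sized memory blocks that one
 * vector-length access starting at addr overlaps. */
unsigned blocks_touched(uintptr_t addr)
{
    return is_naturally_aligned(addr) ? 1u : 2u;
}
```

Under these assumptions, an unaligned vector always straddles two aligned blocks, which is one reason an unaligned access may fault on a page that an aligned access of the same data would never touch.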
Referring to FIG. 1B, which presents a block diagram illustrating an existing alignment process, conventional practice for manipulating unaligned vectors is to process the beginning 108 and ending 110 portions of the unaligned vectors separately, thereby allowing the large inner portion of the vectors to be processed as a naturally aligned vector 112, which is much faster. However, this process is complicated, and it cannot be performed in all systems and applications.
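The conventional practice described above can be sketched roughly as follows in C. This is an illustrative sketch only: the byte-summing operation, the 16-byte vector length, and the names `sum_bytes` and `sum_aligned_block` are hypothetical, and the inner block loop merely stands in for a real aligned vector instruction.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumed vector length in bytes (illustrative). */
#define VECTOR_BYTES 16u

/* Stand-in for a single vector operation on a naturally aligned block. */
static uint32_t sum_aligned_block(const uint8_t *block)
{
    uint32_t s = 0;
    for (size_t i = 0; i < VECTOR_BYTES; i++)
        s += block[i];
    return s;
}

/* Sums len bytes starting at an arbitrarily aligned address. */
uint32_t sum_bytes(const uint8_t *data, size_t len)
{
    uint32_t total = 0;
    size_t i = 0;

    /* Beginning portion: elements before the first aligned boundary. */
    size_t misalign = (size_t)((uintptr_t)data % VECTOR_BYTES);
    size_t head = misalign ? VECTOR_BYTES - misalign : 0;
    if (head > len)
        head = len;
    for (; i < head; i++)
        total += data[i];

    /* Inner portion: whole, naturally aligned vectors. */
    for (; i + VECTOR_BYTES <= len; i += VECTOR_BYTES)
        total += sum_aligned_block(data + i);

    /* Ending portion: elements after the last whole aligned vector. */
    for (; i < len; i++)
        total += data[i];

    return total;
}
```

Even in this small sketch, three separate loops and their boundary arithmetic are needed for a single logical operation, which hints at why the approach is considered complicated.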
Many processors that support vector data types provide memory-access instructions that automatically handle misalignment by loading vector data from unaligned addresses into vector registers or storing data from vector registers into unaligned addresses. For example, FIG. 2B presents a block diagram of alignment circuitry 204 illustrating how an unaligned vector which spans two or more registers is aligned to fit into a single register. However, this approach places the burden of aligning data on the processor hardware, which must perform multiple aligned memory accesses and must assemble elements from each access into a coherent vector. Note that this hardware-based technique requires additional hardware and is consequently more expensive. Furthermore, this technique is inefficient for streaming because it discards good data during the streaming process.
Explicitly handling alignment in software (rather than in hardware) is even less efficient because it involves executing multiple load-store and bit-manipulation instructions for each vector of data that is processed.
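To make the cost concrete, the following C sketch emulates one unaligned vector load using only aligned accesses followed by a merge step. It is a byte-level model under assumed names (`vec_t`, `load_aligned`, `load_unaligned_sw`) and an assumed 16-byte vector length, not the instruction sequence of any particular processor.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Assumed vector length in bytes (illustrative). */
#define VECTOR_BYTES 16u

/* Hypothetical model of a vector register. */
typedef struct { uint8_t b[VECTOR_BYTES]; } vec_t;

/* Models an aligned vector load; the address must be naturally aligned. */
static vec_t load_aligned(const uint8_t *aligned_addr)
{
    vec_t v;
    memcpy(v.b, aligned_addr, VECTOR_BYTES);
    return v;
}

/* Emulates an unaligned load with two aligned loads and a merge. */
vec_t load_unaligned_sw(const uint8_t *p)
{
    size_t offset = (size_t)((uintptr_t)p % VECTOR_BYTES);
    const uint8_t *base = p - offset;   /* enclosing aligned block */

    vec_t lo = load_aligned(base);
    if (offset == 0)
        return lo;  /* already aligned: do not touch the next block */

    vec_t hi = load_aligned(base + VECTOR_BYTES);

    /* Merge step: take the upper bytes of lo, then the lower bytes of hi. */
    vec_t out;
    for (size_t i = 0; i < VECTOR_BYTES; i++) {
        size_t src = i + offset;
        out.b[i] = (src < VECTOR_BYTES) ? lo.b[src]
                                        : hi.b[src - VECTOR_BYTES];
    }
    return out;
}
```

In this model, every vector of data costs two loads plus a per-element merge. The early return for the aligned case also illustrates the kind of edge-case handling needed to avoid reading a block that the program never requested.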
It is also common for some types of code, such as mathematical kernels, to be implemented in several variants, each handling a different alignment case as efficiently as possible. This approach is time-consuming and error-prone, and also increases the debugging effort and lengthens the development process. Furthermore, the variants which handle unaligned data are less efficient than the aligned variants. This difference in efficiency can cause performance variations that depend on the alignment of the data.
Hence, what is needed is a technique for efficiently accessing unaligned vectors without the above-described problems.