1. Field of the Invention
The present invention relates to the field of computer memory, and more particularly to cache memory.
2. Description of Background
An unaligned data access occurs when the requested data span multiple cache lines in a cache memory device. Traditionally, when data are not aligned to fit in a single cache line, two separate loads are required, along with logic for extracting and merging the requested data from the two cache lines. A large number of unaligned accesses can degrade the performance of a processor core.
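By way of illustration only, the spanning condition described above can be sketched in C. The 128-byte line size and the function name `spans_cache_lines` are assumptions chosen for this example and do not come from any particular processor.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed line size for illustration only; real caches differ. */
#define CACHE_LINE_BYTES 128u

/* Hypothetical helper: returns true when an access of `size` bytes
 * starting at `addr` touches more than one cache line. */
static bool spans_cache_lines(uintptr_t addr, size_t size)
{
    /* The sketch only covers accesses no larger than one line. */
    assert(size > 0 && size <= CACHE_LINE_BYTES);

    uintptr_t first_line = addr / CACHE_LINE_BYTES;              /* line holding the first byte */
    uintptr_t last_line  = (addr + size - 1) / CACHE_LINE_BYTES; /* line holding the last byte  */
    return first_line != last_line;
}
```

Under these assumptions, an 8-byte access at address 120 stays within one line, while the same access at address 121 crosses into the next line. The check needs the computed address, which is why hardware can only evaluate it after address generation.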
Frequently, a compiler cannot determine ahead of time whether a data address will avoid spanning multiple cache lines. To access data that cross two cache lines, both cache lines need to be accessed, and the data have to be merged. This can be accomplished with several different techniques:
1. Sequentially, e.g., using microcode: In this approach, when an unaligned data access is detected, a microcode sequence is initiated. The request is translated into two consecutive load requests, and the data are then assembled. The fact that the data are unaligned is known only after the data address is computed, thus requiring a flush and replay. While the performance penalty of this approach is much lower than that of invoking an exception, this solution is slow and requires many cycles to complete. This solution is used in the Power4 microprocessor architecture and other Power architectures (P5, P6, P7).
2. In parallel, e.g., using multi-port caches: This approach uses multi-port caches, which can access two lines at once and merge the data. While very fast, it doubles the number of required cache ports and is power inefficient, because two cache lines are accessed for every cache access, regardless of whether the data span multiple cache lines, since the spanning condition is known only after the address is computed. This has consequences for compilers, which need to provide code for handling unaligned data with minimal performance penalty.
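Both techniques perform the same logical steps: access the first line, access the second line if needed, and merge the requested bytes. The following C sketch models those steps in software. It is not microcode or a hardware description; the 16-byte line size and the names `load_line` and `load_unaligned` are hypothetical, and `mem` is assumed to be a buffer whose length is a multiple of the line size.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Deliberately small, assumed line size so the example is easy to trace. */
#define LINE_BYTES 16u

/* Hypothetical single-line access: copies the whole line containing `addr`. */
static void load_line(const uint8_t *mem, size_t addr, uint8_t line[LINE_BYTES])
{
    size_t base = addr - (addr % LINE_BYTES);  /* start of the containing line */
    memcpy(line, mem + base, LINE_BYTES);
}

/* Reads `size` bytes at `addr` into `out`. One line access is used when the
 * data fit in a single line; two line accesses plus a merge are used when
 * the data span two lines. */
static void load_unaligned(const uint8_t *mem, size_t addr, size_t size,
                           uint8_t *out)
{
    assert(size > 0 && size <= LINE_BYTES);

    uint8_t line[LINE_BYTES];
    size_t offset   = addr % LINE_BYTES;     /* position within the first line */
    size_t in_first = LINE_BYTES - offset;   /* bytes available in that line   */
    if (in_first > size)
        in_first = size;

    /* First load: extract the leading part of the requested data. */
    load_line(mem, addr, line);
    memcpy(out, line + offset, in_first);

    /* Second load and merge, only when the data span two lines. */
    if (in_first < size) {
        load_line(mem, addr + in_first, line);
        memcpy(out + in_first, line, size - in_first);
    }
}
```

In the sequential technique the two `load_line` steps are issued one after the other, whereas a multi-port cache would perform both accesses at once, including for requests that never reach the second branch.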
The cost of unaligned cache access is especially pronounced for Single Instruction, Multiple Data (SIMD) code, where several vector elements are accessed and processed in parallel. To minimize the performance penalty of handling unaligned data in SIMD operations, a shuffle instruction can be used to shuffle data based on their alignment. This introduces a small performance penalty for each SIMD load, regardless of whether the data are aligned, because the shuffle instruction has to be executed every time. If the architecture does not support a data shuffle instruction, a compiler needs to use code versioning to handle aligned and unaligned data separately (for example, generating scalar instructions for unaligned data and SIMD code for aligned data, with the condition detected at run time).
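The code versioning mentioned above can be sketched as a run-time alignment test that selects between two versions of the same loop. This is a minimal portable C illustration and not compiler output. The 16-byte vector width, the names `is_vector_aligned`, `sum_versioned`, and `path_t`, and the use of four-element groups as a stand-in for real SIMD instructions are all assumptions made for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed SIMD register width in bytes (four 32-bit floats). */
#define VECTOR_BYTES 16u

/* Hypothetical tag recording which version of the loop was executed. */
typedef enum { PATH_VECTOR, PATH_SCALAR } path_t;

/* Run-time alignment test: known only once the address is available. */
static int is_vector_aligned(const void *p)
{
    return ((uintptr_t)p % VECTOR_BYTES) == 0;
}

/* Sums `n` floats, choosing the loop version at run time. */
static float sum_versioned(const float *data, size_t n, path_t *taken)
{
    assert(data != NULL && taken != NULL);

    float total = 0.0f;
    size_t i = 0;

    if (is_vector_aligned(data)) {
        /* Aligned version: each group of four elements stands in for one
         * SIMD load followed by one SIMD add. */
        float lane[4] = {0.0f, 0.0f, 0.0f, 0.0f};
        for (; i + 4 <= n; i += 4)
            for (size_t k = 0; k < 4; k++)
                lane[k] += data[i + k];
        total = lane[0] + lane[1] + lane[2] + lane[3];
        *taken = PATH_VECTOR;
    } else {
        /* Unaligned version: fall through to the scalar loop below. */
        *taken = PATH_SCALAR;
    }

    /* Scalar loop: handles all elements in the unaligned version and any
     * leftover tail elements in the aligned version. */
    for (; i < n; i++)
        total += data[i];
    return total;
}
```

The sketch makes the cost of versioning visible: both loop bodies must be generated and kept in the program, and the alignment test is executed on every call even when the data are always aligned.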