1. Field of the Invention
This invention relates to computing systems, and more particularly, to efficiently handling misaligned memory accesses within a processor.
2. Description of the Relevant Art
Modern microprocessors are typically coupled to one or more levels of a cache hierarchy in order to reduce the latency of the microprocessor's requests for data in memory. A request may result from a read or a write operation during the execution of one or more software applications. Generally, a cache may store multiple cache lines, where a cache line holds several bytes of data in contiguous memory locations. A cache line may be treated as a unit for coherency purposes. In addition, a cache line may be a unit of allocation and deallocation in the cache. When the unit of allocation and deallocation in a cache is several bytes, memory accesses may be more efficient and have a smaller latency than when the unit is one or a few bytes.
Generally speaking, a read or a write operation that accesses a chunk of data smaller than a cache line and wholly contained within the cache line may not incur an additional access penalty. The entire cache line is accessed during the operation regardless of the smaller size. For example, a cache line may have a size of 64 bytes. Whether an accessed 4-byte data chunk begins at an offset of 0 or an offset of 50 does not alter the access time. However, if the accessed data chunk is not contained within a single cache line (i.e., it crosses a cache line boundary), then extra work may need to be performed.
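The boundary condition described above can be illustrated with a short sketch. The function name `crosses_line_boundary` and the constant `LINE_SIZE` are hypothetical names chosen for illustration, and the 64-byte line size is taken from the example above; this is not a description of any particular processor's detection logic.

```python
LINE_SIZE = 64  # bytes per cache line, as in the example above (an assumption)


def crosses_line_boundary(offset: int, size: int, line_size: int = LINE_SIZE) -> bool:
    """Return True if an access of `size` bytes starting at `offset` spans two cache lines."""
    first_line = offset // line_size              # line holding the first byte
    last_line = (offset + size - 1) // line_size  # line holding the last byte
    return first_line != last_line
```

Under these assumptions, a 4-byte access at offset 0 or offset 50 stays within one line, while a 4-byte access at offset 62 touches bytes 62 through 65 and therefore crosses into the next line.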
For example, if an accessed 16-byte data chunk begins at an offset of 56 within a 64-byte cache line, then the first 8 bytes of the data are included in one cache line and the remaining 8 bytes are included in the next cache line. In such a case, the processor performs two read operations to obtain the two successive cache lines, instead of a single cache line read operation, for the single load instruction. The processor may also perform at least a third operation to align the requested data, such as one or more rotate operations and join operations.
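The two line reads and the subsequent join can be modeled in software as follows. This is a minimal sketch, assuming a flat byte array stands in for memory and that the access is no larger than one cache line; `read_line` and `misaligned_load` are hypothetical helper names, and slicing plus concatenation stands in for the rotate and join operations performed in hardware.

```python
LINE_SIZE = 64  # bytes per cache line (an assumption matching the example above)


def read_line(memory: bytes, line_index: int) -> bytes:
    """Model one cache line read: return the full line at `line_index`."""
    start = line_index * LINE_SIZE
    return memory[start:start + LINE_SIZE]


def misaligned_load(memory: bytes, address: int, size: int):
    """Return (data, number_of_line_reads) for a load of `size` bytes at `address`.

    Assumes size <= LINE_SIZE, so at most two lines are touched.
    """
    line_index, offset = divmod(address, LINE_SIZE)
    data = read_line(memory, line_index)[offset:offset + size]
    line_reads = 1
    spill = offset + size - LINE_SIZE  # bytes that fall in the next line
    if spill > 0:
        # Second read fetches the successor line; the join appends its leading bytes.
        data += read_line(memory, line_index + 1)[:spill]
        line_reads = 2
    return data, line_reads
```

For a 16-byte load at address 56, the sketch takes 8 bytes from the first line and 8 bytes from the second, for two line reads in total, whereas a 4-byte load at address 50 needs only one.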
The extra processing described above may greatly increase the latency of a read or a write operation, which in turn reduces system performance. In some cases, misaligned memory requests are handled with micro-code support. The micro-code sequences each cache line, where sequencing refers to accessing a second cache line after a first cache line. Sequencing, in addition to the later rotate operations, adds considerable latency and increases code size. Further still, when a memory misalignment is detected for a load instruction, the processor pipeline may need to be flushed to allow the micro-code to properly handle the misalignment. The pipeline flush further reduces system performance.
In view of the above, efficient methods and mechanisms for handling misaligned memory accesses within a processor are desired.