1. Field of the Invention
This invention relates to the field of microprocessors and, more particularly, to mechanisms for fetching data into microprocessors.
2. Description of the Related Art
Superscalar microprocessors achieve high performance by executing multiple instructions per clock cycle and by choosing the shortest possible clock cycle consistent with the design. On the other hand, superpipelined microprocessor designs divide instruction execution into a large number of subtasks which can be performed quickly, and assign pipeline stages to each subtask. By overlapping the execution of many instructions within the pipeline, superpipelined microprocessors attempt to achieve high performance. As used herein, the term "clock cycle" refers to an interval of time accorded to various stages of an instruction processing pipeline within the microprocessor. Storage devices (e.g. registers and arrays) capture their values according to the clock cycle. For example, a storage device may capture a value according to a rising or falling edge of a clock signal defining the clock cycle. The storage device then stores the value until the subsequent rising or falling edge of the clock signal, respectively. The term "instruction processing pipeline" is used herein to refer to the logic circuits employed to process instructions in a pipelined fashion. Although the pipeline may be divided into any number of stages at which portions of instruction processing are performed, instruction processing generally comprises fetching the instruction, decoding the instruction, executing the instruction, and storing the execution results in the destination identified by the instruction.
Superscalar microprocessors demand high memory bandwidth due to the number of instructions executed concurrently and due to the increasing clock frequency (i.e. shortening clock cycle) employed by the superscalar microprocessors. Many of the instructions include memory operations to fetch (read) and update (write) memory operands in addition to the operation defined for the instruction. The memory operands must be fetched from or conveyed to memory, and each instruction must originally be fetched from memory as well. Similarly, superpipelined microprocessors demand high memory bandwidth because of the high clock frequency employed by these microprocessors and the attempt to begin execution of a new instruction each clock cycle. It is noted that a given microprocessor design may employ both superscalar and superpipelined techniques in an attempt to achieve the highest possible performance characteristics.
Microprocessors are often configured into computer systems which have a relatively large, relatively slow main memory. Typically, multiple dynamic random access memory (DRAM) modules comprise the main memory system. The large main memory provides storage for a large number of instructions and/or a large amount of data for use by the microprocessor, providing faster access to the instructions and/or data than may be achieved from disk storage, for example. However, the access times of modern DRAMs are significantly longer than the clock cycle length of modern microprocessors. The memory access time for each set of bytes being transferred to the microprocessor is therefore long. Accordingly, the main memory system is not a high bandwidth system. Microprocessor performance may suffer due to a lack of available memory bandwidth.
In order to relieve the bandwidth requirements on the memory system (or alternatively to increase the number of instructions executable per clock cycle given a fixed amount of available bandwidth), microprocessors typically employ one or more caches to store the most recently accessed data and instructions. Many programs have memory access patterns which exhibit locality of reference, particularly for data (e.g. memory operands used by instructions). A memory access pattern exhibits locality of reference if a memory operation to a particular byte of main memory indicates that memory operations to other bytes located within the main memory at addresses near the address of the particular byte are likely. Generally, a "memory access pattern" is a set of consecutive memory operations performed in response to a program or a code sequence within a program. The addresses of the memory operations within the memory access pattern may have a relationship to each other. For example, the memory access pattern may or may not exhibit locality of reference.
When programs exhibit locality of reference, cache hit rates (i.e. the percentage of memory operations for which the requested byte or bytes are found within the caches) are high and the bandwidth required from the main memory is correspondingly reduced. When a memory operation misses in the cache, the cache line (i.e. a block of contiguous data bytes) including the accessed data is fetched from main memory and stored into the cache. A different cache line may be discarded from the cache to make room for the newly fetched cache line.
Unfortunately, certain code sequences within a program may have a memory access pattern which does not exhibit locality of reference. More particularly, code sequences may access a sequence of data stored at addresses separated by a fixed stride from each other (a "strided memory access pattern"). In other words, a first datum may be accessed at a first address; subsequently, a second datum may be accessed at a second address which is the sum of the fixed stride and the first address; subsequently, a third datum may be accessed at a third address which is the sum of the fixed stride and the second address; etc.
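The address sequence described above can be illustrated with a minimal sketch. The function name, the starting address, and the stride value are hypothetical and chosen only for illustration; they do not come from the description.

```python
def strided_addresses(first_address, stride, count):
    """Return the addresses touched by a strided memory access pattern.

    Each address is the previous address plus the fixed stride.
    """
    return [first_address + i * stride for i in range(count)]


# Hypothetical values: first datum at 0x1000, fixed stride of 256 bytes.
addresses = strided_addresses(first_address=0x1000, stride=256, count=4)
print([hex(a) for a in addresses])
```

With these assumed values, every pair of consecutive addresses differs by exactly the fixed stride of 256 bytes.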
The strided memory access pattern may occur when a code sequence accesses many common data structures. For example, FIG. 1 illustrates a memory 5 in which a two dimensional array of data is stored. In FIG. 1, data stored at larger numerical addresses is shown to the right of data stored at smaller numerical addresses. In a two dimensional array, data is organized into rows and columns. A given datum is located in the array by specifying its row and column number. The storage of the first and second rows of the array within memory 5 is depicted in FIG. 1. Each row comprises elements stored in columns 1 through N (C1 through CN). Each element C1-CN may occupy one or more bytes, and the elements may differ in size from one another. The elements of each row are stored in contiguous memory locations, and the last element (CN) of the first row is stored in a memory location contiguous to the first element of the second row (C1). The arrangement shown in FIG. 1 is often referred to as the "row major" representation of a two dimensional array. The "column major" representation has each element within a given column stored in contiguous memory locations.
For the row major representation of the array, accessing each of the elements of a given column of the array in succession comprises a strided memory access pattern. For example, the C2 elements of successive rows of the array are separated within memory by the number of bytes occupied by one row of the array. In the column major representation of the array, accessing each of the elements of a given row of the array in succession comprises a strided memory access pattern. Furthermore, accessing any multi-dimensional array (2 dimensional, 3 dimensional, etc.) exhibits a strided memory access pattern when the elements stored along any one of the axes of the array are traversed.
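The relationship between array layout and stride can be sketched as follows. This is an illustrative model only: it assumes that every element has the same size (the description permits differing sizes), it uses zero-based row and column indices (so column C2 is index 1), and the array dimensions and function names are hypothetical.

```python
def row_major_address(base, row, col, num_cols, elem_size):
    """Address of element (row, col) when each row is stored contiguously."""
    return base + (row * num_cols + col) * elem_size


def column_major_address(base, row, col, num_rows, elem_size):
    """Address of element (row, col) when each column is stored contiguously."""
    return base + (col * num_rows + row) * elem_size


# Hypothetical array: 4 rows, 8 columns, 4-byte elements, stored at address 0.
BASE, ROWS, COLS, ELEM = 0, 4, 8, 4

# Walking down column C2 (index 1) of a row major array.
column_walk = [row_major_address(BASE, r, 1, COLS, ELEM) for r in range(ROWS)]
column_stride = column_walk[1] - column_walk[0]

# Walking along the first row of a column major array.
row_walk = [column_major_address(BASE, 0, c, ROWS, ELEM) for c in range(COLS)]
row_stride = row_walk[1] - row_walk[0]

print(column_walk, column_stride)
print(row_walk, row_stride)
```

Under these assumptions, the column walk in the row major layout has a stride equal to the size of one row (8 elements of 4 bytes, or 32 bytes), while the row walk in the column major layout has a stride equal to the size of one column (4 elements of 4 bytes, or 16 bytes).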
If the fixed stride manifested by the strided memory access pattern is larger than a cache line, each subsequent access may miss the data cache unless the cache line containing the subsequent access was previously cached. Performance of the code sequence may suffer due to the low cache hit rates associated with the code sequence.
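The effect of the stride on the cache hit rate can be approximated with a simplified model. The sketch below assumes an initially empty cache that never discards a line, so it counts only first-touch misses; a real cache of finite capacity could miss more often. The line size, stride values, and access count are hypothetical.

```python
def count_misses(addresses, line_size):
    """Count misses for an initially empty cache that never evicts a line."""
    cached_lines = set()
    misses = 0
    for address in addresses:
        line = address // line_size  # index of the cache line holding the byte
        if line not in cached_lines:
            misses += 1
            cached_lines.add(line)
    return misses


LINE_SIZE = 32   # hypothetical cache line size in bytes
ACCESSES = 64    # number of memory operations in each pattern

small_stride = [i * 4 for i in range(ACCESSES)]    # stride smaller than a line
large_stride = [i * 64 for i in range(ACCESSES)]   # stride larger than a line

small_misses = count_misses(small_stride, LINE_SIZE)
large_misses = count_misses(large_stride, LINE_SIZE)

print("small stride hit rate:", 1 - small_misses / ACCESSES)
print("large stride hit rate:", 1 - large_misses / ACCESSES)
```

In this model, the 4-byte stride misses once per cache line (8 misses in 64 accesses), whereas the 64-byte stride lands in a new cache line on every access and therefore misses all 64 times.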