Embodiments of the present invention relate to microprocessors and computers. More particularly, embodiments of the present invention relate to address prediction methods and apparatus.
Known microprocessors include pipelined instruction execution engines to increase microprocessor performance. An instruction pipeline can include a plurality of stages that perform instruction execution. For example, a simple instruction pipeline can include four stages: fetch instruction, fetch operands, execute instruction, and store result.
The fetch instruction stage typically retrieves the instruction to be executed based on an instruction pointer (IP) value stored in an instruction pointer register. The instruction pointer identifies the memory address (i.e., location) of the instruction. As each of a series of instructions is executed, the instruction pointer value is typically incremented by an amount (e.g., a constant amount, a variable amount, etc.) to point to the address of the next instruction. At times, a new instruction pointer value can be loaded into the instruction pointer register to execute a specified set of instructions (e.g., when a subroutine is executed, to begin execution of a new program, after execution of a conditional branch instruction, etc.).
An instruction that is commonly executed by a microprocessor is a load instruction. A load instruction typically retrieves a data value from memory to load the data value into a processor register file. A component of microprocessor performance is the load-to-use latency. The load-to-use latency can be dependent on the amount of time required to load a data value from the main memory into a processor register file. When increased amounts of time are required to retrieve data from memory, microprocessor performance can be disadvantageously affected.
One technique known to reduce the load-to-use delay is to implement a memory hierarchy, which can include different levels of memory, where each level has a particular size and speed. A memory hierarchy can include on-chip memory (e.g., a level one cache memory that is on the same semiconductor chip as the microprocessor, a level one cache memory that is a portion of the microprocessor, etc.) and off-chip memory (e.g., a level two cache in a semiconductor chip that is in communication with the microprocessor chip, etc.). Data stored in the on-chip memory typically can be retrieved significantly faster than data stored in the off-chip memory.
Frequently used data can be stored in the on-chip memory to increase microprocessor performance. When a data unit is to be retrieved, the on-chip memory can be checked to determine if the sought data unit is stored within the on-chip memory. When the memory contains the sought data unit, a “hit” has occurred and the data can be retrieved from the memory. When the memory does not contain the sought data unit, a “miss” has occurred and the next level of memory can be checked to determine if the sought data unit is stored in that next level. An exemplary memory hierarchy can include, in order of increasing load-to-use latency, on-chip cache memory (e.g., an on-chip L0 cache, an on-chip L1 cache, an on-chip L2 cache), off-chip cache memory (e.g., an L2 cache, an L3 cache), main memory, etc. Retrieval of data from the lower levels of the memory hierarchy (e.g., the main memory, etc.) usually has significantly higher load-to-use delays than retrieval of data from the higher levels of the memory hierarchy (e.g., on-chip cache memory, etc.).
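The hit/miss lookup described above can be illustrated with a minimal sketch. The class name `MemoryLevel`, the function `load`, the level names, the latency values, and the assumption that a miss at one level simply adds that level's latency before the next level is checked are all hypothetical choices made for illustration; they are not taken from any particular microprocessor.

```python
from typing import Dict, List, Tuple


class MemoryLevel:
    """One level of a hypothetical memory hierarchy (names and latencies are illustrative)."""

    def __init__(self, name: str, latency_cycles: int, contents: Dict[int, int]) -> None:
        self.name = name
        self.latency_cycles = latency_cycles  # assumed cost of checking this level
        self.contents = contents              # address -> data unit held at this level


def load(hierarchy: List[MemoryLevel], address: int) -> Tuple[int, str, int]:
    """Check each level in order; return (data, level that hit, accumulated latency)."""
    total_latency = 0
    for level in hierarchy:
        total_latency += level.latency_cycles
        if address in level.contents:
            # Hit: the sought data unit is stored at this level.
            return level.contents[address], level.name, total_latency
        # Miss: fall through and check the next level.
    raise KeyError(f"address {address:#x} not found at any level")


# Illustrative hierarchy, ordered by increasing load-to-use latency.
hierarchy = [
    MemoryLevel("on-chip L1", 2, {0x10: 7}),
    MemoryLevel("off-chip L2", 10, {0x10: 7, 0x20: 9}),
    MemoryLevel("main memory", 100, {0x10: 7, 0x20: 9, 0x30: 5}),
]
```

In this sketch, a load of address 0x10 hits in the first level, while a load of address 0x30 misses twice and pays the accumulated latency of all three levels.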
Increasing the speed and size of the on-chip cache(s) is a known method to reduce the load-to-use delay. By storing greater amounts of data in the faster, higher levels of memory, the overall load-to-use latency can be reduced by increasing the proportion of data retrievals that retrieve data from faster on-chip cache(s) and reducing the proportion of data retrievals that access slower, lower levels of memory to retrieve the sought data.
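The effect of shifting retrievals toward faster levels can be sketched as a weighted average. The function name `average_latency`, the retrieval proportions, and the latency values below are assumptions chosen only to illustrate the arithmetic; they are not measurements.

```python
import math
from typing import Sequence


def average_latency(fractions: Sequence[float], latencies: Sequence[float]) -> float:
    """Average load-to-use latency: each level's latency weighted by the
    proportion of retrievals that are satisfied at that level."""
    if len(fractions) != len(latencies):
        raise ValueError("one fraction is required per memory level")
    if not math.isclose(sum(fractions), 1.0):
        raise ValueError("fractions must sum to 1")
    return sum(f * l for f, l in zip(fractions, latencies))


# Hypothetical latencies (in cycles) for on-chip cache, off-chip cache, main memory.
latencies = [2, 10, 100]

# Hypothetical proportions of retrievals satisfied at each level.
smaller_cache = average_latency([0.90, 0.08, 0.02], latencies)
larger_cache = average_latency([0.95, 0.04, 0.01], latencies)
```

Under these assumed numbers, raising the on-chip proportion from 90% to 95% lowers the average from roughly 4.6 cycles to roughly 3.3 cycles, because fewer retrievals reach the slower levels.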
Even when data is retrieved from an on-chip cache, the overall load-to-use latency can be dependent on load address generation, e.g., the amount of time taken to generate the memory address of the data to be loaded. For example, an instruction pointer can identify the address in memory where a first load instruction is stored. The first load instruction can be retrieved from the memory based on the instruction pointer. The first load instruction can include source operands that specify the memory location where the data to be loaded can be retrieved, and the actual memory load address may need to be computed based on the source operands.
Generation of the complete load address can be required prior to initiating a cache access (e.g., for larger sized caches). To initiate cache access earlier, load memory addresses can be predicted. When load address prediction is performed early in the pipeline, e.g., at the time the instruction is fetched, cache access based on the predicted address and calculation of the actual address can be overlapped during the front part of the instruction pipeline. This can reduce the load-to-use latency.
Known load address prediction schemes typically only detect regular memory accesses using strides. A stride can be a fixed offset between successive memory load addresses. For example, when data is being accessed from a data array (e.g., a data table), the load address may be incremented by a constant value each time a load instruction is executed. In such an instance, load addresses can be predicted by incrementing the most recent load address by the constant value to generate a predicted address. Many sequences of data, however, are not regular (e.g., the load address of a particular load is not offset by a constant value from the load address of the previous instance of the particular load, etc.). Absent such offsets of a constant value, a stride-predictor cannot operate advantageously.
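The behavior of a stride predictor on regular and irregular address sequences can be illustrated with a minimal sketch. The class `StridePredictor`, the helper `count_correct`, the per-instruction-pointer dictionaries, and the example addresses are hypothetical and stand in for whatever tables a real stride predictor would use.

```python
from typing import Dict, Iterable, Optional


class StridePredictor:
    """Predicts the next load address as the last address plus the last observed stride."""

    def __init__(self) -> None:
        self.last_address: Dict[int, int] = {}  # instruction pointer -> most recent load address
        self.stride: Dict[int, int] = {}        # instruction pointer -> most recent stride

    def predict(self, ip: int) -> Optional[int]:
        if ip in self.last_address and ip in self.stride:
            return self.last_address[ip] + self.stride[ip]
        return None  # not enough history to predict yet

    def update(self, ip: int, actual_address: int) -> None:
        if ip in self.last_address:
            self.stride[ip] = actual_address - self.last_address[ip]
        self.last_address[ip] = actual_address


def count_correct(predictor: StridePredictor, ip: int, addresses: Iterable[int]) -> int:
    """Run one load (identified by ip) over a sequence of addresses and count correct predictions."""
    correct = 0
    for address in addresses:
        if predictor.predict(ip) == address:
            correct += 1
        predictor.update(ip, address)
    return correct


# Regular accesses: constant stride of 8, as when stepping through a data array.
regular = [100, 108, 116, 124, 132]
# Irregular but repeating accesses: no constant offset between successive addresses.
irregular = [100, 340, 220, 500, 100, 340, 220, 500]

regular_hits = count_correct(StridePredictor(), 0x400, regular)
irregular_hits = count_correct(StridePredictor(), 0x400, irregular)
```

In this sketch the predictor is correct for every regular access once a stride has been observed, but it never predicts the irregular sequence correctly even though that sequence repeats exactly.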
In view of the foregoing, it can be appreciated that a substantial need exists for methods and apparatus which can advantageously perform correlated address prediction.
Embodiments of the present invention include apparatus and methods to perform correlated address prediction. A microprocessor can include a correlated address predictor that includes a first table memory and a second table memory. The first table memory can be populated by a plurality of buffer entries. Each buffer entry can include a first buffer field to store a first tag based on an instruction pointer and a second buffer field to store an address history. The second table memory can be populated by a plurality of link entries. Each link entry can include a first link field to store a link tag based on an address history and a second link field to store a predicted address. A first comparator can be in communication with the first table memory and an instruction pointer input. A second comparator can be in communication with the first table memory and the second table memory. An output can be in communication with the second table memory.
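The two-table arrangement summarized above can be modeled in software as a rough sketch. This is not the claimed hardware: the class `CorrelatedAddressPredictor`, the helper `count_correct`, the history length of two addresses, the use of the full instruction pointer and full address history as tags, and the update policy are all assumptions made for illustration. Dictionary key lookups stand in for the tag comparisons performed by the first and second comparators.

```python
from typing import Dict, Iterable, Optional, Tuple

History = Tuple[int, ...]


class CorrelatedAddressPredictor:
    """Software model of a buffer table (IP tag -> address history) and a
    link table (history tag -> predicted address)."""

    def __init__(self, history_length: int = 2) -> None:
        self.history_length = history_length          # assumed number of past addresses kept
        self.buffer_table: Dict[int, History] = {}    # first table: tag based on instruction pointer
        self.link_table: Dict[History, int] = {}      # second table: tag based on address history

    def predict(self, ip: int) -> Optional[int]:
        # First comparison: match the instruction pointer against the buffer entry tags.
        history = self.buffer_table.get(ip)
        if history is None or len(history) < self.history_length:
            return None
        # Second comparison: match the address history against the link entry tags.
        return self.link_table.get(history)

    def update(self, ip: int, actual_address: int) -> None:
        history = self.buffer_table.get(ip, ())
        if len(history) == self.history_length:
            # Record which address followed this history.
            self.link_table[history] = actual_address
        # Shift the new address into the history, keeping the most recent entries.
        self.buffer_table[ip] = (history + (actual_address,))[-self.history_length:]


def count_correct(predictor: CorrelatedAddressPredictor, ip: int, addresses: Iterable[int]) -> int:
    """Run one load (identified by ip) over a sequence of addresses and count correct predictions."""
    correct = 0
    for address in addresses:
        if predictor.predict(ip) == address:
            correct += 1
        predictor.update(ip, address)
    return correct


# Irregular but repeating address pattern with no constant stride.
pattern = [100, 340, 220, 500] * 3
correlated_hits = count_correct(CorrelatedAddressPredictor(), 0x400, pattern)
```

In this model the predictor needs the first pass through the pattern, plus two further accesses to rebuild a matching history, before it begins predicting correctly. After that it predicts each of the remaining addresses, which is the kind of repeating, non-strided sequence that a stride predictor cannot follow.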