A processor core typically includes an instruction fetch unit for generating fetch requests to retrieve instructions from an instruction cache (IC). When an instruction is available (i.e., a cache hit), the fetched instruction is typically stored in a fetch queue. When the instruction is not available (i.e., a cache miss), a memory request is usually generated and sent to a lower level of memory to retrieve the instruction. The pipeline may then stall until the cache miss is serviced and the instruction becomes available.
In recent processors, instruction fetch also involves a branch prediction unit (BPU). The processor core uses the current instruction pointer (IP) to access the BPU, which generates predictions for the branches that belong to the instruction fetch block associated with the current IP. The BPU's prediction granularity is N bytes (e.g., 32B). Based on the prediction outcomes, the BPU generates the next fetch IP, which is either the current IP + N bytes (if no branch is predicted taken) or the target address of a predicted-taken branch. This next IP becomes the current IP in the next cycle and is fed back to the BPU to generate the following IP.
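The next-IP selection described above can be sketched as follows. This is a minimal illustration, not a description of any particular BPU: the function name `next_fetch_ip`, the tuple layout of `predictions`, and the choice to ignore block alignment are all assumptions made for the example.

```python
from typing import Sequence, Tuple

N_BYTES = 32  # assumed prediction granularity (the 32B example from the text)


def next_fetch_ip(current_ip: int,
                  predictions: Sequence[Tuple[int, bool, int]],
                  n_bytes: int = N_BYTES) -> int:
    """Return the next fetch IP for the block starting at current_ip.

    predictions holds hypothetical (branch_ip, predicted_taken, target)
    tuples for branches in the current fetch block.
    """
    block_end = current_ip + n_bytes
    # Walk the branches in program order; the first predicted-taken branch
    # inside the block redirects the fetch to its target.
    for branch_ip, taken, target in sorted(predictions):
        if taken and current_ip <= branch_ip < block_end:
            return target
    # No branch predicted taken: fall through to the next sequential block.
    return block_end
```

With no taken branch, the call returns the sequential address `current_ip + N`; otherwise it returns the target of the earliest predicted-taken branch in the block.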
The instruction fetch unit (IFU) is composed of three units: 1) an instruction translation look-aside buffer (ITLB) that translates the current IP into a physical address, 2) the IC, which is accessed by the physical address and returns the corresponding instruction bytes, and 3) an instruction stream buffer (ISB), which may handle IC misses by temporarily storing the cache lines sent by the lower level memory (e.g., L2) before they are written into the IC. The IFU's fetch access may occur at M-byte granularity (e.g., 16B), which may be equal to or lower than the BPU's prediction bandwidth (N=M or N>M).
The IFU is a slave to the BPU and operates in a separate pipeline. The IFU's fetch follows the IPs that are generated by the BPU. If the BPU's prediction bandwidth is higher than the IFU's fetch bandwidth (e.g., N=32B vs. M=16B), a FIFO queue called a branch prediction queue (BPQ) bridges the bandwidth gap between the two pipelines. In this example, the BPU makes a 32B prediction every cycle and allocates up to two BPQ entries that contain the fetch IPs. The number of BPQ entries written per prediction is determined by N/M. The IFU reads one BPQ entry at a time, obtains the fetch IP, accesses the ITLB and IC sequentially, then sends the corresponding instruction bytes (e.g., 16B) down the pipeline for instruction decode.
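As a rough sketch of how one N-byte prediction could map onto BPQ entries, the hypothetical helper below splits a prediction block into M-byte fetch IPs. The function name, its parameters, and the rule that allocation stops at the chunk containing a predicted-taken branch (one possible reading of "up to two entries") are assumptions for illustration only.

```python
from typing import List, Optional


def bpq_fetch_ips(block_ip: int,
                  n_bytes: int = 32,
                  m_bytes: int = 16,
                  taken_branch_ip: Optional[int] = None) -> List[int]:
    """Return the fetch IPs to write into the BPQ for one prediction block."""
    fetch_ips = []
    # At most N/M entries are written per prediction.
    for i in range(n_bytes // m_bytes):
        chunk_ip = block_ip + i * m_bytes
        fetch_ips.append(chunk_ip)
        # Assumed rule: bytes after a predicted-taken branch are not fetched,
        # so stop once the chunk holding that branch has been queued.
        if (taken_branch_ip is not None
                and chunk_ip <= taken_branch_ip < chunk_ip + m_bytes):
            break
    return fetch_ips
```

For N=32B and M=16B, a block with no taken branch yields two entries, while a taken branch in the first 16 bytes yields only one.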
Because of the bandwidth mismatch (e.g., N>M) and possible stall conditions in the IFU (e.g., an IC miss), the BPU tends to run ahead of the IFU, and the BPQ tends to hold multiple valid entries that indicate where the IFU needs to fetch instructions from in the future. The BPQ may become full, which may stall the BPU's prediction pipeline until a BPQ entry becomes free. Meanwhile, the IFU will continue to consume the BPQ entries and send instruction bytes down the pipeline.
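The run-ahead behavior can be illustrated with a toy cycle-by-cycle model. The queue capacity, the ordering of BPU and IFU activity within a cycle, and the rule that the BPU stalls whenever it cannot allocate all of its entries are assumptions chosen to keep the sketch small; real hardware may differ on each point.

```python
from collections import deque
from typing import List


def simulate_bpq(cycles: int,
                 capacity: int = 4,
                 writes_per_cycle: int = 2) -> List[int]:
    """Return the cycle numbers in which the (modeled) BPU stalls."""
    bpq = deque()
    stall_cycles = []
    next_ip = 0
    for cycle in range(cycles):
        # BPU side: allocate N/M entries if there is room, else stall.
        if capacity - len(bpq) >= writes_per_cycle:
            for _ in range(writes_per_cycle):
                bpq.append(next_ip)
                next_ip += 16  # assumed M = 16B fetch granularity
        else:
            stall_cycles.append(cycle)
        # IFU side: consume one entry per cycle when one is available.
        if bpq:
            bpq.popleft()
    return stall_cycles
```

With two writes and one read per cycle, the queue fills and the modeled BPU begins to stall periodically; with matched bandwidth (one write per cycle) it never stalls in this model.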
A miss in the ITLB or the IC may leave the IFU unable to send the instruction bytes. An ITLB miss occurs when the ITLB cannot find a matching entry with a physical address corresponding to the current fetch IP. In this case, the IFU stalls and sends a request to the page miss handler (PMH). The IFU resumes fetching after the PMH returns the physical address. In a similar fashion, an IC miss occurs when the IC cannot find a matching entry with the instruction bytes corresponding to the current physical fetch address. In this case, the IFU stalls, allocates an ISB entry for the miss, and sends a fetch request to the lower level memory. The fetch resumes after the lower level memory returns the cache line to the ISB. The cache lines in the ISB will eventually be written into the IC, subject to restrictions related to inclusion handling and IC write port availability. The IFU is allowed to send the instruction bytes either directly from the ISB or from the IC after the ISB bytes are written back to the IC.
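The miss-handling sequence above can be sketched as a small behavioral model. The class `IFUModel`, its method names, the `Outcome` values, and the 4KB page and 64B line sizes are all hypothetical choices for illustration; the model tracks only which addresses are present, not timing or write-port restrictions.

```python
from enum import Enum

PAGE_BYTES = 4096  # assumed page size
LINE_BYTES = 64    # assumed cache line size


class Outcome(Enum):
    ITLB_MISS = "stall; request sent to PMH"
    IC_MISS = "stall; ISB entry allocated, request sent to lower level memory"
    ISB_HIT = "instruction bytes sent from ISB"
    IC_HIT = "instruction bytes sent from IC"


class IFUModel:
    def __init__(self):
        self.itlb = {}            # virtual page number -> physical page number
        self.ic = set()           # physical line addresses present in the IC
        self.isb_pending = set()  # ISB entries waiting for lower level memory
        self.isb_filled = set()   # ISB entries holding a returned cache line

    def fetch(self, ip):
        vpn, offset = divmod(ip, PAGE_BYTES)
        if vpn not in self.itlb:
            return Outcome.ITLB_MISS
        pa = self.itlb[vpn] * PAGE_BYTES + offset
        line = pa - pa % LINE_BYTES
        if line in self.ic:
            return Outcome.IC_HIT
        if line in self.isb_filled:
            return Outcome.ISB_HIT
        self.isb_pending.add(line)  # allocate an ISB entry for the miss
        return Outcome.IC_MISS

    def pmh_return(self, vpn, ppn):
        self.itlb[vpn] = ppn        # PMH supplies the translation

    def memory_return(self, line):
        if line in self.isb_pending:
            self.isb_pending.remove(line)
            self.isb_filled.add(line)

    def write_isb_to_ic(self, line):
        if line in self.isb_filled:
            self.isb_filled.remove(line)
            self.ic.add(line)
```

Fetching the same IP repeatedly walks through the stages: an ITLB miss until `pmh_return` is called, an IC miss until `memory_return` is called, then a hit served from the ISB, and finally a hit served from the IC once `write_isb_to_ic` has run.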
Such a stall may delay the execution of instructions and thus reduce the performance of the processor core. To improve performance, the IFU may generate speculative fetch requests to the lower level memory before the IFU encounters an actual miss, in an attempt to hide these delays. The speculative fetch requests could be wasteful if the matching cache line already exists in the IFU. Because the existence of the cache line is not known unless the IFU is looked up, a processor may use a mechanism to filter out unnecessary speculative fetch requests, or may access the unused read port while the IFU is stalled waiting for a prior miss to be serviced.
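One way to picture such filtering is the sketch below, which is an illustrative assumption rather than a mechanism described here. The hypothetical function takes the future fetch addresses (for example, those waiting in the BPQ) and a set `present_lines` standing in for whatever lookup or filter structure reports which lines the IFU already holds, and returns only the lines worth requesting.

```python
from typing import Iterable, List, Set


def speculative_requests(future_fetch_addrs: Iterable[int],
                         present_lines: Set[int],
                         line_bytes: int = 64) -> List[int]:
    """Return cache line addresses to request speculatively, in fetch order."""
    requests = []
    # Lines already held by the IFU, plus lines requested earlier in this pass.
    covered = set(present_lines)
    for addr in future_fetch_addrs:
        line = addr - addr % line_bytes  # align to the assumed 64B line size
        if line not in covered:
            covered.add(line)
            requests.append(line)
    return requests
```

In this sketch, addresses that fall in a line already present, or in a line already requested, produce no additional request, so only genuinely missing lines reach the lower level memory.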