1. Field of the Invention
The present invention generally relates to design structures, and more specifically to design structures for executing instructions in a processor. In particular, this application relates to minimizing pipeline stalls in a processor due to cache misses.
2. Description of the Related Art
Modern computer systems typically contain several integrated circuits (ICs), including a processor which may be used to process information in the computer system. The data processed by a processor may include computer instructions which are executed by the processor as well as data which is manipulated by the processor using the computer instructions. The computer instructions and data are typically stored in a main memory in the computer system.
Processors typically process instructions by executing the instruction in a series of small steps. In some cases, to increase the number of instructions being processed by the processor (and therefore increase the speed of the processor), the processor may be pipelined. Pipelining refers to providing separate stages in a processor where each stage performs one or more of the small steps necessary to execute an instruction. In some cases, the pipeline (in addition to other circuitry) may be placed in a portion of the processor referred to as the processor core. Some processors may have multiple processor cores, and in some cases, each processor core may have multiple pipelines. Where a processor core has multiple pipelines, groups of instructions (referred to as issue groups) may be issued to the multiple pipelines in parallel and executed by each of the pipelines in parallel.
As an example of executing instructions in a pipeline, when a first instruction is received, a first pipeline stage may process a small part of the instruction. When the first pipeline stage has finished processing the small part of the instruction, a second pipeline stage may begin processing another small part of the first instruction while the first pipeline stage receives and begins processing a small part of a second instruction. Thus, the processor may process two or more instructions at the same time (in parallel).
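By way of illustration only, the overlap described above may be sketched as follows. The sketch assumes an idealized pipeline in which every stage takes exactly one cycle and no stalls occur; the function names `pipeline_schedule` and `total_cycles` are hypothetical and do not correspond to any particular processor.

```python
def pipeline_schedule(num_instructions, num_stages):
    """Map each cycle to the (instruction, stage) pairs active in that cycle.

    Assumes an ideal pipeline: one cycle per stage, no stalls.
    """
    schedule = {}
    for instr in range(num_instructions):
        for stage in range(num_stages):
            # Instruction i occupies stage s during cycle i + s.
            cycle = instr + stage
            schedule.setdefault(cycle, []).append((instr, stage))
    return schedule


def total_cycles(num_instructions, num_stages, pipelined=True):
    """Cycles needed to finish all instructions, with or without pipelining."""
    if num_instructions == 0:
        return 0
    if pipelined:
        # The first instruction fills the pipeline; each later one finishes
        # one cycle after its predecessor.
        return num_stages + num_instructions - 1
    # Without pipelining, each instruction runs through every stage alone.
    return num_stages * num_instructions


if __name__ == "__main__":
    for cycle, active in sorted(pipeline_schedule(2, 2).items()):
        print(cycle, active)
```

In cycle 1 of this two-stage, two-instruction case, the first instruction occupies the second stage while the second instruction occupies the first stage, which is the parallelism the paragraph above describes.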
To provide for faster access to data and instructions as well as better utilization of the processor, the processor may have several caches. A cache is a memory which is typically smaller than the main memory and is typically manufactured on the same die (i.e., chip) as the processor. Modern processors typically have several levels of caches. The fastest cache, which is located closest to the core of the processor, is referred to as the Level 1 cache (L1 cache). In addition to the L1 cache, the processor typically has a second, larger cache, referred to as the Level 2 cache (L2 cache). In some cases, the processor may have additional cache levels (e.g., an L3 cache and an L4 cache).
To provide the processor with enough instructions to fill each stage of the processor's pipeline, the processor may retrieve instructions from the L2 cache in a group containing multiple instructions, referred to as an instruction line (I-line). The retrieved I-line may be placed in the L1 instruction cache (I-cache) where the core of the processor may access instructions in the I-line. Blocks of data (D-lines) to be processed by the processor may similarly be retrieved from the L2 cache and placed in the L1 data cache (D-cache).
The process of retrieving information from higher cache levels and placing the information in lower cache levels may be referred to as fetching, and typically requires a certain amount of time (latency). For instance, if the processor core requests information and the information is not in the L1 cache (referred to as a cache miss), the information may be fetched from the L2 cache. Each cache miss results in additional latency as the next cache/memory level is searched for the requested information. For example, if the requested information is not in the L2 cache, the processor may look for the information in an L3 cache or in main memory.
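As a minimal sketch of how latency may accumulate across cache levels, the following models each level as a set of cached addresses with an access latency. The function name `fetch`, the latency values, and the policy of filling every level that missed are assumptions chosen for illustration and are not taken from any particular processor.

```python
def fetch(address, cache_levels, memory_latency):
    """Search each cache level in order and return (source, total_latency).

    cache_levels is a list of (name, contents, latency) tuples ordered from
    the level closest to the core outward; contents is a set of addresses.
    """
    total_latency = 0
    missed_levels = []
    for name, contents, latency in cache_levels:
        total_latency += latency  # searching a level costs time even on a miss
        if address in contents:
            source = name
            break
        missed_levels.append(contents)
    else:
        # Every cache level missed, so the request goes to main memory.
        total_latency += memory_latency
        source = "main memory"
    # Assumed fill policy: place the information in every level that missed.
    for contents in missed_levels:
        contents.add(address)
    return source, total_latency


if __name__ == "__main__":
    l1, l2, l3 = set(), {0x40}, set()
    # Hypothetical latencies in cycles.
    levels = [("L1", l1, 2), ("L2", l2, 10), ("L3", l3, 40)]
    print(fetch(0x40, levels, memory_latency=200))  # L1 miss, found in L2
    print(fetch(0x40, levels, memory_latency=200))  # now an L1 hit
    print(fetch(0x80, levels, memory_latency=200))  # misses every cache level
```

With these assumed latencies, the first request costs 12 cycles, the repeated request costs 2 cycles, and the request that misses every level costs 252 cycles.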
In some cases, a processor may process instructions and data faster than the instructions and data are retrieved from the caches and/or memory. For example, where an instruction being executed in a pipeline attempts to access data which is not in the D-cache, pipeline stages may finish processing previous instructions while the processor is fetching a D-line which contains the data from higher levels of cache or memory. When the pipeline finishes processing the previous instructions while waiting for the appropriate D-line to be fetched, the pipeline may have no instructions left to process (referred to as a pipeline stall). When the pipeline stalls, the processor is underutilized and loses the benefit that a pipelined processor core provides.
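The cost of such stalls may be sketched roughly as follows, assuming a simplified in-order pipeline that issues one instruction per cycle and sits idle for a fixed penalty on every D-cache miss. The names `run_in_order` and `utilization` and the penalty value are hypothetical, and the model tracks individual addresses rather than whole D-lines.

```python
def run_in_order(data_addresses, d_cache, miss_penalty):
    """Return (busy_cycles, stall_cycles) for a sequence of instructions.

    data_addresses holds one entry per instruction: the data address it
    accesses, or None if the instruction does not access data.
    """
    busy_cycles = 0
    stall_cycles = 0
    for address in data_addresses:
        busy_cycles += 1  # one cycle of useful work per instruction
        if address is not None and address not in d_cache:
            # The pipeline has nothing to do until the data arrives.
            stall_cycles += miss_penalty
            d_cache.add(address)
    return busy_cycles, stall_cycles


def utilization(busy_cycles, stall_cycles):
    """Fraction of cycles in which the pipeline did useful work."""
    total = busy_cycles + stall_cycles
    return busy_cycles / total if total else 0.0


if __name__ == "__main__":
    busy, stalled = run_in_order([None, 0x100, None, 0x100], set(), miss_penalty=10)
    print(busy, stalled, utilization(busy, stalled))
```

In this example a single miss among four instructions leaves the pipeline idle for 10 of 14 cycles, which illustrates how quickly utilization may fall when misses are not hidden.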
Because the address of the desired data may not be known until the instruction is executed, the processor may not be able to search for the desired D-line until the instruction is executed. However, some processors may attempt to prevent such cache misses by fetching a block of D-lines which contain data addresses near (contiguous to) a data address which is currently being accessed. Fetching nearby D-lines relies on the assumption that when a data address in a D-line is accessed, nearby data addresses will likely be accessed as well (this concept is generally referred to as locality of reference). However, in some cases, the assumption may prove incorrect, such that an instruction accesses data in a D-line which is not located near the current D-line, thereby resulting in a cache miss and processor inefficiency.
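The effect of fetching contiguous D-lines may be illustrated with the following sketch, which counts misses for a sequence of data addresses. It assumes a 64-byte D-line, an unbounded cache, and prefetched lines that arrive before they are needed; `LINE_SIZE` and `count_misses` are hypothetical names introduced only for this example.

```python
LINE_SIZE = 64  # assumed D-line size in bytes


def count_misses(addresses, prefetch_lines=1):
    """Count D-cache misses when each access also fetches the next D-lines.

    prefetch_lines is the number of contiguous following D-lines fetched
    alongside the line being accessed (0 disables prefetching).
    """
    cached_lines = set()
    misses = 0
    for address in addresses:
        line = address // LINE_SIZE
        if line not in cached_lines:
            misses += 1
            cached_lines.add(line)
        # Fetch the contiguous lines that follow the one being accessed.
        for offset in range(1, prefetch_lines + 1):
            cached_lines.add(line + offset)
    return misses


if __name__ == "__main__":
    sequential = list(range(0, LINE_SIZE * 8, 8))  # walks 8 adjacent D-lines
    scattered = [0, LINE_SIZE * 100, LINE_SIZE * 7, LINE_SIZE * 50]
    print(count_misses(sequential, prefetch_lines=0))
    print(count_misses(sequential, prefetch_lines=1))
    print(count_misses(scattered, prefetch_lines=1))
```

Under these assumptions the sequential walk misses only on its first D-line when the next line is prefetched, whereas every access in the scattered pattern still misses because none of the accessed D-lines is contiguous to a previously accessed one.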
Accordingly, there is a need for improved methods and apparatuses for executing instructions and retrieving data in a processor which utilizes cached memory.