1. Field of the Invention
The present invention generally relates to executing instructions in a processor. Specifically, this application is related to minimizing stalls in a processor due to load-store conflicts.
2. Description of the Related Art
Modern computer systems typically contain several integrated circuits (ICs), including a processor which may be used to process information in the computer system. The data processed by a processor may include computer instructions which are executed by the processor as well as data which is manipulated by the processor using the computer instructions. The computer instructions and data are typically stored in a main memory in the computer system.
Processors typically process instructions by executing each instruction in a series of small steps. In some cases, to increase the number of instructions being processed by the processor (and therefore increase the speed of the processor), the processor may be pipelined. Pipelining refers to providing separate stages in a processor where each stage performs one or more of the small steps necessary to execute an instruction. In some cases, the pipeline (in addition to other circuitry) may be placed in a portion of the processor referred to as the processor core. Some processors may have multiple processor cores, and in some cases, each processor core may have multiple pipelines. Where a processor core has multiple pipelines, groups of instructions (referred to as issue groups) may be issued to the multiple pipelines in parallel and executed by each of the pipelines in parallel.
As an example of executing instructions in a pipeline, when a first instruction is received, a first pipeline stage may process a small part of the instruction. When the first pipeline stage has finished processing the small part of the instruction, a second pipeline stage may begin processing another small part of the first instruction while the first pipeline stage receives and begins processing a small part of a second instruction. Thus, the processor may process two or more instructions at the same time (in parallel).
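As an illustration only, the overlap described above may be sketched in Python. The function names `pipeline_schedule` and `total_cycles` are hypothetical, and the sketch assumes an idealized pipeline in which one instruction enters per cycle and no stalls occur.

```python
def pipeline_schedule(num_instructions, num_stages):
    """Return {cycle: [(instruction, stage), ...]} for an idealized pipeline.

    Assumption: one instruction enters the first stage per cycle, and every
    stage takes exactly one cycle with no stalls.
    """
    schedule = {}
    for instr in range(num_instructions):
        for stage in range(num_stages):
            cycle = instr + stage  # instruction i occupies stage s at cycle i + s
            schedule.setdefault(cycle, []).append((instr, stage))
    return schedule


def total_cycles(num_instructions, num_stages, pipelined=True):
    """Cycles needed to finish all instructions, with or without overlap."""
    if num_instructions == 0:
        return 0
    if pipelined:
        # The first instruction fills the pipeline; each later one adds a cycle.
        return num_stages + num_instructions - 1
    return num_stages * num_instructions


if __name__ == "__main__":
    for cycle, work in sorted(pipeline_schedule(3, 2).items()):
        print(cycle, work)
```

In cycle 1 of a two-stage pipeline, the first instruction occupies the second stage while the second instruction occupies the first stage, which corresponds to the parallel processing described above.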
To provide for faster access to data and instructions as well as better utilization of the processor, the processor may have several caches. A cache is a memory which is typically smaller than the main memory and is typically manufactured on the same die (i.e., chip) as the processor. Modern processors typically have several levels of caches. The fastest cache, which is located closest to the core of the processor, is referred to as the Level 1 cache (L1 cache). In addition to the L1 cache, the processor typically has a second, larger cache, referred to as the Level 2 cache (L2 cache). In some cases, the processor may have other, additional cache levels (e.g., an L3 cache and an L4 cache).
Processors typically provide load and store instructions to access information located in the caches and/or main memory. A load instruction may include a memory address (provided directly in the instruction or using an address register) and identify a target register (Rt). When the load instruction is executed, data stored at the memory address may be retrieved (e.g., from a cache, from main memory, or from other storage means) and placed in the target register identified by Rt. Similarly, a store instruction may include a memory address and a source register (Rs). When the store instruction is executed, data from Rs may be written to the memory address. Typically, load instructions and store instructions utilize data cached in the L1 cache.
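The load and store semantics described above may be modeled, purely as a sketch, as follows. The class name `Machine` and its attributes are hypothetical, and the sketch assumes a flat memory with no caches, in which an unwritten address reads as zero.

```python
class Machine:
    """Toy model of registers and memory (hypothetical, no caches)."""

    def __init__(self):
        self.registers = {}  # register name -> value
        self.memory = {}     # memory address -> value

    def store(self, rs, address):
        """Write the value held in source register Rs to the memory address."""
        self.memory[address] = self.registers[rs]

    def load(self, rt, address):
        """Place the data at the memory address into target register Rt.

        Assumption: an address that was never written reads as 0.
        """
        self.registers[rt] = self.memory.get(address, 0)


if __name__ == "__main__":
    m = Machine()
    m.registers["r1"] = 42
    m.store("r1", 0x100)   # store: Rs = r1
    m.load("r2", 0x100)    # load:  Rt = r2
    print(m.registers["r2"])
```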
In some cases, when a store instruction is executed, the data being stored may not immediately be placed in the L1 cache. For example, after the store instruction begins execution in a pipeline, it may take several processor cycles for the store instruction to complete execution in the pipeline. As another example, the data being stored may be placed in a store queue before being written back to the L1 cache. The store queue may be used for several reasons. For example, multiple store instructions may be executed in the processor pipeline faster than the stored data is written back to the L1 cache. The store queue may hold the results for the multiple store instructions and thereby allow the slower L1 cache to later store the results of the store instructions and “catch up” with the faster processor pipeline. The time necessary to update the L1 cache with the results of the store instruction may be referred to as the “latency” of the store instruction.
Where data from a store instruction is not immediately available in the L1 cache due to latency, certain instruction combinations may result in execution errors. For example, a store instruction may be executed which stores data to a memory address. As described above, the stored data may not be immediately available in the L1 cache. If a load instruction which loads data from the same memory address is executed shortly after the store instruction, the load instruction may receive data from the L1 cache before the L1 cache is updated with the results of the store instruction.
Thus, the load instruction may receive data which is incorrect or “stale” (e.g., older data from the L1 cache which should be replaced with the results of the previously executed store instruction). Where a load instruction loads data from the same address as a previously executed store instruction, the load instruction may be referred to as a dependent load instruction (the data received by the load instruction is dependent on the data stored by the store instruction). Where a dependent load instruction receives incorrect data from a cache as a result of the latency of a store instruction, the resulting execution error may be referred to as a load-store conflict.
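The stale read described above may be reproduced with a minimal sketch of an L1 cache fronted by a store queue. The class name `L1WithStoreQueue` and the parameter `drain_delay` are hypothetical; the sketch assumes a fixed store latency in cycles and a load that reads only the L1 cache without checking the queue.

```python
from collections import deque


class L1WithStoreQueue:
    """Toy L1 cache whose stores pass through a queue (hypothetical model)."""

    def __init__(self, drain_delay):
        self.l1 = {}            # address -> value currently visible in L1
        self.queue = deque()    # entries of (ready_cycle, address, value)
        self.cycle = 0
        self.drain_delay = drain_delay  # assumed fixed store latency in cycles

    def store(self, address, value):
        """Queue the store; it reaches L1 only after drain_delay cycles."""
        self.queue.append((self.cycle + self.drain_delay, address, value))

    def tick(self):
        """Advance one cycle and write back any stores that are now ready."""
        self.cycle += 1
        while self.queue and self.queue[0][0] <= self.cycle:
            _, address, value = self.queue.popleft()
            self.l1[address] = value

    def load(self, address):
        """Read L1 only; queued stores are deliberately not consulted."""
        return self.l1.get(address)


if __name__ == "__main__":
    cache = L1WithStoreQueue(drain_delay=3)
    cache.l1[0x40] = 1          # older data already in L1
    cache.store(0x40, 2)        # store to the same address
    cache.tick()
    print(cache.load(0x40))     # dependent load issued too early
    cache.tick()
    cache.tick()
    print(cache.load(0x40))     # load issued after the store latency
```

In this sketch, the first load returns the older value because the store has not yet left the queue, while the second load, issued after the assumed latency has elapsed, returns the stored value.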
Because the dependent load instruction may have received incorrect data, subsequently issued instructions which use the incorrectly loaded data may also be executed improperly and reach incorrect results. To detect such an error, the memory address of the load instruction may be compared to the memory address of the store instruction. Where the memory addresses are the same, the load-store conflict may be detected. However, because the memory address of the load instruction may not be known until after the execution of the load instruction, the load-store conflict may not be detected until after the load instruction has been executed.
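The address comparison described above may be sketched as follows. The function name `find_conflicting_store` and the representation of pending stores as a list of (address, value) pairs are assumptions made for illustration, not a description of any particular hardware comparator.

```python
def find_conflicting_store(load_address, pending_stores):
    """Return the youngest pending store to load_address, or None.

    pending_stores is assumed to be a list of (address, value) pairs,
    ordered oldest first, for stores that have not yet updated the L1 cache.
    """
    for address, value in reversed(pending_stores):
        if address == load_address:
            return (address, value)
    return None


if __name__ == "__main__":
    pending = [(0x10, 7), (0x20, 8), (0x10, 9)]
    print(find_conflicting_store(0x10, pending))  # same address: conflict
    print(find_conflicting_store(0x30, pending))  # no matching address
```

Because the comparison requires the load address, a check of this kind can only run once that address is known, which is the timing limitation noted above.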
Thus, to resolve the detected error, the executed load instruction and the subsequently issued instructions may be flushed from the pipeline (e.g., the results of the load instruction and subsequently executed instructions may be discarded) and each of the flushed instructions may be reissued and re-executed in the pipeline. While the load instruction and subsequently issued instructions are being invalidated and reissued, the L1 cache may be updated with the data stored by the store instruction. When the reissued load instruction is executed the second time, the load instruction may then receive the correctly updated data from the L1 cache.
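The flush-and-reissue sequence described above may be sketched as follows. The function name `execute_with_replay` is hypothetical, subsequently issued instructions are modeled as simple Python callables applied to the loaded value, and the sketch assumes a single pending store that reaches the L1 cache by the time the load is reissued.

```python
def execute_with_replay(l1, pending_store, load_address, dependents):
    """Execute a load and its dependents, replaying once on a conflict.

    l1            -- dict modeling the L1 cache (address -> value)
    pending_store -- (address, value) of a store not yet written to L1
    dependents    -- callables standing in for subsequently issued instructions
    Returns (results, attempts).
    """
    store_address, store_value = pending_store
    attempts = 1

    # First attempt: the load reads whatever L1 currently holds.
    loaded = l1.get(load_address)
    results = [op(loaded) for op in dependents]

    # The load address is known only now, so the comparison happens afterwards.
    conflict = (load_address == store_address)

    # Assumption: the queued store reaches L1 while the flush is handled.
    l1[store_address] = store_value

    if conflict:
        # Flush: discard the first results, then reissue and re-execute.
        attempts += 1
        loaded = l1[load_address]
        results = [op(loaded) for op in dependents]
    return results, attempts


if __name__ == "__main__":
    ops = [lambda x: x + 1, lambda x: x * 2]
    print(execute_with_replay({0x10: 5}, (0x10, 9), 0x10, ops))
    print(execute_with_replay({0x10: 5}, (0x20, 9), 0x10, ops))
```

In this sketch, the conflicting case executes the load and its dependents twice, and the work done in the first attempt is discarded.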
Executing, invalidating, and reissuing the load instruction and subsequently executed instructions after a load-store conflict may take many processor cycles. Because the initial results of the load instruction and subsequently issued instructions are invalidated, the time spent executing the instructions is essentially wasted. Thus, load-store conflicts typically result in processor inefficiency.
Accordingly, there is a need for improved methods of executing load and store instructions.