1. Technical Field
The present application relates generally to an improved data processing system. More specifically, the present application is directed to an apparatus and method for handling data cache misses out-of-order for asynchronous pipelines.
2. Description of Related Art
Most modern computing systems make use of caches to help speed up data transfers and instruction execution. These temporary caches serve as staging areas, and their contents are constantly changing. A memory cache is a memory bank that bridges main memory and the processor of a microprocessor chip. The memory cache is faster than main memory and allows instructions to be executed and data to be read and written at higher speed.
Instructions and data are transferred from main memory to the cache in blocks, using a look-ahead algorithm. The more sequential the instructions in the routine being executed, or the more sequential the data being read or written, the greater the chance that the next required item will already be in the cache, resulting in better performance.
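The effect of block transfers on sequential access can be illustrated with a minimal sketch. The function name `hit_rate`, the block size, the cache capacity, and the least-recently-used replacement policy are all assumptions chosen for illustration; the sketch models block transfer only and does not model any particular look-ahead algorithm.

```python
from collections import OrderedDict


def hit_rate(addresses, block_size=8, num_blocks=4):
    """Return the fraction of accesses that hit in a small LRU cache.

    Hypothetical model: a miss brings in the whole block that contains
    the requested address, evicting the least recently used block.
    """
    cache = OrderedDict()  # block number -> True, ordered by recency
    hits = 0
    for addr in addresses:
        block = addr // block_size
        if block in cache:
            hits += 1
            cache.move_to_end(block)       # mark as most recently used
        else:
            cache[block] = True            # fetch the whole block
            if len(cache) > num_blocks:
                cache.popitem(last=False)  # evict least recently used
    return hits / len(addresses)


# Sequential addresses reuse each fetched block seven more times.
sequential = list(range(64))
# A stride equal to the block size touches a new block on every access.
strided = [(i * 8) % 64 for i in range(64)]

print(hit_rate(sequential))  # 0.875
print(hit_rate(strided))     # 0.0
```

Under these assumptions, the sequential stream misses once per block and hits on the remaining seven accesses, while the strided stream cycles through more blocks than the cache can hold and never hits.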
A level 1 (L1) cache is a memory bank built into the microprocessor chip. Also known as the “primary cache,” an L1 cache is the memory closest to the processor. A level 2 (L2) cache is a secondary staging area that feeds the L1 cache. Increasing the size of the L2 cache may speed up some applications but have no effect on others. The L2 cache may be built into the microprocessor chip, reside on a separate chip in a multi-chip package module, or be a separate bank of chips on the motherboard, for example. Caches are typically static RAM (SRAM), while main memory is generally some variety of dynamic RAM (DRAM).
In addition to caching of data and instructions, many modern computing systems make use of pipelines for performing simultaneous, or parallel, processing. Operations are overlapped by moving data and/or instructions into a conceptual pipe with all stages of the pipe processing simultaneously. For example, while one instruction is being executed, the computer is decoding the next instruction. In vector processors, several steps in a floating point operation can be processed simultaneously.
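The overlap of pipeline stages can be sketched as a cycle-by-cycle schedule. The function name `pipeline_schedule` and the three stage names are hypothetical, and the sketch assumes an ideal pipeline in which every stage takes one cycle and no stalls occur.

```python
def pipeline_schedule(instructions, stages=("fetch", "decode", "execute")):
    """Map each cycle to the (instruction, stage) pairs active in it.

    Assumes an ideal pipeline: one instruction enters per cycle, each
    stage takes exactly one cycle, and there are no stalls or flushes.
    """
    schedule = {}
    for i, instr in enumerate(instructions):
        for s, stage in enumerate(stages):
            # Instruction i reaches stage s in cycle i + s.
            schedule.setdefault(i + s, []).append((instr, stage))
    return schedule


sched = pipeline_schedule(["i0", "i1", "i2"])
for cycle in sorted(sched):
    print(cycle, sched[cycle])
```

In cycle 1 of this schedule, instruction `i0` is being decoded while `i1` is being fetched. Three instructions complete in five cycles rather than the nine that would be needed if each instruction finished all three stages before the next one started.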
Microprocessors and pipelines may be either in-order or out-of-order. In-order microprocessors or pipelines process instructions and data in the order in which they are dispatched. Out-of-order microprocessors or pipelines may process the instructions and data in a different order from the order in which they are dispatched. An out-of-order execution architecture takes code that was written and compiled to be executed in a specific order, reschedules the sequence of instructions, if possible, so as to make maximum use of processor resources, executes them, and then arranges them back in their original order so that the results can be written out to memory. To the user, the execution appears as if an ordered, sequential stream of instructions went into the processor and an identically ordered, sequential stream of computational results emerged. Only the processor knows in what order the program's instructions were actually executed.
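The distinction between the order in which instructions complete and the order in which their results become visible can be sketched as follows. The function name `execute_out_of_order` and the latency values are hypothetical. The sketch assumes that the instructions are mutually independent and all begin executing in the same cycle, which is a simplification of a real out-of-order architecture.

```python
def execute_out_of_order(program, latency):
    """Return (completion_order, retire_order) for a list of instructions.

    program: instruction names in original program order.
    latency: mapping from instruction name to cycles needed to finish.
    Assumes independent instructions that all start in cycle 0.
    """
    # Shorter instructions finish first; ties keep program order.
    completion_order = sorted(
        program, key=lambda name: (latency[name], program.index(name))
    )
    retire_order = []
    done = set()
    next_to_retire = 0
    for name in completion_order:
        done.add(name)
        # Results are written out only in original program order.
        while next_to_retire < len(program) and program[next_to_retire] in done:
            retire_order.append(program[next_to_retire])
            next_to_retire += 1
    return completion_order, retire_order


program = ["load", "add", "mul"]
completed, retired = execute_out_of_order(
    program, {"load": 5, "add": 1, "mul": 3}
)
print(completed)  # ['add', 'mul', 'load']
print(retired)    # ['load', 'add', 'mul']
```

Although the load finishes last in this example, the results are retired in the original program order, so the reordering is not visible outside the processor.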
Complexity arises in an in-order microprocessor when an L1 data cache miss is encountered, e.g., in response to execution of a load instruction in the pipeline. Because the in-order microprocessor requires the instructions and data to be processed in-order, most in-order microprocessors flush the instructions younger than the missed load right away. That is, any instructions that were placed in the pipeline after the missed load instruction are not executed by the pipeline, because it is assumed that these instructions are dependent upon the missed load instruction or may otherwise modify the data associated with the missed load instruction.
Alternatively, some in-order microprocessors wait to flush the instructions and data in the pipeline until a dependency upon the load instruction that missed is encountered. This approach performs better because it allows non-dependent instructions younger than the missed load instruction to execute even though there is an older outstanding instruction, i.e., the missed load instruction, which must be executed again later. This leads to out-of-order behavior in an in-order processor because the missed load instruction must be reissued when the data is present in the L1 data cache, effectively out-of-order in relation to the rest of the program flow.
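The two miss-handling approaches described above can be compared with a minimal sketch. The names `Instr`, `flush_immediately`, and `flush_on_dependency` are hypothetical, and a dependency is modeled only as a younger instruction reading the destination register of the missed load. Other hazards, such as a younger instruction modifying the data associated with the missed load, are not modeled.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Instr:
    name: str
    dest: str          # register written by the instruction
    srcs: tuple = ()   # registers read by the instruction


def flush_immediately(pipeline, miss_index):
    """Flush every instruction younger than the missed load."""
    executed = [i.name for i in pipeline[:miss_index]]
    flushed = [i.name for i in pipeline[miss_index + 1:]]
    return executed, flushed


def flush_on_dependency(pipeline, miss_index):
    """Let younger instructions run until one reads the load's result."""
    missed = pipeline[miss_index]
    executed = [i.name for i in pipeline[:miss_index]]
    younger = pipeline[miss_index + 1:]
    for offset, instr in enumerate(younger):
        if missed.dest in instr.srcs:
            # First dependent instruction: flush it and everything after it.
            executed += [i.name for i in younger[:offset]]
            return executed, [i.name for i in younger[offset:]]
    return executed + [i.name for i in younger], []


pipeline = [
    Instr("load r1", "r1", ("r9",)),      # this load misses the L1 cache
    Instr("add r2", "r2", ("r3", "r4")),  # independent of r1
    Instr("sub r5", "r5", ("r1",)),       # reads r1, so it depends on the load
    Instr("or r6", "r6", ("r7",)),        # younger than the dependent sub
]
print(flush_immediately(pipeline, 0))    # ([], ['add r2', 'sub r5', 'or r6'])
print(flush_on_dependency(pipeline, 0))  # (['add r2'], ['sub r5', 'or r6'])
```

In both cases the missed load itself is neither executed nor flushed and must be reissued once the data is present in the L1 data cache. Under the second approach, the independent `add r2` completes ahead of the reissued load, which is the out-of-order behavior noted above.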
Further complexity arises when there are multiple pipelines that a load instruction must travel through, and the pipelines are asynchronous to each other. Such a scenario may exist when the address generation and the cache access are done by a first pipeline, while the placing of data into the architected register is done by a second pipeline that is asynchronous to the first pipeline. Additional complexities arise when exceptions occur, sometimes very late, which may flush a load instruction in one of the asynchronous pipelines.