1. Technical Field
The present invention relates to an improved data processing system and, in particular, to a system and method in a multi-threaded processor for maintaining a best-case demand redispatch of an instruction.
2. Description of Related Art
Microprocessor chips generally comprise various hardware elements, including processing cores and cache memory. A processing core may include multiple sub-elements such as one or more floating point units, load/store units, instruction sequencing units, fixed point execution units, instruction fetch units, instruction dispatch units, branch execution units, and possibly other sub-elements. The terms “processing core” and “processor” are used interchangeably herein to refer to the same device.
Pipelining is a technique used in advanced microprocessors in which the microprocessor begins executing a second instruction before the first has completed. That is, several instructions are in the pipeline simultaneously, each at a different processing stage. The pipeline is divided into segments, and each segment can execute its operation concurrently with the other segments. When a segment completes an operation, it passes the result to the next segment in the pipeline and fetches the next operation from the preceding segment. The final results of each instruction emerge at the end of the pipeline in rapid succession. This arrangement allows all the segments to work in parallel, thus giving greater throughput than if each input had to pass through the whole pipeline before the next input could enter. The costs are greater latency and added complexity, because the segments must be synchronized so that different inputs do not interfere with one another. The pipeline works at full efficiency only if it can be filled and emptied at the same rate at which it processes.
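The throughput benefit described above can be illustrated with a minimal sketch. It assumes an idealized pipeline in which every segment takes exactly one cycle and no stalls occur; the function names and the example values are hypothetical and chosen only for illustration.

```python
def unpipelined_cycles(num_instructions: int, num_stages: int) -> int:
    """Each instruction passes through every stage before the next one starts."""
    return num_instructions * num_stages


def pipelined_cycles(num_instructions: int, num_stages: int) -> int:
    """The first instruction needs num_stages cycles to fill the pipeline;
    after that, one instruction completes per cycle (no stalls assumed)."""
    if num_instructions == 0:
        return 0
    return num_stages + (num_instructions - 1)


if __name__ == "__main__":
    n, k = 100, 5  # illustrative values only
    print(unpipelined_cycles(n, k))  # 500
    print(pipelined_cycles(n, k))    # 104
```

Under these assumptions, any single instruction still takes all five cycles to pass through the pipeline, but the stream as a whole finishes in far fewer cycles.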
Cache memory, which is typically composed of high-speed static RAM (SRAM) devices, is used for holding instructions and/or data that are likely to be accessed in the near term by the processor. Because processors have increased in operating speed at a much faster pace than RAM as technology has progressed, a processor without adequate system design could waste much of its time waiting to obtain instructions or data from memory rather than performing calculations. SRAM, one of the fastest types of RAM, is generally too expensive to use for all of a system's memory needs. As a compromise, computers generally come with a relatively small amount of SRAM that is used as cache memory. If an instruction or data item stored in the cache is required again, the computer can access the cache for it rather than having to access the relatively slower dynamic RAM (DRAM) of main memory. Because the cache is smaller and faster than main memory, the time to find and retrieve information is reduced and the CPU spends less time waiting for more information.
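The effect of a cache on memory access time can be estimated with a simple average-cost calculation. The sketch below is only a model under stated assumptions: every access pays the cache lookup, a miss additionally pays the DRAM access, and the latencies and miss rate shown are hypothetical values, not figures from any particular system.

```python
def average_access_cycles(miss_rate: float, cache_cycles: float, dram_cycles: float) -> float:
    """Average cycles per memory access.

    Every access pays the cache lookup; a miss additionally pays the
    (much slower) DRAM access.
    """
    if not 0.0 <= miss_rate <= 1.0:
        raise ValueError("miss_rate must be between 0 and 1")
    return cache_cycles + miss_rate * dram_cycles


if __name__ == "__main__":
    # Hypothetical latencies: 2-cycle cache, 100-cycle DRAM.
    print(average_access_cycles(0.0, 2, 100))  # every access hits the cache
    print(average_access_cycles(1.0, 2, 100))  # every access goes to DRAM
```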
Many processors have two types of cache memory: level 1 and level 2. Level 1 (L1) cache has a very fast access time and is typically embedded as part of the processor device itself. Level 2 (L2) cache is typically situated near, but separate from, the processor and has an interconnecting bus to the processor. Modern processors may also have both L1 and L2 caches integrated into their core or chip. Some systems also have a separate instruction cache and data cache.
Lookahead execution is another technique for improving the overall performance of a data processing system. Lookahead execution offers a solution to the performance problem caused by long stalls due to cache misses and translation misses. Load/store performance in an in-order machine can be improved by speculatively continuing to execute instructions in a “lookahead mode” during stalled periods in order to generate addresses that will be needed in the L1 cache and translation mechanism. This helps ensure that needed data is available when the stall period ends and normal dispatch resumes, avoiding additional stalls. Lookahead mode functions somewhat like a very exact prefetch mechanism, which allows the load/store unit to make use of processor cycles that would otherwise go unused. For example, if a demand instruction misses the L1 cache, rather than stall the pipeline and wait for the data to return, the processor core will place a mark at the rejected instruction and then continue executing the instruction stream beyond that point while waiting for the data to return for the demand instruction. While this lookahead execution takes place, the caches and prediction arrays may be updated, but the architected facilities cannot be. Lookahead execution may thus greatly reduce future penalties resulting from L1 misses and surprise branches.
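The benefit of lookahead mode can be sketched with a toy model of an in-order stream of loads. This is an illustration under strong assumptions, not a description of any actual hardware: the cache has unlimited capacity, a demand miss costs a fixed number of cycles, one later instruction is examined per stall cycle, and every line requested during lookahead is assumed to have arrived by the time the stall ends. The function name `total_cycles` and its parameters are hypothetical.

```python
from typing import Sequence


def total_cycles(load_addresses: Sequence[int], miss_latency: int, lookahead: bool) -> int:
    """Cycles needed to execute a stream of loads in program order."""
    if miss_latency < 1:
        raise ValueError("miss_latency must be at least 1")
    cached = set()  # addresses present in the L1 cache (toy: unlimited capacity)
    cycles = 0
    for i, addr in enumerate(load_addresses):
        if addr in cached:
            cycles += 1  # L1 hit
            continue
        # Demand miss: the instruction is rejected and waits for its data.
        cycles += miss_latency
        cached.add(addr)
        if lookahead:
            # While stalled, run ahead speculatively, one instruction per
            # stall cycle. Only the cache is updated; no architected state
            # changes, and the instructions are re-executed after the stall.
            for future in load_addresses[i + 1 : i + 1 + miss_latency]:
                cached.add(future)
    return cycles


if __name__ == "__main__":
    stream = [0, 1, 2, 3]  # four loads to distinct lines
    print(total_cycles(stream, 10, lookahead=False))  # 40
    print(total_cycles(stream, 10, lookahead=True))   # 13
```

In this model, the first miss still pays the full penalty, but the loads behind it hit in the cache once normal dispatch resumes.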
The L2 cache may also improve performance by sending a speculative data-coming indication to the processor core several cycles before the data for a fetch request is actually available to the core. Existing systems use this data-coming indication to allow the processor core to prepare itself so that the stalled instruction, which caused the fetch and cannot complete until the data is available, can use the data without further delay. However, a problem encountered when the lookahead execution feature is employed is that, when a restart is generated from the L2 cache's data-coming warning, the instruction dispatch unit cannot deliver the stalled instruction back to the load/store unit (LSU) quickly enough to use the fetch data as soon as it is available to the processor core. Failure to deliver the stalled instruction quickly enough would result in a performance-degrading cycle gap between when the fetch data is available in the core and when the demand instruction actually uses that data. This degradation would be a direct result of implementing the performance-increasing lookahead feature.
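The cycle gap described above can be expressed as a simple calculation. The sketch below assumes that the restart is triggered at the moment the data-coming indication arrives and that redispatching the stalled instruction to the LSU takes a fixed number of cycles; both parameter names and the example values are hypothetical.

```python
def redispatch_gap(warning_cycles: int, redispatch_cycles: int) -> int:
    """Cycles the fetch data sits unused in the core.

    warning_cycles:    how many cycles before data arrival the L2 cache
                       sends its data-coming indication (assumed value).
    redispatch_cycles: cycles from restart until the stalled instruction
                       is back in the load/store unit (assumed value).
    """
    if warning_cycles < 0 or redispatch_cycles < 0:
        raise ValueError("cycle counts must be non-negative")
    # No gap if the instruction arrives at or before the data does.
    return max(0, redispatch_cycles - warning_cycles)


if __name__ == "__main__":
    print(redispatch_gap(warning_cycles=3, redispatch_cycles=5))  # 2
    print(redispatch_gap(warning_cycles=5, redispatch_cycles=3))  # 0
```

A best-case redispatch corresponds to a gap of zero, meaning the stalled instruction reaches the LSU no later than its data.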
Therefore, it would be advantageous to have a method and system for maintaining a best-case demand redispatch of an instruction to allow for maximizing the time a rejected thread may execute in lookahead execution mode while maintaining the smallest L1 miss penalty supported by the cache hierarchy design.