1. Field of the Invention
The present invention relates in general to microprocessors and, more particularly, to a system, method, and microprocessor architecture providing a cache throttle in a non-blocking hierarchical cache.
2. Relevant Background
Modern processors, also called microprocessors, use techniques including pipelining, superpipelining, superscaling, speculative instruction execution, and out-of-order instruction execution to enable multiple instructions to be issued and executed each clock cycle. As used herein the term processor includes complex instruction set computers (CISC), reduced instruction set computers (RISC) and hybrids. The ability of processors to execute instructions has, however, typically outpaced the ability of memory subsystems to supply instructions and data to the processors. Most processors use a cache memory system to speed memory access.
Cache memory comprises one or more levels of dedicated high-speed memory holding recently accessed data, designed to speed up subsequent access to the same data. Cache technology is based on a premise that programs frequently re-execute the same instructions. When data is read from main system memory, a copy is also saved in the cache memory, along with an index to the associated main memory. The cache then monitors subsequent requests for data to see if the information needed has already been stored in the cache. If the data had indeed been stored in the cache, the data is delivered immediately to the processor while the attempt to fetch the information from main memory is aborted (or not started). If, on the other hand, the data had not been previously stored in cache then it is fetched directly from main memory and also saved in cache for future access.
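By way of illustration only, the hit and miss behavior described above can be sketched as a toy direct-mapped cache. The class name, the line count, and the modulo mapping of addresses to lines are assumptions chosen for brevity and are not taken from any particular processor.

```python
class DirectMappedCache:
    """Toy direct-mapped cache holding one tag and one data word per line."""

    def __init__(self, num_lines, main_memory):
        self.num_lines = num_lines
        self.memory = main_memory        # backing store: a list of words
        self.tags = [None] * num_lines   # None marks an empty line
        self.data = [None] * num_lines
        self.hits = 0
        self.misses = 0

    def read(self, address):
        index = address % self.num_lines   # line the address maps to (assumed mapping)
        tag = address // self.num_lines    # identifies the main memory address
        if self.tags[index] == tag:
            # Hit: deliver the cached copy; main memory is not accessed.
            self.hits += 1
            return self.data[index]
        # Miss: fetch from main memory and save a copy for future access.
        self.misses += 1
        value = self.memory[address]
        self.tags[index] = tag
        self.data[index] = value
        return value


memory = [10 * a for a in range(16)]
cache = DirectMappedCache(num_lines=4, main_memory=memory)
first = cache.read(5)    # miss: line 1 is empty
second = cache.read(5)   # hit: line 1 now holds address 5
third = cache.read(9)    # miss: address 9 also maps to line 1 and evicts address 5
```

In this sketch the stored tag plays the role of the index to the associated main memory location, and the third access shows that a later fill can displace an earlier copy.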
Typically, processors support multiple cache levels, most often two or three levels of cache. A level 1 cache (L1 cache) is usually an internal cache built onto the same monolithic integrated circuit (IC) as the processor itself. On-chip cache is the fastest (i.e., lowest latency) because it is accessed by the internal components of the processor. Also, latency to on-chip cache is usually predictable. On the other hand, off-chip cache is an external cache of static random access memory (SRAM) chips plugged into a motherboard. Off-chip cache has much higher latency than on-chip cache, although its latency is typically much lower than that of accesses to main memory.
Modern processors pipeline memory operations to allow a second load operation to enter a load/store stage in an execution pipeline before a first load/store operation has passed completely through the execution pipeline. Typically, a cache memory that loads data to a register or stores data from the register is outside of the execution pipeline. When an instruction or operation is passing through the load/store pipeline stage, the cache memory is accessed. If valid data is in the cache at the correct address a "hit" is generated and the data is loaded into the registers from the cache. When requested data is not in the cache, a "miss" is generated and the data must be fetched from a higher cache level or main memory.
In a hierarchical cache system valid data may reside in any of a number of cache levels. Cache accesses generated by, for example, memory operations executing in the processor will have a variable latency depending on whether the data that is the target of the memory operation resides in cache, and if so, what is the lowest cache level that holds a valid copy of the data. Hence, any functional unit in the processor (e.g., arithmetic logic unit, instruction execution unit, and the like) cannot predict when operand data will be available or when the instruction will complete execution until the data is known to reside in the lowest cache level.
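As a rough sketch of why this latency is variable, the model below probes each cache level in turn and accumulates the cost of every level visited. The level names, the cycle counts, and the assumption that levels are probed sequentially are illustrative only.

```python
# Hypothetical per-level latencies in clock cycles (illustrative values only).
LEVELS = [("L1", 2), ("L2", 10), ("L3", 40)]
MAIN_MEMORY_LATENCY = 200


def access_latency(address, contents):
    """Return (source, cycles) for a load of `address`.

    `contents` maps a level name to the set of addresses for which that
    level holds a valid copy.
    """
    cycles = 0
    for name, latency in LEVELS:
        cycles += latency  # every level probed adds its own latency
        if address in contents.get(name, set()):
            return name, cycles
    return "memory", cycles + MAIN_MEMORY_LATENCY


contents = {
    "L1": {0x100},
    "L2": {0x100, 0x200},
    "L3": {0x100, 0x200, 0x300},
}
in_l1 = access_latency(0x100, contents)
in_l2 = access_latency(0x200, contents)
in_l3 = access_latency(0x300, contents)
in_memory = access_latency(0x400, contents)
```

Under these assumed numbers the same load instruction completes in 2, 12, 52, or 252 cycles depending solely on where the data resides, which is information a functional unit does not have when the instruction issues.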
One method of handling this uncertainty is to delay instruction execution until all data required by that instruction is known to exist in the lowest cache level. This can be accomplished by stalling an execution pipeline until the memory operation is complete, then executing the instruction requiring the results of the memory operation. However, stalling the execution pipeline reduces overall processor performance. Because the execution pipeline includes a plurality of instructions at any given time, a pipeline stall results in a time penalty to all instructions currently in the pipeline when a stall occurs. Further, a pipeline stall can propagate to other pipelines due to data dependencies between instructions executing in different pipelines.
Another difficulty in hierarchical caches arises in filling a lower cache level with data returned from a higher cache level. To execute a line fill operation, a cache being filled must receive data and an address identifying the cache line that should hold the data. In conventional cache systems, the higher cache level returns both the data and the address. In the case of blocking cache designs, the returned data presents little difficulty because the higher cache level has absolute control over the address of the lower cache until the fill operation is complete. However, blocking caches stall the memory pipeline until the line fill operation completes and so slow memory access and overall processor performance.
Non-blocking caches enable the cache to process subsequent memory operations while a miss is being serviced by higher cache levels. While this speeds overall memory access, the non-blocking cache receives address signals from both the lower level devices generating cache access requests and higher cache levels trying to return data from cache misses. The higher level cache must now arbitrate for control over the lower level cache. In this sense, the cache fill operation intrudes on the lower level cache's ability to service memory access requests. This arbitration increases cache complexity, slows memory access, and is inconsistent with high frequency design.
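The intrusive effect of a fill can be illustrated with a simple cycle-by-cycle model. The sketch assumes a lower level cache with a single address port and assumes that a returning fill always wins arbitration for that port; both are simplifying assumptions made for illustration and are not features of any specific design.

```python
from collections import deque


def arbitrate(request_cycles, fill_cycles):
    """Return the cycle at which each access request is serviced.

    request_cycles: cycles at which the processor presents an access.
    fill_cycles: cycles at which a higher cache level returns fill data.
    Exactly one user may drive the (assumed single) address port per cycle.
    """
    pending = deque(sorted(request_cycles))
    fills = set(fill_cycles)
    serviced = []
    cycle = 0
    while pending:
        if cycle in fills:
            pass  # the fill owns the port; waiting requests are held off
        elif pending[0] <= cycle:
            pending.popleft()
            serviced.append(cycle)
        cycle += 1
    return serviced


without_fills = arbitrate([0, 1, 2, 3], [])
with_fills = arbitrate([0, 1, 2, 3], [1, 2])
```

With no fills, the four requests are serviced in cycles 0 through 3. With fills returning in cycles 1 and 2, the last three requests slip to cycles 3, 4, and 5, so each fill directly delays unrelated memory accesses.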
What is needed is an architecture and a method for operating a processor that non-intrusively handles cache fills and load misses.
In an in-order processor, instructions are executed in the order in which the instructions appear in the program code. Each instruction may generate results that are used by subsequent instructions. For example, a memory operation that loads data from memory into a register must be completed before an arithmetic operation that uses the data in the register can be executed to generate correct results. In these cases, the subsequent instruction is referred to as "dependent" or as having one or more dependencies on precedent instructions. A dependency is resolved when the precedent instructions have completed and their results are available to the dependent instruction. In-order processors execute instructions in program-defined order and so will not execute an instruction until all prior instructions on which it depends have completed.
Greater parallelism and higher performance are achieved by "out-of-order" processors that include multiple pipelines in which instructions are processed in parallel in an efficient order that takes advantage of opportunities for parallel processing that may be provided by the instruction code. Dependencies in an out-of-order processor must be tracked to prevent execution of a dependent instruction before completion of precedent instructions that resolve the dependencies in the dependent instruction. However, the processor cannot predict with certainty when the dependencies will be resolved due to the variable latencies of a hierarchical cache. Prior processors handle this problem by delaying execution of dependent instructions until the dependencies are resolved. This solution requires implementation of a feedback mechanism to inform the instruction scheduling portion of the processor when the dependency is resolved. Such feedback mechanisms are difficult to implement in a manner consistent with processors issuing and executing multiple instructions per cycle using a high frequency clock. Moreover, this solution also does not scale well. As attempts to execute more instructions in parallel are made, the number of resources required to track dependencies increases out of proportion to the number of resources used to actually execute instructions.
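The dependency tracking described above can be sketched as a table recording when each register value becomes available. The function below is a minimal illustration that assumes unlimited issue width and register renaming (so that only true data dependencies matter); the instruction names, register names, and latencies are hypothetical.

```python
def schedule(program, latencies):
    """Return a dict mapping each instruction name to its issue cycle.

    program: list of (name, dest_register, source_registers) in program order.
    latencies: dict mapping instruction name to its execution latency in cycles.
    """
    ready_at = {}  # register -> cycle at which its value becomes available
    issue = {}
    for name, dest, sources in program:
        # An instruction may issue only once every source has been produced;
        # registers never written in this program are ready at cycle 0.
        start = max((ready_at.get(src, 0) for src in sources), default=0)
        issue[name] = start
        ready_at[dest] = start + latencies[name]
    return issue


program = [
    ("load", "r1", []),      # r1 <- memory
    ("add", "r2", ["r1"]),   # depends on the load
    ("mul", "r3", ["r4"]),   # independent of the load
]
cache_hit = schedule(program, {"load": 2, "add": 1, "mul": 3})
cache_miss = schedule(program, {"load": 50, "add": 1, "mul": 3})
```

The independent multiply issues at cycle 0 in both cases, while the dependent add issues at cycle 2 on a hit and at cycle 50 on a miss. In hardware the `ready_at` entry for `r1` cannot be filled in ahead of time, which is why a feedback path from the memory system to the scheduler is needed in this style of design.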
A need exists for a processor that executes dependent instructions in an efficient manner consistent with high clock frequencies.