1. Field of the Invention
The present invention relates in general to microprocessors and, more particularly, to a system, method, and microprocessor architecture providing a cache throttle in a non-blocking hierarchical cache.
2. Relevant Background
Modern processors, also called microprocessors, use techniques including pipelining, superpipelining, superscalar execution, speculative instruction execution, and out-of-order instruction execution to enable multiple instructions to be issued and executed each clock cycle. As used herein, the term processor includes complex instruction set computers (CISC), reduced instruction set computers (RISC), and hybrids. However, the ability of processors to execute instructions has typically outpaced the ability of memory subsystems to supply instructions and data to the processors. Most processors use a cache memory system to speed memory access.
Cache memory comprises one or more levels of dedicated high-speed memory holding recently accessed data, designed to speed up subsequent access to the same data. Cache technology is based on the premise that programs frequently re-execute the same instructions and re-access the same data. When data is read from main system memory, a copy is also saved in the cache memory, along with an index to the associated main memory address. The cache then monitors subsequent requests for data to see if the information needed has already been stored in the cache. If the data has indeed been stored in the cache, it is delivered immediately to the processor and the attempt to fetch the information from main memory is aborted (or not started). If, on the other hand, the data has not been previously stored in the cache, it is fetched directly from main memory and also saved in the cache for future access.
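By way of illustration only, the following C sketch models the hit/miss behavior described above for a simple direct-mapped cache. The names (cache_read, main_memory_read, NUM_LINES) and the cache organization are hypothetical and chosen solely for exposition, not drawn from any particular design.

    #include <stdbool.h>
    #include <stdint.h>

    #define NUM_LINES 256              /* illustrative: 256 direct-mapped lines */

    /* One cache line: a validity flag, the tag of the cached address, and data. */
    typedef struct {
        bool     valid;
        uint32_t tag;
        uint32_t data;
    } cache_line_t;

    static cache_line_t cache[NUM_LINES];

    extern uint32_t main_memory_read(uint32_t addr);  /* stand-in for a slow fetch */

    /* Return the word at addr: a hit is serviced from the cache; a miss is
     * fetched from main memory, and a copy is saved for future access. */
    uint32_t cache_read(uint32_t addr)
    {
        uint32_t word  = addr / sizeof(uint32_t);     /* word-aligned accesses */
        uint32_t index = word % NUM_LINES;
        uint32_t tag   = word / NUM_LINES;
        cache_line_t *line = &cache[index];

        if (line->valid && line->tag == tag)
            return line->data;                        /* hit: deliver immediately */

        uint32_t value = main_memory_read(addr);      /* miss: fetch from memory */
        line->valid = true;                           /* ...and save a copy      */
        line->tag   = tag;
        line->data  = value;
        return value;
    }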
Modern processors support multiple cache levels, most often two or three levels of cache. A level 1 cache (L1 cache) is usually an internal cache built onto the same monolithic IC as the processor itself. On-chip cache is the fastest (i.e., has the lowest latency) because it is accessed directly by the internal components of the processor. Off-chip cache, on the other hand, is an external cache of static random access memory (SRAM) chips plugged into a motherboard. Off-chip cache has much higher latency than on-chip cache, although its latency is typically still much lower than that of accesses to main memory.
Modern processors pipeline memory operations to allow a second load operation to enter a load/store stage in the execution pipeline before a first load/store operation has passed completely through the execution pipeline. Typically, a cache memory that loads data to a register or stores data from the register is outside of the execution pipeline. When an instruction or operation is passing through the load/store pipeline stage, the cache memory is accessed. If valid data is in the cache at the correct address, a "hit" is generated and the data is loaded into the registers from the cache. When requested data is not in the cache, a "miss" is generated and the data must be fetched from a higher cache level or main memory. The latency (i.e., the time required to return data after a load address is applied to the load/store pipeline) of higher cache levels and main memory is significantly greater than the latency of lower cache levels.
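The hit/miss cascade and the latency spread between levels can be sketched in C as follows. The cache_level structure, the probe/fill callbacks, and the cycle counts are assumptions made for exposition only, not a description of any actual implementation.

    #include <stdbool.h>
    #include <stddef.h>
    #include <stdint.h>

    typedef struct cache_level cache_level_t;
    struct cache_level {
        const char    *name;       /* "L1", "L2", ...                  */
        int            latency;    /* cycles to answer from this level */
        cache_level_t *next;       /* next higher level; NULL = memory */
        bool (*probe)(cache_level_t *self, uint32_t addr, uint32_t *data);
        void (*fill)(cache_level_t *self, uint32_t addr, uint32_t data);
    };

    enum { MEM_LATENCY = 100 };                   /* illustrative cycle count */
    extern uint32_t main_memory_read(uint32_t addr);

    /* Walk the hierarchy from the lowest level upward, accumulating latency.
     * A miss at one level becomes an access at the next higher level, and
     * the returned data triggers a line fill on the way back down. */
    uint32_t hierarchy_read(cache_level_t *level, uint32_t addr, int *cycles)
    {
        if (level == NULL) {                      /* reached main memory */
            *cycles += MEM_LATENCY;
            return main_memory_read(addr);
        }
        *cycles += level->latency;
        uint32_t data;
        if (level->probe(level, addr, &data))
            return data;                          /* hit at this level   */
        data = hierarchy_read(level->next, addr, cycles);  /* miss: go up */
        level->fill(level, addr, data);           /* line fill on return */
        return data;
    }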
The instruction execution units in the execution pipeline cannot predict how long it will take to fetch the data into the operand registers specified by a particular load operation. Processors typically handle this uncertainty by stalling the execution pipeline until the fetched data is returned. This stalling is inconsistent with high-speed, multiple-instruction-per-cycle processing.
In a pipelined hierarchical cache system that generates multiple cache accesses per clock cycle, coordinating data traffic is problematic. A cache line fill operation, for example, needs to be synchronized with the return data, but the lower level cache executing the line fill operation cannot predict when the required data will be returned. One method of handling this uncertainty in prior designs is to use a "blocking" cache that prohibits or blocks cache activity until a miss has been serviced by a higher cache level or main memory and the line fill operation has completed. A blocking cache stalls the memory pipeline, slowing memory access and reducing overall processor performance.
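A blocking design of this kind can be sketched as follows in C; cache_probe, start_line_fill, and line_fill_done are hypothetical helpers standing in for the cache's hit/miss and fill machinery.

    #include <stdbool.h>
    #include <stdint.h>

    extern bool cache_probe(uint32_t addr, uint32_t *data);  /* hit test       */
    extern void start_line_fill(uint32_t addr);              /* begin fill     */
    extern bool line_fill_done(uint32_t addr);               /* fill complete? */

    /* Blocking behavior: on a miss, all cache activity is prohibited until
     * the higher level services the miss and the line fill completes. */
    uint32_t blocking_read(uint32_t addr)
    {
        uint32_t data;
        if (cache_probe(addr, &data))
            return data;              /* hit: no stall */

        start_line_fill(addr);
        while (!line_fill_done(addr))
            ;                         /* stall: no other access may proceed */

        cache_probe(addr, &data);     /* fill complete; data now present */
        return data;
    }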
On the other hand, where one or more levels are non-blocking, each cache level is unaware of the results of the accesses (i.e., hit or miss) and of the resources available at the next higher level of the hierarchy. In a non-blocking cache, a cache miss launches a line fill operation that will eventually be serviced; in the meantime, however, the cache continues to accept load/store requests from lower cache levels or registers. To complete cache operations such as a line fill after a miss in a non-blocking cache, each cache level must compete with adjacent levels for attention. This requires that data operations arbitrate with each other for the resources necessary to complete an operation. Arbitration slows cache and hence processor performance.
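A minimal sketch of the non-blocking behavior follows, assuming a small table of outstanding-miss slots; the slot count and names such as launch_line_fill are illustrative only.

    #include <stdbool.h>
    #include <stdint.h>

    #define MAX_PENDING 8    /* illustrative number of outstanding-miss slots */

    /* A pending line fill: the missing address plus an in-use flag. */
    typedef struct {
        uint32_t addr;
        bool     busy;
    } pending_fill_t;

    static pending_fill_t pending[MAX_PENDING];

    /* Record a new outstanding miss. Unlike the blocking case, the cache
     * keeps accepting load/store requests after the fill is launched. */
    bool launch_line_fill(uint32_t miss_addr)
    {
        for (int i = 0; i < MAX_PENDING; i++) {
            if (!pending[i].busy) {
                pending[i].busy = true;
                pending[i].addr = miss_addr;
                return true;           /* fill launched; cache stays available */
            }
        }
        return false;                  /* no slot free: request must wait */
    }

    /* Called when return data arrives: match it to the outstanding miss so
     * the line fill can be synchronized with the returned data. */
    void complete_line_fill(uint32_t addr)
    {
        for (int i = 0; i < MAX_PENDING; i++) {
            if (pending[i].busy && pending[i].addr == addr) {
                pending[i].busy = false;   /* slot freed for a future miss */
                return;
            }
        }
    }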
Prior non-blocking cache designs include circuitry to track resources in the next higher cache level. This resource tracking is used to prevent a cache level from accessing the higher level when the higher level does not have sufficient resources to track and service the access. This control is typically implemented as one or more counters in each cache level that track the available resources in the adjacent level. When the resources are depleted, the cache level stalls until resources become available. This type of resource tracking is slow to respond because the tracking circuitry must wait, often several clock cycles, to determine whether an access request resulted in a hit or a miss before it can count the resources used to service a cache miss.
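The counter-based tracking described above might be sketched as follows; the initial credit count and names such as try_send_to_next_level are hypothetical.

    #include <stdbool.h>

    static int credits = 4;   /* assumed count of free slots in the next level */

    /* Before sending a request upward, the lower level consults its counter
     * and stalls (returns false) when the tracked resources are depleted. */
    bool try_send_to_next_level(void)
    {
        if (credits == 0)
            return false;     /* resources depleted: stall this level */
        credits--;            /* consume one tracked resource */
        return true;
    }

    /* The counter is restored only when the higher level signals completion,
     * often several clock cycles after the hit/miss outcome is known; this
     * delayed feedback is why the scheme is slow to respond. */
    void on_next_level_completion(void)
    {
        credits++;
    }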
What is needed is a cache architecture and a method for operating a cache subsystem that controls a hierarchical non-blocking cache and is compatible with high-speed processing and memory access.