1. Field of the Invention
The present invention relates in general to microprocessors and, more particularly, to a system, method, and microprocessor architecture providing a cache throttle in a non-blocking hierarchical cache.
2. Relevant Background
Modern processors, also called microprocessors, use techniques including pipelining, superpipelining, superscaling, speculative instruction execution, and out-of-order instruction execution to enable multiple instructions to be issued and executed each clock cycle. As used herein the term processor includes complex instruction set computers (CISC), reduced instruction set computers (RISC) and hybrids. However, the ability of processors to execute instructions has typically outpaced the ability of memory subsystems to supply instructions and data to the processors. Most processors use a cache memory system to speed memory access.
Cache memory comprises one or more levels of dedicated high-speed memory holding recently accessed data, designed to speed up subsequent access to the same data. Cache technology is based on a premise that programs frequently re-execute the same instructions. When data is read from main system memory, a copy is also saved in the cache memory, along with an index to the associated main memory. The cache then monitors subsequent requests for data to see if the information needed has already been stored in the cache. If the data had indeed been stored in the cache, the data is delivered immediately to the processor while the attempt to fetch the information from main memory is aborted (or not started). If, on the other hand, the data had not been previously stored in cache then it is fetched directly from main memory and also saved in cache for future access.
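The hit and miss behavior described above can be illustrated with a minimal sketch. The class name ReadThroughCache, the dictionary standing in for main system memory, and the hit and miss counters are assumptions made for illustration only and do not describe any particular hardware.

```python
class ReadThroughCache:
    """Illustrative model: a small cache in front of a slower main memory."""

    def __init__(self, main_memory):
        self.main_memory = main_memory  # dict: address -> data (stand-in for main memory)
        self.lines = {}                 # address -> cached copy of the data
        self.hits = 0
        self.misses = 0

    def read(self, address):
        # Hit: the data was stored earlier, so main memory is not accessed.
        if address in self.lines:
            self.hits += 1
            return self.lines[address]
        # Miss: fetch from main memory and keep a copy for future accesses.
        self.misses += 1
        data = self.main_memory[address]
        self.lines[address] = data
        return data
```

In this sketch, a first read of an address counts as a miss and a repeated read of the same address counts as a hit.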
Typically, processors support multiple cache levels, most often two or three levels of cache. A level 1 cache (L1 cache) is usually an internal cache built onto the same monolithic integrated circuit (IC) as the processor itself. On-chip cache is the fastest (i.e., lowest latency) because it is accessed by the internal components of the processor. On the other hand, off-chip cache is an external cache of static random access memory (SRAM) chips plugged into a motherboard. Off-chip cache has much higher latency than on-chip cache, although its latency is typically much shorter than that of accesses to main memory.
Given the size disparity between main system memory (which may be tens or hundreds of megabytes) and cache memory (which is typically less than one megabyte), certain rules are used to determine how to copy data from main memory to cache as well as how to make room for new data when a cache is full. In a direct mapped cache, the cache location for a given memory address is determined from the middle address bits. In other words, each main memory address maps to exactly one location in the cache. Because main memory is much larger than the cache, a number of different memory addresses will map to the same cache location. In a fully associative cache, data from any main memory address can be stored in any cache location. Each cache line is indexed by a "tag store" that holds a "tag" generated, for example, by hashing the memory address that it indexes. All tags are compared simultaneously (i.e., associatively) with a requested address, and if one tag matches, then its associated data is accessed. This requires an associative memory to hold the tags, which makes this form of cache expensive.
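The direct mapped placement rule can be sketched as a simple split of the address into fields. The geometry used here (32-byte lines, 256 lines) and the names LINE_BYTES, NUM_LINES and split_address are assumptions chosen for illustration. The tag is taken directly from the upper address bits rather than computed by hashing.

```python
LINE_BYTES = 32    # assumed line size in bytes (power of two)
NUM_LINES = 256    # assumed number of lines in the cache (power of two)

OFFSET_BITS = LINE_BYTES.bit_length() - 1  # low bits select a byte within a line
INDEX_BITS = NUM_LINES.bit_length() - 1    # middle bits select the cache line


def split_address(address):
    """Return (tag, index, offset) for a direct mapped cache."""
    offset = address & (LINE_BYTES - 1)
    index = (address >> OFFSET_BITS) & (NUM_LINES - 1)
    tag = address >> (OFFSET_BITS + INDEX_BITS)
    return tag, index, offset
```

Under these assumptions, two addresses that differ by a multiple of LINE_BYTES * NUM_LINES (8192 bytes) yield the same index and different tags, which is the case in which different memory addresses compete for one cache location.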
Set associative cache is essentially a compromise between direct mapped cache and a fully associative cache. In a set associative cache, each memory address is mapped to a certain set of cache locations. An N-way set associative cache allows each address to map to N cache locations (for example, four-way set associative allows each address to map to four different cache locations). In other words, in a four-way set associative cache, each tag maps to four possible cache locations in a set. Upper address bits in the requested address will uniquely identify which item in the set the tag is referencing.
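A set associative lookup can be sketched in a similar way. The class name SetAssociativeCache, the default geometry, and the first-in first-out replacement rule are assumptions for illustration. The text above does not specify a replacement policy.

```python
class SetAssociativeCache:
    """Illustrative N-way set associative tag store (tags only, no data)."""

    def __init__(self, num_sets=64, ways=4, line_bytes=32):
        self.num_sets = num_sets
        self.ways = ways
        self.line_bytes = line_bytes
        # Each set holds at most `ways` tags, oldest first.
        self.sets = [[] for _ in range(num_sets)]

    def _locate(self, address):
        line_number = address // self.line_bytes
        set_index = line_number % self.num_sets  # which set the address maps to
        tag = line_number // self.num_sets       # upper bits identify the line
        return set_index, tag

    def access(self, address):
        """Return True on a hit, False on a miss (the line is then allocated)."""
        set_index, tag = self._locate(address)
        tags = self.sets[set_index]
        if tag in tags:
            return True
        if len(tags) == self.ways:
            tags.pop(0)  # assumed FIFO replacement: evict the oldest tag
        tags.append(tag)
        return False
```

With the default geometry, four addresses spaced 2048 bytes apart share one set and can reside in the cache together, whereas a fifth such address forces one of them out.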
Modern processors pipeline memory operations to allow a second load/store operation to enter a load/store stage in an execution pipeline before a first load/store operation has passed completely through the execution pipeline. Typically, a cache memory that loads data to a register or stores data from the register is outside of the execution pipeline. When an instruction or operation is passing through the load/store pipeline stage, the cache memory is accessed. If valid data is in the cache at the correct address, a "hit" is generated and the data is loaded into the registers from the cache. When requested data is not in the cache, a "miss" is generated and the data must be fetched from a higher cache level or main memory. The latency (i.e., the time required to return data after a load address is applied to the load/store pipeline) of higher cache levels and main memory is significantly greater than the latency of lower cache levels.
In a pipelined hierarchical cache system that generates multiple cache accesses per clock cycle, coordinating data traffic is problematic. A cache line fill operation, for example, needs to be synchronized with the return data, but the lower level cache executing the line fill operation cannot predict when the required data will be returned. As a result, the cache may "thrash". When a first access to a given cache line results in a miss, the access is sent on to be serviced by a higher cache level or main memory. When the first access is filled, the cache line becomes valid. In typical cache structures, after the cache line becomes valid it is forwarded to a lower cache level or device that generated the first access. A thrash occurs when a second access to the same cache line reaches the cache before the valid data is forwarded to a lower cache level. The second access can overwrite the valid first data, thereby preventing the first data access from being serviced. In some cases, this results in the first access being repeated, thereby invalidating the original second access. Forward progress is prevented as the first and second accesses overwrite each other. Thrashing is complicated in a set-associative cache design because multiple in-flight references can be mapped to the same tag entry.
One method of handling thrashing in prior designs is to use a "blocking" cache that prohibits or blocks cache activity until a miss has been serviced by a higher cache level or main memory and the line fill operation has completed. In this case, the second access is stalled until the first access is complete, and the second access (to the same cache line) will hit in the cache. However, a blocking cache stalls the memory pipeline, slowing memory access and reducing overall processor performance.
On the other hand, where one or more levels are non-blocking, each cache level is unaware of the results of the accesses (i.e., hit or miss) or the resources available at the next higher level of the hierarchy. In a non-blocking cache, a cache miss launches a line fill operation that will eventually be serviced; however, the cache continues to accept load/store requests from lower cache levels or functional units in a processor. To prevent thrashing, prior designs include a "transit bit" for each cache entry, usually implemented in the cache tag. The transit bit is set while an access is "in flight" (i.e., after being sent up to a higher cache level or main memory, but before the data has returned to fill and validate the cache line).
Using the transit bit, a second access to the same cache line can detect when a thrash would occur, and either find another tag against which to reference this access (if available) or stall the processor until a tag becomes available. Finding another tag has the effect of allocating a second cache line to hold the data returned for the second access, thereby preventing thrashing. When the processor is stalled, memory access is slowed and overall processor performance is reduced.
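The prior transit bit approach can be sketched for a single set as follows. The names TagEntry and TransitBitSet, the string results "hit", "miss" and "stall", and the choice to stall an access to a line whose own fill is still pending are assumptions made for illustration. The sketch is a simplified model and does not reproduce any specific prior design.

```python
class TagEntry:
    """One tag entry with a valid bit and a transit bit."""

    def __init__(self):
        self.tag = None
        self.valid = False
        self.in_transit = False  # set while a line fill is in flight


class TransitBitSet:
    """Illustrative model of one cache set protected by transit bits."""

    def __init__(self, ways=2):
        self.entries = [TagEntry() for _ in range(ways)]

    def access(self, tag):
        for entry in self.entries:
            if entry.tag == tag and entry.valid:
                return "hit"
            if entry.tag == tag and entry.in_transit:
                return "stall"  # assumption: wait for the pending fill
        # Miss: only entries that are not in transit may be reallocated.
        candidates = [e for e in self.entries if not e.in_transit]
        if not candidates:
            return "stall"  # every tag is in flight, so no tag is available
        # Prefer an empty entry; otherwise replace a valid, settled one.
        victim = next((e for e in candidates if not e.valid), candidates[0])
        victim.tag, victim.valid, victim.in_transit = tag, False, True
        return "miss"

    def fill(self, tag):
        """Return data arrives: validate the line and clear the transit bit."""
        for entry in self.entries:
            if entry.tag == tag and entry.in_transit:
                entry.valid, entry.in_transit = True, False
                return
        raise KeyError(tag)
```

In this model, a second miss to the set is directed to a different tag entry while the first fill is in flight, and a third miss stalls until a fill completes and a tag becomes available.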
What is needed is a cache architecture and a method for operating a cache subsystem that tolerates or inhibits thrashing in a hierarchical non-blocking cache and is compatible with high speed processing and memory access.