1. Field of the Invention
This invention relates in general to cache memory, and more particularly, to hierarchical cache memory designs using multiple levels of non-blocking caches having distributed control in a microprocessor.
2. Relevant Background
To improve overall performance, processors use techniques including pipelining, superscalar execution, speculative instruction execution, and out-of-order instruction issue to enable multiple instructions to be issued and executed each clock cycle. As used herein the term processor includes complex instruction set computers (CISC), reduced instruction set computers (RISC), and hybrids thereof.
However, the ability of processors to execute instructions has typically outpaced the ability of memory subsystems to supply instructions and data to the processors. Most processors therefore use a cache memory system to speed memory access.
Cache memory comprises one or more levels of dedicated high-speed memory holding recently accessed data, designed to speed up subsequent access to the same data. Cache technology is based on the premise that programs frequently re-execute the same instructions. When data is read from main memory, a copy is also saved in the cache, along with an index to the associated main memory location. The cache then monitors subsequent requests for data to see if the information needed has already been stored in the cache. If the data has indeed been stored in the cache (i.e., a "hit"), the data is delivered immediately to the processor and the attempt to fetch the information from main memory is aborted (or not started). If, on the other hand, the data has not previously been stored in the cache (i.e., a "miss"), then it is fetched directly from main memory and also saved in the cache for future access.
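The hit and miss behavior described above can be illustrated with a minimal sketch. It is a toy model rather than any particular hardware design: the class name SimpleCache, the dictionary standing in for main memory, and the example addresses are all assumptions made for illustration.

```python
class SimpleCache:
    """Toy cache in front of a dict that stands in for main memory (illustrative only)."""

    def __init__(self, main_memory):
        self.main_memory = main_memory  # address -> data
        self.lines = {}                 # cached copies, keyed by address
        self.hits = 0
        self.misses = 0

    def read(self, address):
        if address in self.lines:       # hit: serve the copy held in the cache
            self.hits += 1
            return self.lines[address]
        self.misses += 1                # miss: fetch from main memory...
        data = self.main_memory[address]
        self.lines[address] = data      # ...and keep a copy for future accesses
        return data


memory = {0x100: "a", 0x104: "b"}       # hypothetical memory contents
cache = SimpleCache(memory)
first = cache.read(0x100)               # miss, filled from main memory
second = cache.read(0x100)              # hit, served from the cache
```

The second read of the same address is served without touching main memory, which is the behavior the cache relies on to reduce average access time.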
Typically, processors support multiple cache levels, most often two or three levels of cache. A level 1 cache (L1 cache or L1$) is usually an internal cache built onto the same monolithic integrated circuit (IC) as the processor itself. On-chip cache is the fastest (i.e., lowest latency) because it is accessed by the internal components of the processor. On the other hand, off-chip cache is an external cache of static random access memory (SRAM) chips plugged into a motherboard. Off-chip cache has much higher latency than on-chip cache, although its latency is typically much shorter than that of accesses to main memory.
Given the size and access time disparity between main system memory (which may, for example, be hundreds of thousands of megabytes) and cache memory (which can be, for example, a few megabytes), certain rules are used to determine how to copy data from main memory to cache as well as how to make room for new data when a cache is full. In a direct mapped cache, the cache location for a given memory address is determined from the middle address bits. In other words, each main memory address maps to exactly one location in the cache. Because the cache is much smaller than main memory, a number of different memory addresses will map to the same cache location. In a fully associative cache, data from any main memory address can be stored in any cache location. Each cache line is indexed by a "tag store" that holds a "tag" generated, for example, by hashing the memory address that it indexes. All tags are compared simultaneously (i.e., associatively) with a requested address, and if one tag matches, then its associated data is accessed. This requires an associative memory to hold the tags, which makes this form of cache expensive.
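As a sketch of the direct mapped case, the following splits an address into tag, index, and offset fields. The line size of 32 bytes and the count of 256 lines are assumed values chosen only for illustration, and split_address is a hypothetical helper name.

```python
LINE_BYTES = 32   # assumed cache line size in bytes
NUM_LINES = 256   # assumed number of lines in the direct mapped cache

OFFSET_BITS = LINE_BYTES.bit_length() - 1  # 5 low bits select a byte within the line
INDEX_BITS = NUM_LINES.bit_length() - 1    # 8 middle bits select the cache line


def split_address(address):
    """Return (tag, index, offset) for a direct mapped cache with the sizes above."""
    offset = address & (LINE_BYTES - 1)
    index = (address >> OFFSET_BITS) & (NUM_LINES - 1)
    tag = address >> (OFFSET_BITS + INDEX_BITS)
    return tag, index, offset


a = split_address(0x12345)  # (9, 26, 5)
b = split_address(0x32345)  # (25, 26, 5): same index, different tag
```

The two example addresses select the same cache line and differ only in their tags, so in a direct mapped cache they would compete for the same location.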
Set associative cache is essentially a compromise between direct mapped cache and a fully associative cache. In a set associative cache, each memory address is mapped to a certain set of cache locations. An N-way set associative cache allows each address to map to N cache locations (for example, four-way set associative allows each address to map to four different cache locations). In other words, in a four-way set associative cache, each tag maps to four possible cache locations in a set. Upper address bits in the requested address will uniquely identify which item in the set the tag is referencing.
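A four-way set associative lookup might be sketched as follows. The number of sets, the use of line addresses rather than byte addresses, and the least recently used replacement policy are all assumptions for illustration; the text above does not specify a replacement policy.

```python
from collections import OrderedDict


class SetAssociativeCache:
    """Toy N-way set associative cache keyed by line address (illustrative only)."""

    def __init__(self, num_sets=4, ways=4):
        self.num_sets = num_sets
        self.ways = ways
        # One ordered mapping per set: tag -> data, oldest entry first.
        self.sets = [OrderedDict() for _ in range(num_sets)]

    def access(self, line_address, data=None):
        """Return True on a hit; on a miss, fill the line and return False."""
        index = line_address % self.num_sets   # selects the set
        tag = line_address // self.num_sets    # identifies the line within the set
        entries = self.sets[index]
        if tag in entries:
            entries.move_to_end(tag)           # mark as most recently used
            return True
        if len(entries) >= self.ways:
            entries.popitem(last=False)        # evict least recently used (assumed policy)
        entries[tag] = data
        return False


cache = SetAssociativeCache(num_sets=4, ways=4)
same_set = (0, 4, 8, 12)                          # all four map to set 0
first_pass = [cache.access(a) for a in same_set]  # four misses
second_pass = [cache.access(a) for a in same_set] # four hits: the lines coexist
fifth_line = cache.access(16)                     # miss; evicts line 0 from set 0
line_zero_again = cache.access(0)                 # miss, because line 0 was evicted
```

Four lines that would collide in a direct mapped cache can coexist in one set here; only a fifth line mapping to the same set forces an eviction.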
Superscalar processors achieve higher performance by executing many instructions simultaneously. These instructions generate multiple memory loads or stores per cycle. Conventional processors use several techniques to allow coherent and parallel access to the cache and memory hierarchy. One technique, used commonly at the lowest level of cache access, provides duplicate copies of the cache. Duplicating the cache doubles the chip area consumed as compared to a single cache copy. Increased size also tends to limit clock speeds, so this technique is limited to small caches and typically enables only two cache copies and two accesses per cycle.
Another technique involves using high speed circuitry to allow two or more cache accesses per processor clock cycle. This approach assumes that the processor clock is slow enough that the cache can be clocked at a multiple of it. In practice, however, performance demands force the processor clock rate to be increased to the point that the cache cannot be clocked sufficiently faster than the processor for this technique to offer a significant advantage.
A similar technique is to provide multiple banks, with each bank serving a particular set of main memory addresses. While this technique is adaptable to larger cache sizes, it too has limited scalability. Multi-bank caches, like duplicate caches, tend to limit clock speeds. Multiple banks are successfully used to enable multiple accesses per clock cycle, but have performance limits caused by address conflicts. Address conflicts arise when two cache accesses attempt to access the same bank at the same time.
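The effect of address conflicts in a multi-bank cache can be sketched as follows. The bank count, the line size, the interleaving of consecutive lines across banks, and the function names are assumptions chosen for illustration only.

```python
NUM_BANKS = 4    # assumed number of cache banks
LINE_BYTES = 32  # assumed cache line size in bytes


def bank_of(address):
    """Map an address to a bank by interleaving consecutive lines across banks."""
    return (address // LINE_BYTES) % NUM_BANKS


def schedule(accesses):
    """Split one cycle's accesses into those issued now and those deferred by a conflict."""
    busy_banks = set()
    issued, deferred = [], []
    for address in accesses:
        bank = bank_of(address)
        if bank in busy_banks:
            deferred.append(address)   # bank already in use this cycle
        else:
            busy_banks.add(bank)
            issued.append(address)
    return issued, deferred


# 0x000 and 0x080 both fall in bank 0, so only one of them can issue this cycle.
issued, deferred = schedule([0x000, 0x020, 0x080])
```

In this sketch two of the three accesses proceed in parallel, while the third waits for a later cycle because its bank is already busy.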
In a pipelined hierarchical cache system that generates multiple cache accesses per clock cycle, coordinating data traffic between the different cache levels is problematic. For example, when a first access to a given cache line results in a miss, the access is sent on to be serviced by a higher cache level or main memory. When the first access is completed, the cache line becomes valid. In typical cache organizations, after the cache line becomes valid, it is forwarded to a lower cache level or device that generated the original access. The cache line fill operation needs to be synchronized with the return data, but the lower level cache executing the line fill operation cannot predict when the required data will be returned.
"Blocking" cache designs prohibit or "block" cache activity until a miss has been serviced by a higher cache level or main memory, and the line fill operation is completed. In this case, subsequent cache accesses are stalled until the first missed access is complete. One drawback of a blocking cache is that the memory pipeline will be stalled while the cache miss is serviced, slowing memory access and reducing overall processor performance.
On the other hand, when one or more levels of the cache memory subsystem are "non-blocking", each cache level is unaware of the results of the accesses (i.e., hit or miss) at the next higher level of the hierarchy. In a non-blocking cache, a cache miss generates a line fill operation that will eventually be serviced; in the meantime, the cache continues to accept access requests from lower cache levels or functional units in a processor.
In the prior art, a first miss to a cache can force the processor to wait until the miss has been completely serviced. In a heavily pipelined, superscalar issue processor having multiple functional units executing several instructions per cycle, it is possible to have multiple instructions in flight in the machine at any time. Typically, approximately 35% of all operations in a modern computer are memory operations. It is possible that several of these memory operations may have produced cache misses, thereby saturating the resources within the cache memory subsystem. Therefore, handling overflow conditions within a multi-level, non-blocking cache hierarchy can be problematic.
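The non-blocking behavior and the resource saturation described above can be sketched with a toy model. This is not the architecture of the invention: the class name, the limit of two outstanding misses, the merging of repeated misses to the same line, and the "retry" response on overflow are all assumptions made for illustration.

```python
class NonBlockingCache:
    """Toy non-blocking cache that tracks outstanding misses (illustrative only)."""

    def __init__(self, max_outstanding=2):
        self.lines = {}       # line address -> data for valid lines
        self.pending = {}     # line address -> request ids waiting on a line fill
        self.max_outstanding = max_outstanding  # assumed miss-tracking capacity

    def access(self, request_id, line):
        if line in self.lines:
            return "hit"
        if line in self.pending:
            self.pending[line].append(request_id)  # join the fill already in flight
            return "miss-merged"
        if len(self.pending) >= self.max_outstanding:
            return "retry"                         # miss resources saturated
        self.pending[line] = [request_id]          # start a new line fill
        return "miss"

    def fill(self, line, data):
        """Complete a line fill and return the request ids that were waiting on it."""
        self.lines[line] = data
        return self.pending.pop(line, [])


cache = NonBlockingCache(max_outstanding=2)
cache.lines[7] = "x"                   # pre-load one valid line
outcomes = [
    cache.access(1, 3),                # miss: line fill started
    cache.access(2, 7),                # hit, served while the miss is outstanding
    cache.access(3, 3),                # second miss to the same line is merged
    cache.access(4, 5),                # second outstanding line fill
    cache.access(5, 9),                # no tracking resources left
]
woken = cache.fill(3, "y")             # returned data wakes requests 1 and 3
after_fill = cache.access(5, 9)        # a tracking slot is free again
```

A hit is serviced while earlier misses are still outstanding, and the fifth access shows the overflow condition that arises once the miss-tracking resources are exhausted.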
Further, with multi-level cache designs, management and coordination of the interactions within and between cache levels can be complex and problematic.
What is needed is an architecture and a method for operating a hierarchical non-blocking cache memory subsystem which is compatible with high speed instruction processing and memory access.